Align, Trim and Filter UCE Sequences and Alignments
This script is an automatized workflow for aligning, triming and filtering alignments, ideally obtained from UCE data, but can be used on any type of alignments.
phyluce_align_seqcap_align
from PhyluceIf you use this program please cite [Calisoga Paper][5]
Calisoga Paper
and Phyluce:
Faircloth BC. 2015. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics. doi: 10.1093/bioinformatics/btv646.
FUSe requires for phyluce to be installed in your system, if phyluce is properly working, FUSe should be able to run without installing any further dependencies. Download or clone this repository and use FUSe.py as a stand-alone program. Alternatevely, you can copy FUSe.py in your phyluce bin folder located in: /User/miniconda#/envs/phyluce-X.x.x/bin/
where '#' is your conda version and "X.x.x" is your version of phyluce. This way, you can call FUSe directly from your phyluce environment.
cp FUSe.py /User/miniconda#/envs/phyluce-X.x.x/bin/
chmod 775 /User/miniconda#/envs/phyluce-X.x.x/bin/FUSe.py
For installing phyluce, detailed intructions can be found here
Command line interface, to deploy the menu and options available type:
python FUSe.py -h
usage: FUSe [-h] -i MONOLITHIC_FILE -t TAXA [-p PREFIX]
[-o {fasta,nexus,phylip,clustal,emboss}] [-c CORES] [--trimAL]
[-a {automated1,gappyout,strict,strictplus,nogaps}] [--gblocks]
[--b1 B1] [--b2 B2] [--b3 B3] [--b4 B4] [--remove-div]
[-d DIVERGENT] [--remove-short] [-s SHORT_CUTOFF]
[--filter-alignments] [-l MIN_LENGTH] [-m MIN_TAXA]
[--get-completeness] [-e PERCENT] [--taxa-count] [-v]
Align, Trimm and Filter UCE Sequences and Alignments.
optional arguments:
-h, --help show this help message and exit
-p PREFIX, --prefix PREFIX
The name to be used for all output folders (default:
OUTPUT).
-o {fasta,nexus,phylip,clustal,emboss}, --out-format {fasta,nexus,phylip,clustal,emboss}
The ouput alignment format (default: fasta).
-c CORES, --cores CORES
The number of PHYSICAL CPUs (default: 1).
--trimAL Wether to trim alignments using trimal.
-a {automated1,gappyout,strict,strictplus,nogaps}, --t-method {automated1,gappyout,strict,strictplus,nogaps}
trimAl automated method to use (default: automated1).
--gblocks Wether to trim alignments using gblocks.
--b1 B1 The GBLOCKS -b1 proportion (default: 0.5)
--b2 B2 The GBLOCKS -b2 proportion (default: 0.70)
--b3 B3 The GBLOCKS -b3 integer value (default: 10)
--b4 B4 The GBLOCKS -b4 integer value (default: 4)
--remove-div Wether to remove divergent sequences from alignments.
-d DIVERGENT, --divergent DIVERGENT
Percentage of pairwise identity in every sequence to
retain (default: 0.7).
--remove-short Wether to remove short sequences from alignments.
-s SHORT_CUTOFF, --short-cutoff SHORT_CUTOFF
Percentage of "-" (gaps) in every sequece to retain
(default: 0.7).
--filter-alignments Wether to filter alignments by No. taxa and length in
bp.
-l MIN_LENGTH, --min-length MIN_LENGTH
The minimum alignment lenght to retain alignment
(default: 50).
-m MIN_TAXA, --min-taxa MIN_TAXA
The minimum number of taxa in alignments to retain
(default: 4).
--get-completeness Wether to filter alignments by completeness.
-e PERCENT, --percent PERCENT
Completeness matrix percentage to output (default:
0.8).
--taxa-count Wether to print taxa count in alignments summary.
-v, --version show program's version number and exit
Required arguments:
-i MONOLITHIC_FILE, --input MONOLITHIC_FILE
The input monolithic fasta file.
-t TAXA, --taxa TAXA The total number of taxa in all alignments.
FUSe takes as input a single Monolithic FASTA file, containing all your sequences for all your loci; when using phyluce for processing UCEs this is usually the resulted file from phyluce_assembly_get_fastas_from_match_counts
.
Alternatevely, if you don't have a monolithic FASTA file but instead individual loci files that needs to be processed with FUSe we can use the next snippet to get a monolithic file:
We will use phyluce_align_explode_alignments
to explode files and get individual fasta files per sample containing all loci asociated to that particular sample, assuming that your loci files are stored in a folder named "myFiles", we will write new files to "myExploded" folder:
phyluce_align_explode_alignments \
--alignments myFiles/ \
--output myExploded \
--input-format fasta \
--by-taxon \
--include-locus
NOTE: Make sure to include flag --include-locus
in order to associate sequences with each locus file, otherwise this will combine all the loci in future steps.
The content of "myExploded" directory should be multiple fasta files, one for each sample in our loci files, next we just need to concatenate all samples in a single monolithic file, in this case named "myMonolithic.fasta":
cat myExploded/*.fasta > myMonolithic.fasta
This monolithich file can be used as input for FUSe
The following are the only required arguments for running the pipeline, if the user provides only those parameters the entire pipeline will run with default parameters (Check Filtering Options section for more details).
The order of the workflow is as listed above, the pipline will always start by attepting to align, followed by trimming and filtering; If the pipeline is run using default parameters the final output of FUSe will be Multiple Sequence Alignments (MSAs). Different steps of the workflow can be avoided but the remaning steps will be performed always in the same order, for example if we donot desired to trim alignments, the pipeline will start aligning followed by removing divergent sequencesand so on.
NOTE: Current version of FUSe doesn't support inputting previuously aligned MSAs for performing filtering or other operations.
In order to activate the different parts of the workflow we need to provide with the desired flags (see more detailed Examples below), for example if we want to align and filter sequences, we will have to provide the flag --filter-alignments
.
Current version of FUSe only supports aligning with MAFFT.
FUSe supports two types of trimming:
Trimming | Options | Default | Reference |
---|---|---|---|
--trimAL |
-a automated1, gappyout, strict, strictplus, nogaps |
-a automated1 |
trimAL |
--gblocks |
-b1 B1, -b2 B2, -b3 B3, -b4 B4 |
-b1 0.5 , -b2 0.7 , -b3 10 , -b4 4 |
Gblocks |
NOTE: Some recent versions of phyluce doesn't support gblocks, this seems to be an issue with new operating systems, if you have a problem runing gblocks try running the pipeline using trimAL.
This option is intended to remove outlier sequences resulted from poor alignment, this could be due to sequences of poor quality, contamination or in some cases due to paralogy. This option is implemented in a similar approach as in the software CIAlign. It calculates pairwise sequence identity across all samples of the same alignment and uses an user defined threshold percentage for removing samples. Defuault parameter (70%) will remove any sequence with a pairwise identity higher than 70, i.e will retain any sequence that is 70% or less divergent that the majority of the alignment. thershold can be modify using the flag --divergent
NOTE: If you are dealing with very divergent samples/species is advised to reduce this parameter accordingly.
This option is intended to remove extemely short sequences from alignments, in the case of UCEs, sometimes some sequences targes only the core region of the UCE and as a result they are extremely short and lack of information for phylogenetics. This options will remove those sequences shorter than a percentage of the total alignment lenght. The function works by calculating the amount of "gaps" ("N" or "-" characters) present in the sequence, for instance the default (70%) will remove any sequence composed of a 70% of ambiguous characters (gaps). thershold can be modify using the flag ---short-cutoff
NOTE: For any of the options above, keep in mind that FUSe will check that after removing sequences there are no columns left containing only "gaps".
This option is intended to filter out entire alignments based on its composition which might not be useful for downstream analysis, by setting a minimum alingmenth lenght (in bp) to retain with --min-length
and by the minimum number of taxa present in your alignments to retain --min-taxa
. Default parameters are --min-length
= 50bp and --min-taxa
= 4 taxa.
This option filter out alignments based on the total number of taxa present in your entire data set and creates "completeness matrix". This option is usually useful for exploring the effect of missing data on your phylogenetic reconstruction, but can have other utilities, default option --percent
= 0.8, this mean will filter out alignments containing less than 80% of your total number of taxa.
FUSe.py also provide with a count of alignments per taxon --taxa-count
. This options is a summary of your final alignments folder and can be used to diagnose any sample with very low number of loci overall, that might result in extreme long branches as a result of missing data. Another way of using this function is to check for taxa counts before getting completeness matrices, this way we can remove low input samples (high missing data) and increase the number of loci in our completeness matrix.
FUSe creates a series of directories to store the files processed during different parts of the workflow, this directories are stored inside a main folder named "OUTPUT-Alignments" (the prefix "OUTPUT" can be changed by using a different prefix with flag -p
). FUSe organize the output according with the order of the workflow listed above, then the first step of the workflow will save the aligned files in "OUTPUT-01-Mafft", trimmed sequences in "OUTPUT-02-trimmed", filtered sequences in "OUTPUT-05-filtered" and so on. If some parts of the workflow are skipped, then no folder for this step will be created.
FUSe works with FASTA formatted files troughout the workflow for convenience of coding, however you can output you final alignments in any of the formats available (fasta,nexus,phylip,clustal,emboss) using the flag -o
. This alignments will be saved in "OUTPUT--Final-Alignments-{format}" where {format} is the file formated desired.
NOTE: Keep in mind that files from previous steps in the workflow will be in FASTA format, this is an expected behavior of the pipeline, aligments can always be converted to specific formats using phyluce_align_convert_one_align_to_another
or any other software.
This is the most basic function in FUSe will read a input monolithic file and will perform aligning of multiple sequences. The output will be a directory with all loci and samples included in your original monolithic file (no filtering).
Assuming you have a monolithic fasta file named my_monolithic.fasta
with 15 loci and 20 samples and you system can run up to 12 cores:
FUSe.py \
-i my_monolithic.fasta
-t 20 \
-p MyOutput \
-c 12
In this command we are telling FUSe.py that ourt data set is composed of 20 taxa (-t
) the output prefix we want to use for all created directories and files is MyOutput
and we want to run the pipeline with 12 cores (-c
).
The output of the command above will be something like this:
.
├── FUSe.log
├── MyOutput-Alignments
│ └── MyOutput-01-mafft
│ ├── uce-1484.fasta
│ ├── uce-1732.fasta
│ ├── uce-1985.fasta
│ ├── uce-2120.fasta
│ ├── uce-2349.fasta
│ ├── uce-2539.fasta
│ ├── uce-3046.fasta
│ ├── uce-4179.fasta
│ ├── uce-553.fasta
│ ├── uce-6095.fasta
│ ├── uce-6187.fasta
│ ├── uce-6550.fasta
│ ├── uce-7014.fasta
│ ├── uce-7542.fasta
│ └── uce-865.fasta
├── my_monolithic.fasta
└── phyluce_align_seqcap_align.log
Here FUSe.log
print all the information of the run including all the flags used and parameters, including default values
This is a good way of diagnosing any issues with the program. phyluce_align_seqcap_align.log
is the log file for phyluce alignment script. hodling crucial information about the alignment step.
NOTE: Given that we only run alignment (the first step out of five in FUSe) we only get the MyOutput-01-mafft
folder.
The opposite extreme will be to include all possible filters available in FUSe, by sample and by locus. Assuming the exact same input monolithic fasta as before us used:
FUSe.py \
-i my_monolithic.fasta -t 20 \
-p MyOutput -c 12 \
--gblocks \
--remove-div \
--remove-short \
--filter-alignments \
--get-completeness
NOTE: The flags here works as a way of "activating" the different stages of the workflow using default parameters, if we are interested in changing such parameters, specific flags have to be provided (see example below).
The output directory structure looks like this:
.
├── FUSe.log
├── MyOutput-Alignments
│ ├── MyOutput-01-mafft
│ ├── MyOutput-02-trimmed
│ ├── MyOutput-03-divergent-removed
│ ├── MyOutput-04-short-removed
│ ├── MyOutput-05-filtered
│ └── MyOutput-80p
├── my_monolithic.fasta
└── phyluce_align_seqcap_align.log
Note that all 5 directories of the all stages of the workflow have been created, with the folder MyOutput-80p
holding the completeness matrix with a default parameter of 80% completeness, once again FUSe.log has all parameters and information of the run.
In the next example we are going to use the entire workflow but we are going to align this time using trimAL with the gappyout
algorithm.
Then we are going to relax the parameters for divergence, allowing a pairwise identity of 0.6, this mean we are keeping samples that are 60% divergent from each other, i.e. with a "higher" degree of disimilarity (for example because we are dealing with very distant related species).
Next we will allowed for sequences as short as the 50% of the total alignment, this means that some sequences will have as much as 50% of their sequence missing (coded as "N" or "-").
We will increase the minimum lenght for retaining alignments to 200bp, this mean that we will filter out any alignment shorter than 200bp.
Finally we will request for a 70% complete matrix in nexus
format.
FUSe.py \
-i my_monolithic.fasta -t 20 \
-p MyOutput -c 12 \
--trimAL -a gappyout \
--remove-div -d 0.6 \
--remove-short -d 0.5 \
--filter-alignments -m 200 \
--get-completeness -e 0.7 \
--taxa-count
contact: roy_monrue@hotmail.com
The code within this repository is available under a 3-clause BSD license. See the LICENSE file for more information.
[5]: (Calisoga Paper reference Here)