Source: bioinformatics.psb.ugent.be
Installing ForCon
After downloading ForCon, you have a .zip file. Unzip it into a temporary directory using WinZip or PKUNZIP, then double-click the setup.exe file and follow the instructions.
General description
ForCon is a user-friendly software tool for the easy conversion of nucleic and amino acid sequence alignments into different formats.
At the moment, ForCon can both read and write the following formats (or the formats used by the following software packages):
- CLUSTAL
- EMBL
- FASTA
- GCG/MSF
- Hennig86
- MEGA
- NBRF/PIR
- PAUP/Nexus
- Parsimony Jackknifer
- PHYLIP
- TREECON
Software packages not included in the list are usually able to read one of the formats mentioned. For the publication of sequence alignments, a format with codon positions can be generated ("Pretty").
Sequential and interleaved formats are supported by ForCon (see next paragraph).
File formats
The use of correct formats is extremely important: incorrect formats cannot be correctly interpreted by the program. For this reason, a description and an example of each format are presented below.
Overall, two major types of formats exist: interleaved and noninterleaved (sequential). In the interleaved format, sequences are written in the form of an alignment, in consecutive blocks that each hold one segment of every sequence. In the sequential format, each sequence is written out in full before the next one begins.
Usually the symbol for missing data is 'N' (nucleotides) or 'X' (proteins). For insertions/deletions ('gaps') the most commonly used symbol is a hyphen '-'.
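The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration and not ForCon's own code; the function names `to_interleaved` and `to_sequential` and the `block_width` parameter are hypothetical.

```python
def to_interleaved(records, block_width=30):
    """Split full-length aligned sequences into interleaved blocks.

    records: list of (name, sequence) pairs of equal sequence length.
    Returns a list of blocks; each block is a list of (name, chunk) pairs.
    """
    if not records:
        return []
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("aligned sequences must all have the same length")
    blocks = []
    for start in range(0, length, block_width):
        # One block holds the same column range of every sequence.
        blocks.append([(name, seq[start:start + block_width])
                       for name, seq in records])
    return blocks


def to_sequential(blocks):
    """Rejoin interleaved blocks into one full-length sequence per name."""
    joined = {}
    for block in blocks:
        for name, chunk in block:
            joined[name] = joined.get(name, "") + chunk
    return list(joined.items())
```

Converting to blocks and back should return the original records, which is a convenient sanity check for any interleaved writer.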
Regarding the different formats:
1) CLUSTAL
CLUSTAL is a program for creating sequence alignments. The CLUSTAL format can be described as follows:
- the word CLUSTAL should be on the first non-space line of the file
- the alignment is displayed in blocks of a fixed length
- each line in the block corresponds to one sequence
- the line starts with the sequence name (of any length), followed by at least one space character
- then the sequence itself is displayed (upper- or lowercase; '-' for gaps; optionally with the residue number at the end)
- in between blocks: a line with conservation info (ForCon only writes stars for now; for more info: https://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/#G )
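The rules above can be turned into a small reader. The following Python sketch is an illustration only, not ForCon's implementation; the function name `read_clustal` is hypothetical, and it assumes that sequence names contain no spaces and that conservation lines always begin with whitespace.

```python
def read_clustal(text):
    """Return {name: sequence} from CLUSTAL-formatted text."""
    lines = text.splitlines()
    # Find the first non-blank line; it must carry the word CLUSTAL.
    idx = 0
    while idx < len(lines) and not lines[idx].strip():
        idx += 1
    if idx == len(lines) or not lines[idx].lstrip().startswith("CLUSTAL"):
        raise ValueError("first non-space line must start with 'CLUSTAL'")

    sequences = {}
    for line in lines[idx + 1:]:
        # Blank lines separate blocks; lines that begin with whitespace
        # are assumed to be conservation lines (stars) and are skipped.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[2], if present, is the optional residue number.
        name, chunk = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + chunk
    return sequences
```

Because each block repeats the sequence names, concatenating the chunks per name rebuilds the full alignment.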
Example :
2) EMBL
The EMBL database is the primary nucleotide database in Europe. The format is described in detail at: https://www.ebi.ac.uk/ebi_docs/embl_db/usrman/structure_entry.html
Multiple sequence files also follow these rules. The entries are separated by the '//' line that ends each entry. ForCon uses only the information that is relevant to multiple sequence alignments.
Example ( as generated by ForCon; for input, any EMBL file is allowed ):
3) FASTA
The FASTA program is used for database searches. The format is described at : https://www.ncbi.nlm.nih.gov/BLAST/fasta.html
Example:
4) GCG/MSF
The Multiple Sequence File format by the Genetics Computer Group Wisconsin package is thoroughly described in their user manual. In brief:
- on the first line (optional): a file type identifier such as '!!AA_MULTIPLE_ALIGNMENT 1.0', '!!NA_MULTIPLE_ALIGNMENT 1.0' or 'PileUp'
- second line: optional title/description
- dividing line with obligatory 'MSF: sequence length', checksum value and two points '..'
- name/weight section with checksum
- separating line: //
- alignment: interleaved
Example ( as generated by ForCon )
5) Hennig86
The parsimony phylogeny program by Farris uses an unusual format: the different IUPAC nucleotide letter codes are replaced by a number code. ForCon uses the following standard translation :
A | to: | 0
U,T | to: | 1
G | to: | 2
C | to: | 3
N | to: | ?
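The translation above amounts to a simple lookup. The Python sketch below is illustrative and not ForCon's code; the names `HENNIG86_CODES`, `encode_hennig86` and `decode_hennig86` are hypothetical, and mapping every other symbol (including gaps) to '?' is an assumption.

```python
# Standard translation listed above: nucleotide letter -> number code.
HENNIG86_CODES = {"A": "0", "U": "1", "T": "1", "G": "2", "C": "3", "N": "?"}


def encode_hennig86(sequence, codes=HENNIG86_CODES, unknown="?"):
    """Translate nucleotide letters into Hennig86 number codes.

    Symbols missing from the table are written as `unknown`
    (an assumption made for this sketch).
    """
    return "".join(codes.get(base, unknown) for base in sequence.upper())


def decode_hennig86(coded, codes):
    """Translate number codes back using a caller-supplied table.

    `codes` maps each code to a letter, e.g. {"0": "A", "1": "U", ...};
    codes missing from the table are passed through unchanged.
    """
    return "".join(codes.get(symbol, symbol) for symbol in coded)
```

The reverse direction takes the table as an argument because U and T share the code 1, so the choice between them has to come from the user.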
When converting from the Hennig86 format, the user is prompted to enter his or her own translation preferences.
The format is sequential. The first line contains the word 'xread', which is used to recognize the file. On the following line, a title/description can be placed between single quotes. The third line consists of the sequence length and the number of sequences. After the alignment (in sequential format), the file is closed by a semicolon (;). The symbol used for missing data is '?'. There is no separate character for gaps.
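The file layout just described can be sketched as a small writer. This is a hedged illustration, not ForCon's output routine; the function name `write_hennig86` is hypothetical, and placing each name and its coded sequence on one line separated by a space is an assumption.

```python
def write_hennig86(title, records):
    """Build Hennig86 text from (name, coded_sequence) pairs.

    The sequences are expected to be already translated to number codes.
    """
    if not records:
        raise ValueError("at least one sequence is required")
    nchar = len(records[0][1])
    if any(len(coded) != nchar for _, coded in records):
        raise ValueError("all sequences must have the same length")

    lines = [
        "xread",                      # recognition word on the first line
        f"'{title}'",                 # title between single quotes
        f"{nchar} {len(records)}",    # sequence length, number of sequences
    ]
    for name, coded in records:
        lines.append(f"{name} {coded}")   # sequential: one full sequence each
    lines.append(";")                     # the file is closed by a semicolon
    return "\n".join(lines) + "\n"
```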
Example:
6) MEGA
The Molecular Evolutionary Genetics Analysis program by Kumar, Tamura & Nei is a tree construction program based on distance and parsimony methods. The format is described in the MEGA manual. In brief:
The format exists in an interleaved and a noninterleaved variant. Regardless of the variant, the file always starts with the word '#mega' on the first line. On the following line, a title can be stated, preceded by the term 'TITLE:'. Between the title and the sequence data, a description or extra comments can be placed. Even inside the sequences, comments are allowed between double quotes (""). The sequence names are preceded by a '#'.
Examples:
#mega
TITLE: Four Anthropoidea

The interleaved format

#Homo_sapiens    AGUCGAGUC---GCAGAAACGCAUGAC-GACC
#Pan_paniscus    AGUCGCGUCG--GCAGAAACGCAUGACGGACC
#Gorilla_gorilla AGUCGCGUCG--GCAGAUACGCAUCACGGAC-
#Pongo_pigmaeus  AGUCGCGUCGAAGCAGA--CGCAUGACGGACC

#Homo_sapiens    ACAUUUU-CCUUGCAAAG
#Pan_paniscus    ACAUCAU-CCUUGCAAAG
#Gorilla_gorilla ACAUCAUCCCUCGCAGAG
#Pongo_pigmaeus  ACAUCAUCCCUUGCAGAG
---
#mega
TITLE: Four Anthropoidea

The noninterleaved format

#Homo_sapiens    AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG
#Pan_paniscus    AGUCGCGUCG--GCAGAAACGCAUGACGGACCACAUCAU-CCUUGCAAAG
#Gorilla_gorilla AGUCGCGUCG--GCAGAUACGCAUCACGGAC-ACAUCAUCCCUCGCAGAG
#Pongo_pigmaeus  AGUCGCGUCGAAGCAGA--CGCAUGACGGACCACAUCAUCCCUUGCAGAG
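Both MEGA layouts shown above can be read with the same logic, because a repeated name simply continues the earlier sequence. The Python sketch below is an illustration, not ForCon's parser; the function name `read_mega` is hypothetical, and it assumes that each piece of sequence sits on the same line as its '#name'.

```python
import re


def read_mega(text):
    """Return (title, {name: sequence}) from MEGA-formatted text."""
    lines = text.splitlines()
    if not lines or lines[0].strip().lower() != "#mega":
        raise ValueError("a MEGA file must start with '#mega'")

    title = ""
    sequences = {}
    for line in lines[1:]:
        stripped = line.strip()
        if stripped.upper().startswith("TITLE:"):
            title = stripped[len("TITLE:"):].strip()
        elif stripped.startswith("#"):
            # Drop comments placed between double quotes inside the data.
            stripped = re.sub(r'"[^"]*"', "", stripped)
            fields = stripped[1:].split()
            if not fields:
                continue
            name, chunk = fields[0], "".join(fields[1:])
            # A name seen before (interleaved layout) extends its sequence.
            sequences[name] = sequences.get(name, "") + chunk
        # Any other line is description text and is ignored.
    return title, sequences
```

Reading the interleaved and the noninterleaved version of the same alignment should give identical results.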
7) NBRF/PIR
The format of this large protein database is similar to the FASTA format. Each sequence, though, starts with '>[sequence type code];' followed by the sequence name, with a description on the next line. ForCon ignores this description. The actual sequence is written on the following line and ends with an asterisk (*).
The sequence type codes are as follows:
Code | Sequence type
P1 | Protein (complete)
F1 | Protein (fragment)
DL | DNA (linear)
DC | DNA (circular)
RL | RNA (linear)
RC | RNA (circular)
N3 | tRNA
N1 | other functional RNA
ForCon accepts all these codes, but only writes the codes P1, DL and RL.
Example :
>RL;Homo sapiens
Homo sapiens RNA sequence
AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG*
>RL;Pan paniscus
Pan paniscus RNA sequence
AGUCGCGUCG--GCAGAAACGCAUGACGGACCACAUCAU-CCUUGCAAAG*
>RL;Gorilla gorilla
Gorilla gorilla RNA sequence
AGUCGCGUCG--GCAGAUACGCAUCACGGAC-ACAUCAUCCCUCGCAGAG*
>RL;Pongo pigmaeus
Pongo pigmaeus RNA sequence
AGUCGCGUCGAAGCAGA--CGCAUGACGGACCACAUCAUCCCUUGCAGAG*
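The three-part NBRF/PIR record (header, description, sequence ending in '*') can be sketched as a writer and a matching reader. This is illustrative Python rather than ForCon's code; the function names `write_pir` and `read_pir` are hypothetical, and the reader assumes that exactly one description line follows each header.

```python
def write_pir(records, type_code="RL"):
    """Build NBRF/PIR text from (name, description, sequence) triples."""
    out = []
    for name, description, sequence in records:
        out.append(f">{type_code};{name}")   # '>[sequence type code];' + name
        out.append(description)              # description on the next line
        out.append(sequence + "*")           # sequence ends with an asterisk
    return "\n".join(out) + "\n"


def read_pir(text):
    """Return a list of (type_code, name, sequence); descriptions are skipped."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        type_code, _, name = line[1:].partition(";")
        next(lines, "")                      # description line, ignored
        chunks = []
        for seq_line in lines:               # sequence may span several lines
            chunks.append(seq_line.strip())
            if seq_line.rstrip().endswith("*"):
                break
        records.append((type_code, name.strip(), "".join(chunks).rstrip("*")))
    return records
```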
8) PAUP/NEXUS
The NEXUS format is used by several programs, such as PAUP, MacClade and Spectrum. For a detailed description of the format, I refer to the article by Maddison et al.:
Maddison, D.R., Swofford, D.L., Maddison, W.P. (1997) NEXUS: An extensible file format for systematic information. Syst. Biol. 46, 590-621.
ForCon makes only limited use of this extremely versatile format. Only the information on the alignment itself is used and generated, although any NEXUS file can be used as an input file; the program ignores all information that it does not use. Here are examples of NEXUS files generated by ForCon, first in the sequential and then in the interleaved layout:
#NEXUS
[TITLE: Four Anthropoidea]

begin data;
dimensions ntax=4 nchar=50;
format datatype=RNA missing=N gap=-;

matrix
Homo_sapiens    AGUCGAGUC---GCAGAAACGCAUGAC-GACCACAUUUU-CCUUGCAAAG
Pan_paniscus    AGUCGCGUCG--GCAGAAACGCAUGACGGACCACAUCAU-CCUUGCAAAG
Gorilla_gorilla AGUCGCGUCG--GCAGAUACGCAUCACGGAC-ACAUCAUCCCUCGCAGAG
Pongo_pigmaeus  AGUCGCGUCGAAGCAGA--CGCAUGACGGACCACAUCAUCCCUUGCAGAG
;
endblock;
begin assumptions;
options deftype=unord;
---
#NEXUS
[TITLE: Four Anthropoidea]

begin data;
dimensions ntax=4 nchar=50;
format interleave datatype=RNA missing=N gap=-;

matrix
Homo_sapiens    AGUCGAGUC---GCAGAAACGCAUGAC-GAC
Pan_paniscus    AGUCGCGUCG--GCAGAAACGCAUGACGGAC
Gorilla_gorilla AGUCGCGUCG--GCAGAUACGCAUCACGGAC
Pongo_pigmaeus  AGUCGCGUCGAAGCAGA--CGCAUGACGGAC

Homo_sapiens    CACAUUUU-CCUUGCAAAG
Pan_paniscus    CACAUCAU-CCUUGCAAAG
Gorilla_gorilla -ACAUCAUCCCUCGCAGAG
Pongo_pigmaeus  CACAUCAUCCCUUGCAGAG
;
endblock;
begin assumptions;
options deftype=unord;
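A data block like the sequential example above can be produced with a short writer. The Python sketch below is only an illustration, not ForCon's code; the function name `write_nexus` and its parameters are hypothetical, and it writes the data block only, leaving out the assumptions block.

```python
def write_nexus(title, records, datatype="RNA", missing="N", gap="-"):
    """Build a sequential NEXUS data block from (name, sequence) pairs."""
    if not records:
        raise ValueError("at least one sequence is required")
    nchar = len(records[0][1])
    if any(len(seq) != nchar for _, seq in records):
        raise ValueError("all sequences must have the same length")

    width = max(len(name) for name, _ in records)
    lines = [
        "#NEXUS",
        f"[TITLE: {title}]",
        "",
        "begin data;",
        f"dimensions ntax={len(records)} nchar={nchar};",
        f"format datatype={datatype} missing={missing} gap={gap};",
        "",
        "matrix",
    ]
    for name, seq in records:
        # Pad the names so that the sequences line up in one column.
        lines.append(f"{name.ljust(width)} {seq}")
    lines += [";", "endblock;"]
    return "\n".join(lines) + "\n"
```

The ntax and nchar values are derived from the records rather than passed in, so the dimensions line cannot disagree with the matrix.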