Clustal W file format
Various programs in the MEME Suite accept as input a file containing a multiple alignment of protein or DNA sequences. These input files must be in CLUSTAL W format (usually identified by the suffix ".aln").
The format is very simple:
- The first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Other information in the first line is ignored.
- One or more empty lines.
- One or more blocks of sequence data. Each block consists of:
  - One line for each sequence in the alignment. Each line consists of:
    - the sequence name
    - white space
    - up to 60 sequence symbols
    - optionally, white space followed by a cumulative count of residues for that sequence
  - A line showing the degree of conservation for the columns of the alignment in this block.
  - One or more empty lines.
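The layout above can be read with a few lines of code. The following is a minimal sketch in Python, not the parser used by the MEME Suite; the function name `parse_clustalw` is hypothetical, and the sketch assumes that sequence names contain no white space and that a conservation line either starts with white space or consists only of the symbols "*", ":" and ".".

```python
def parse_clustalw(text):
    """Parse CLUSTAL W alignment text into {name: aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(("CLUSTAL W", "CLUSTALW")):
        raise ValueError('first line must start with "CLUSTAL W" or "CLUSTALW"')

    sequences = {}  # dicts preserve insertion order (Python 3.7+)
    for line in lines[1:]:
        # Empty lines separate blocks; lines that begin with white space
        # are treated as conservation lines and skipped.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        # Assumption: a conservation line that lost its leading white space
        # still contains nothing but "*", ":" and ".".
        if all(set(field) <= set("*:.") for field in fields):
            continue
        if len(fields) < 2:
            raise ValueError(f"malformed sequence line: {line!r}")
        # fields[0] is the name, fields[1] the symbols; an optional
        # cumulative residue count in fields[2] is ignored.
        name, symbols = fields[0], fields[1].upper()  # case doesn't matter
        sequences[name] = sequences.get(name, "") + symbols
    return sequences
```

Each block contributes up to 60 symbols per sequence, so the sketch concatenates the pieces under the sequence name to rebuild the full aligned sequences.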
Some rules about representing sequences:
- Case doesn't matter.
- Sequence symbols should be from a valid alphabet.
- Gaps are represented using hyphens ("-").
- The characters used to represent the degree of conservation are:
  - "*" -- all residues or nucleotides in that column are identical
  - ":" -- conserved substitutions have been observed
  - "." -- semi-conserved substitutions have been observed
  - " " (a space) -- no match
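As an illustration of the conservation characters, the sketch below tallies how many columns of a conservation line fall into each category. It is written in Python; the names `CONSERVATION_LABELS` and `summarize_conservation` are hypothetical, and the function assumes it is given only the conservation columns (the padding under the sequence names already removed).

```python
from collections import Counter

# Meaning of each conservation character, as listed above.
CONSERVATION_LABELS = {
    "*": "identical",
    ":": "conserved",
    ".": "semi-conserved",
    " ": "no match",
}

def summarize_conservation(line, width=None):
    """Count the columns of a conservation line by category.

    `width` is the number of alignment columns in the block; when given,
    the line is padded with spaces in case trailing spaces were stripped.
    """
    if width is not None:
        line = line.ljust(width)
    counts = Counter()
    for symbol in line:
        if symbol not in CONSERVATION_LABELS:
            raise ValueError(f"unexpected conservation symbol: {symbol!r}")
        counts[CONSERVATION_LABELS[symbol]] += 1
    return dict(counts)
```

Passing `width` matters because a trailing run of "no match" columns is made of spaces, which many editors and tools remove from the end of a line.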
Here is an example of a multiple alignment in CLUSTAL W format:
CLUSTAL W (1.82) multiple sequence alignment

FOSB_MOUSE      MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
FOSB_HUMAN      MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
                ************************************************************

FOSB_MOUSE      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS 120
FOSB_HUMAN      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS 120
                ********************************.***************:*.**:******

FOSB_MOUSE      GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT 180
FOSB_HUMAN      GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT 180
                ****** ***** .**********************************************

FOSB_MOUSE      DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD 240
FOSB_HUMAN      DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD 240
                ************************************************************

FOSB_MOUSE      LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY 300
FOSB_HUMAN      LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY 300
                ****:.******.**************:*:**************************.***

FOSB_MOUSE      TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL 338
FOSB_HUMAN      TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL 338
                ***********************:**************
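
The numbers at the end of each sequence line in the example (60, 120, ..., 338) are running totals: the last block holds 38 residues, so its count is 300 + 38 = 338. The Python sketch below recomputes such totals from the per-block pieces of one sequence. The name `cumulative_counts` is hypothetical, and the sketch assumes that gap characters are not counted as residues.

```python
def cumulative_counts(chunks, gap="-"):
    """Return the running residue total after each block of one sequence.

    `chunks` holds the symbols of a single sequence, one string per block.
    Assumption: gap characters do not count as residues.
    """
    total = 0
    counts = []
    for chunk in chunks:
        total += sum(1 for symbol in chunk if symbol != gap)
        counts.append(total)
    return counts
```

Comparing these totals with the counts written in a file is a cheap consistency check, for example to detect a block that was truncated or pasted twice.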
Further information about the CLUSTAL format can be found in the CLUSTAL W documentation.