Pattern Syntax
Sequence patterns can be specified in three different formats:
PROSITE format
- The format was developed for the PROSITE sequence motif database.
- Amino acids are indicated by the standard IUPAC one-letter codes.
- Nucleotides are indicated by the IUPAC ambiguity codes.
(Note: Special JenaLib addition; the original PROSITE format was developed only for protein sequences.)
- The letter 'x' is used for a position where any amino acid is accepted.
- Ambiguities are indicated by listing the acceptable amino acids for a given position.
  - Positive list, in square brackets '[ ]'
    example:
    '[APS]' stands for 'Alanine or Proline or Serine'
  - Negative list, in curly brackets '{ }'
    example:
    '{VR}' stands for 'all amino acids except for Valine and Arginine'
- Each element in a pattern is separated from its neighbor by a hyphen '-'.
example:
'[APS]-x-{VR}'
(Note: For JenaLib pattern searches the hyphen is not necessary.)
- Repetition of an element of the pattern is indicated by adding a numerical value or range in parentheses '( )'.
examples:
'x(3)' stands for 'any 3 amino acids'
'x(2,4)' stands for 'any 2, 3, or 4 amino acids'
'A(3)' stands for '3 Alanines'
'[APS](3)' stands for '3 consecutive residues, each of which is Alanine, Proline, or Serine' (e.g. Ala-Ala-Pro)
- A pattern that should start at the beginning of the sequence (N-terminus, 5'-end) starts with a '<'.
example:
'<H(6)' stands for 'a sequence that starts with 6 Histidines'
- A pattern that should end at the end of the sequence (C-terminus, 3'-end) ends with a '>'.
example:
'A(3)>' stands for 'a sequence that ends with 3 Alanines'
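The rules above map fairly directly onto regular-expression syntax. The following is a minimal sketch of such a translation in Python; the function name `prosite_to_regex` is hypothetical and this is not JenaLib's actual converter. It assumes that only the lowercase letter 'x' is the "any residue" symbol and that the pattern is well formed (every bracket is closed).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Illustrative sketch only; it does not validate the contents of
    brackets or repetition counts.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":                      # element separator, carries no meaning
            i += 1
        elif ch == "<":                    # anchor at the start of the sequence
            out.append("^")
            i += 1
        elif ch == ">":                    # anchor at the end of the sequence
            out.append("$")
            i += 1
        elif ch == "x":                    # any single residue
            out.append(".")
            i += 1
        elif ch == "[":                    # positive list: [APS] -> [APS]
            end = pattern.index("]", i)
            out.append("[" + pattern[i + 1:end] + "]")
            i = end + 1
        elif ch == "{":                    # negative list: {VR} -> [^VR]
            end = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:end] + "]")
            i = end + 1
        elif ch == "(":                    # repetition: (3) -> {3}, (2,4) -> {2,4}
            end = pattern.index(")", i)
            out.append("{" + pattern[i + 1:end] + "}")
            i = end + 1
        elif ch.isalpha():                 # literal one-letter code
            out.append(ch)
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return "".join(out)

# Usage with a made-up sequence: a run of six Histidines at the N-terminus.
regex = prosite_to_regex("<H(6)")
found = re.search(regex, "HHHHHHMKV") is not None
```

Because hyphens are simply skipped, the sketch accepts patterns written with or without them, which matches the note above about JenaLib searches.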
Using wildcards
- Wildcards are special characters that stand in for other characters. They may be familiar to you from file name searches.
- The question mark '?' stands for any single amino acid or nucleotide.
example: 'AAA??PPP' stands for '3 Alanines, followed by any 2 amino acids, followed by 3 Prolines'
- The asterisk '*' stands for any number of amino acids or nucleotides, including zero.
example: 'LLL*CC' stands for '3 Leucines, followed by any number of amino acids (possibly none), followed by 2 Cysteines'
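The two wildcards correspond to simple regular-expression constructs. A minimal sketch in Python follows; the function name `wildcard_to_regex` is hypothetical and the code is an illustration of the idea, not JenaLib's implementation.

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate a wildcard pattern into a Python regular expression."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")              # exactly one residue
        elif ch == "*":
            parts.append(".*")             # zero or more residues
        else:
            parts.append(re.escape(ch))    # literal character
    return "".join(parts)

# Usage with made-up sequences.
two_any = re.search(wildcard_to_regex("AAA??PPP"), "GAAAKRPPPG")
zero_ok = re.search(wildcard_to_regex("LLL*CC"), "MLLLCCK")
```

Note that `.*` is greedy in Python, so when several candidate endings exist the match extends to the last one.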
Regular expression
- Regular expressions are somewhat similar to the PROSITE format described above, but they are much more powerful and are used more generally for all kinds of text.
- Since regular expressions are a very complex topic, no attempt is made to explain them here. For an introduction, see for example the Wikipedia article on regular expressions.
- Internally all other pattern formats (PROSITE, wildcards) are converted into a regular expression.
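As an illustration of searching with a regular expression directly, the sketch below uses Python's `re` module on a made-up sequence. The expression `A{3}.{2}P{3}` is a hand-written equivalent of the wildcard pattern 'AAA??PPP' (or the PROSITE pattern 'A(3)-x(2)-P(3)'); the helper name `find_matches` is hypothetical, and the regular-expression dialect JenaLib uses internally may differ from Python's.

```python
import re

def find_matches(regex: str, sequence: str):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 0-based and the end position is exclusive.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(regex, sequence)]

# "3 Alanines, followed by any 2 residues, followed by 3 Prolines"
hits = find_matches(r"A{3}.{2}P{3}", "MKAAAGSPPPLV")
print(hits)  # [(2, 10, 'AAAGSPPP')]
```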
Appendix
IUPAC one-letter codes for amino acids
Amino Acid | One-letter Code | Three-letter Code |
---|---|---|
Alanine | A | Ala |
Arginine | R | Arg |
Asparagine | N | Asn |
Aspartic Acid | D | Asp |
Cysteine | C | Cys |
Glutamic Acid | E | Glu |
Glutamine | Q | Gln |
Glycine | G | Gly |
Histidine | H | His |
Isoleucine | I | Ile |
Lysine | K | Lys |
Methionine | M | Met |
Phenylalanine | F | Phe |
Proline | P | Pro |
Serine | S | Ser |
Threonine | T | Thr |
Tryptophan | W | Trp |
Tyrosine | Y | Tyr |
Valine | V | Val |
IUPAC ambiguity codes for nucleotides
In contrast to the original ambiguity codes, 'T' and 'U' are not equivalent.
This enables the distinction between Thymine and Uracil within a search.
Nucleotide(s) | Code |
---|---|
Adenine | A |
Cytosine | C |
Guanine | G |
Thymine | T |
Uracil | U |
A or C | M |
A or C or G | V |
A or C or T or U | H |
A or G | R |
A or G or T or U | D |
A or T or U | W |
C or G | S |
C or G or T or U | B |
C or T or U | Y |
G or T or U | K |
any nucleotide | N |
gap | . |
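In regular-expression terms, each ambiguity code in the table corresponds to a character class. The sketch below, in Python, expands a nucleotide pattern accordingly; the names `AMBIGUITY` and `nucleotide_pattern_to_regex` are hypothetical. It assumes that 'N' means any of A, C, G, T, or U, and it does not handle the gap symbol '.', so any character outside the table raises a `KeyError`.

```python
import re

# Ambiguity codes from the table above; 'T' and 'U' are kept distinct.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "V": "ACG", "H": "ACTU", "R": "AG", "D": "AGTU",
    "W": "ATU", "S": "CG", "B": "CGTU", "Y": "CTU", "K": "GTU",
    "N": "ACGTU",  # assumption: "any nucleotide" covers both T and U
}

def nucleotide_pattern_to_regex(pattern: str) -> str:
    """Expand IUPAC ambiguity codes into regular-expression character classes."""
    parts = []
    for code in pattern.upper():
        bases = AMBIGUITY[code]            # KeyError for codes not in the table
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

# Usage: 'Y' accepts C, T, or U, while a plain 'T' does not match 'U'.
regex = nucleotide_pattern_to_regex("GAY")
matches_u = re.fullmatch(regex, "GAU") is not None
t_is_not_u = re.fullmatch(nucleotide_pattern_to_regex("T"), "U") is None
```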