Pattern Syntax
Sequence patterns can be specified in three different formats:
PROSITE format
- The format was developed for the PROSITE sequence motif database.
- Amino acids are indicated by the standard IUPAC one-letter codes.
- Nucleotides are indicated by the IUPAC ambiguity codes.
(Note: Special JenaLib addition, the original PROSITE format was developed only for protein sequences.)
- The letter 'x' is used for a position where any amino acid is accepted.
- Ambiguities are indicated by listing the acceptable amino acids for a given position.
- Positive list, in square brackets '[ ]'
example:
'[APS]' stands for 'Alanine or Proline or Serine'
- Negative list, in curly brackets '{ }'
example:
'{VR}' stands for 'all amino acids except for Valine and Arginine'
- Each element in a pattern is separated from its neighbor by a hyphen '-'.
example:
'[APS]-x-{VR}'
(Note: For JenaLib pattern searches the hyphen is not necessary.)
- Repetition of an element of the pattern is indicated by adding a numerical value or range in parentheses '( )'.
examples:
'x(3)' stands for 'any 3 amino acids'
'x(2,4)' stands for 'any 2 or 3 or 4 amino acids'
'A(3)' stands for '3 Alanines'
'[APS](3)' stands for '3 times Alanine or Proline or Serine' (e.g.: Ala-Ala-Pro)
- A pattern that should start at the beginning of the sequence (N-terminus, 5'-end) starts with a '<'.
example:
'<H(6)' stands for 'a sequence that starts with 6 Histidines'
- A pattern that should end at the end of the sequence (C-terminus, 3'-end) ends with a '>'.
example:
'A(3)>' stands for 'a sequence that ends with 3 Alanines'
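The PROSITE rules listed above can be sketched as a small translator into a regular expression. This is only an illustration in Python, not JenaLib's actual converter: the function name `prosite_to_regex` and the helper `_ELEMENT` are hypothetical, only protein patterns are handled, and Python's `re` syntax is assumed as the target.

```python
import re

# Hypothetical tokenizer for one pattern element (an assumption for this
# sketch, not part of JenaLib): an anchor, a hyphen, or an element with an
# optional repetition such as (3) or (2,4).
_ELEMENT = re.compile(
    r"(?P<start><)"          # '<' : pattern must start at the sequence start
    r"|(?P<end>>)"           # '>' : pattern must end at the sequence end
    r"|(?P<sep>-)"           # '-' : separator between elements (optional)
    r"|(?:(?P<any>x)|\[(?P<pos>[A-Z]+)\]|\{(?P<neg>[A-Z]+)\}|(?P<one>[A-Z]))"
    r"(?:\((?P<min>\d+)(?:,(?P<max>\d+))?\))?"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style protein pattern into a Python regex string."""
    parts = []
    pos = 0
    while pos < len(pattern):
        m = _ELEMENT.match(pattern, pos)
        if m is None:
            raise ValueError(f"unexpected character {pattern[pos]!r} at position {pos}")
        pos = m.end()
        if m.group("start"):
            parts.append("^")
        elif m.group("end"):
            parts.append("$")
        elif m.group("sep"):
            continue  # hyphens carry no meaning of their own
        else:
            if m.group("any"):
                atom = "."                           # 'x': any residue
            elif m.group("pos"):
                atom = "[" + m.group("pos") + "]"    # positive list
            elif m.group("neg"):
                atom = "[^" + m.group("neg") + "]"   # negative list
            else:
                atom = m.group("one")                # single residue
            if m.group("min"):
                lo, hi = m.group("min"), m.group("max")
                atom += "{" + lo + ("," + hi if hi else "") + "}"
            parts.append(atom)
    return "".join(parts)


print(prosite_to_regex("[APS]-x-{VR}"))  # [APS].[^VR]
print(prosite_to_regex("<H(6)"))         # ^H{6}
print(prosite_to_regex("A(3)>"))         # A{3}$
```

Because hyphens are simply skipped, the sketch accepts patterns with or without them. Note that `.` and `[^VR]` match any character, not only valid amino acid codes, so the sequence is assumed to contain nothing but one-letter codes.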
Using wildcards
- Wildcards are special characters that substitute other characters. They may be familiar to you from file name searches.
- The question mark '?' substitutes any single amino acid or nucleotide.
example: 'AAA??PPP' stands for '3 Alanines, followed by any 2 amino acids, followed by 3 Prolines'
- The asterisk '*' substitutes any number of amino acids or nucleotides, even zero.
example: 'LLL*CC' stands for '3 Leucines, followed by any number of any amino acid, followed by 2 Cysteines'
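The two wildcard rules map directly onto regular expression syntax. The following Python sketch illustrates one possible translation; the function name `wildcard_to_regex` is hypothetical and the code is not taken from JenaLib.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Translate a '?' / '*' wildcard pattern into a Python regex string."""
    out = []
    for ch in pattern:
        if ch == "?":
            out.append(".")            # exactly one residue or nucleotide
        elif ch == "*":
            out.append(".*")           # any number of them, including zero
        else:
            out.append(re.escape(ch))  # every other character matches itself
    return "".join(out)


print(wildcard_to_regex("AAA??PPP"))  # AAA..PPP
print(wildcard_to_regex("LLL*CC"))    # LLL.*CC

# 'LLL*CC' also matches when nothing lies between the Leucines and Cysteines.
print(bool(re.search(wildcard_to_regex("LLL*CC"), "MLLLCCK")))  # True
```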
Regular expression
- Regular expressions are somewhat similar to the PROSITE format described above, but they are much more powerful and are used more generally for all kinds of text.
- Since regular expressions are a very complex topic, they are not explained here. For an introduction, see for example the Wikipedia article on regular expressions.
- Internally all other pattern formats (PROSITE, wildcards) are converted into a regular expression.
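As an illustration of how the three formats relate, the wildcard pattern 'AAA??PPP' and the PROSITE pattern 'A(3)-x(2)-P(3)' both describe the same motif as the regular expression `A{3}.{2}P{3}`. The sketch below searches a made-up sequence with that expression using Python's `re` module; the regular expression dialect that JenaLib uses is not stated here, so Python's syntax is an assumption.

```python
import re

# Made-up example sequence, for illustration only.
sequence = "MKAAAGSPPPLAAAKRPPP"

# 3 Alanines, any 2 residues, 3 Prolines.
motif = re.compile(r"A{3}.{2}P{3}")

# Report 1-based start positions, as is usual for sequence coordinates.
hits = [(m.start() + 1, m.group()) for m in motif.finditer(sequence)]
print(hits)  # [(3, 'AAAGSPPP'), (12, 'AAAKRPPP')]
```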
Appendix
IUPAC one-letter codes for amino acids
| Amino Acid | One-letter Code | Three-letter Code |
|---|---|---|
| Alanine | A | Ala |
| Arginine | R | Arg |
| Asparagine | N | Asn |
| Aspartic Acid | D | Asp |
| Cysteine | C | Cys |
| Glutamic Acid | E | Glu |
| Glutamine | Q | Gln |
| Glycine | G | Gly |
| Histidine | H | His |
| Isoleucine | I | Ile |
| Leucine | L | Leu |
| Lysine | K | Lys |
| Methionine | M | Met |
| Phenylalanine | F | Phe |
| Proline | P | Pro |
| Serine | S | Ser |
| Threonine | T | Thr |
| Tryptophan | W | Trp |
| Tyrosine | Y | Tyr |
| Valine | V | Val |
IUPAC ambiguity codes for nucleotides
In contrast to the original ambiguity codes, 'T' and 'U' are not equivalent.
This enables the distinction between Thymine and Uracil within a search.
| Nucleotide(s) | Code |
|---|---|
| Adenine | A |
| Cytosine | C |
| Guanine | G |
| Thymine | T |
| Uracil | U |
| A or C | M |
| A or C or G | V |
| A or C or T or U | H |
| A or G | R |
| A or G or T or U | D |
| A or T or U | W |
| C or G | S |
| C or G or T or U | B |
| C or T or U | Y |
| G or T or U | K |
| any nucleotide | N |
| gap | . |
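The table above can be turned into regular expression character classes, which is one way a nucleotide pattern could be expanded before searching. The Python sketch below is an illustration under stated assumptions, not JenaLib's implementation: the names `IUPAC_NUCLEOTIDES` and `nucleotide_to_regex` are hypothetical, the searched sequence is assumed to contain only A, C, G, T and U, and the gap code '.' is left out because its matching behavior is not described here.

```python
import re

# Code -> accepted bases, copied from the table above.
# 'T' and 'U' are kept distinct; ambiguity codes that include T also include U.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "V": "ACG", "H": "ACTU", "R": "AG", "D": "AGTU",
    "W": "ATU", "S": "CG", "B": "CGTU", "Y": "CTU", "K": "GTU",
    "N": "ACGTU",  # assumption: 'any nucleotide' means any of the five bases
}


def nucleotide_to_regex(pattern: str) -> str:
    """Expand each IUPAC nucleotide code into a regex character class."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_NUCLEOTIDES[code]  # raises KeyError for unknown codes
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


print(nucleotide_to_regex("GAY"))     # GA[CTU]
print(nucleotide_to_regex("TATAWR"))  # TATA[ATU][AG]

# A plain 'T' in the pattern does not match Uracil.
print(bool(re.fullmatch(nucleotide_to_regex("T"), "U")))  # False
```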