Pattern Syntax

Sequence patterns can be specified in three different formats:

The PROSITE format was developed for the PROSITE sequence motif database.
Amino acids are indicated by the standard IUPAC one-letter codes.
Nucleotides are indicated by the IUPAC ambiguity codes.
(Note: Nucleotide support is a special JenaLib addition; the original PROSITE format was developed only for protein sequences.)
The letter 'x' is used for a position where any amino acid is accepted.
Ambiguities are indicated by listing the acceptable amino acids for a given position.
- Positive list, in square brackets '[ ]' example:
```
 '[APS]' stands for 'Alanine or Proline or Serine' 
```
- Negative list, in curly brackets '{ }' example:
```
 '{VR}' stands for 'all amino acids except for Valine and Arginine'
```
Each element in a pattern is separated from its neighbor by a hyphen '-'. example:
```
 '[APS]-x-{VR}'
```
(Note: For JenaLib pattern searches the hyphen is not necessary.)

Repetition of an element of the pattern is indicated by adding a numerical value or range in parentheses '( )'.

examples:

 'x(3)' stands for 'any 3 amino acids'
 'x(2,4)' stands for 'any 2 or 3 or 4 amino acids'
 'A(3)' stands for '3 Alanines'
 '[APS](3)' stands for '3 positions, each Alanine or Proline or Serine' (e.g.: Ala-Ala-Pro)

A pattern that should start at the beginning of the sequence (N-terminus, 5'-end) starts with a '<'. example:
```
 '<H(6)' stands for 'a sequence that starts with 6 Histidines'
```
A pattern that should end at the end of the sequence (C-terminus, 3'-end) ends with a '>'. example:
```
 'A(3)>' stands for 'a sequence that ends with 3 Alanines'
```
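The PROSITE elements described above map fairly directly onto regular-expression syntax. The following Python function is a minimal sketch of such a translation, not JenaLib's actual converter: the name `prosite_to_regex` is hypothetical, only the elements listed in this section are handled, and it assumes that only the lowercase 'x' acts as the "any residue" symbol.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Illustrative sketch only; covers the elements described in this section.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":
            pass                               # hyphens only separate elements
        elif ch == "<":
            out.append("^")                    # anchor at sequence start
        elif ch == ">":
            out.append("$")                    # anchor at sequence end
        elif ch == "x":
            out.append(".")                    # any single residue
        elif ch == "[":
            j = pattern.index("]", i)
            out.append("[" + pattern[i + 1:j] + "]")    # positive list
            i = j
        elif ch == "{":
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")   # negative list
            i = j
        elif ch == "(":
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")    # repetition count or range
            i = j
        elif ch.isalpha():
            out.append(ch)                     # literal one-letter code
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
        i += 1
    return "".join(out)

# Usage: find a motif in a made-up protein sequence.
regex = prosite_to_regex("[APS]-x-{VR}")
match = re.search(regex, "MKAGLT")
```

Because hyphens are simply skipped, `[APS]-x-{VR}` and `[APS]x{VR}` produce the same expression, which mirrors the note that the hyphen is optional in JenaLib searches.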

Wildcards are special characters that stand in for other characters. They may be familiar to you from file name searches.

The question mark '?' stands for any single amino acid or nucleotide.
example:

 'AAA??PPP' stands for '3 Alanines, followed by any 2 amino acids, followed by 3 Prolines'

The asterisk '*' stands for any number of amino acids or nucleotides, including zero.
example:

 'LLL*CC' stands for '3 Leucines, followed by any number of any amino acid, followed by 2 Cysteines'
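
The two wildcards correspond to the regular-expression constructs `.` and `.*`. The Python sketch below shows one way such a translation could look; the function name `wildcard_to_regex` is hypothetical and this is not necessarily how JenaLib performs the conversion.

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate a '?' / '*' wildcard pattern into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")        # exactly one residue
        elif ch == "*":
            parts.append(".*")       # any number of residues, including zero
        else:
            parts.append(re.escape(ch))  # literal character, escaped for safety
    return "".join(parts)

# Usage on made-up sequences.
hit = re.search(wildcard_to_regex("AAA??PPP"), "GAAAKRPPPG")
zero_gap = re.search(wildcard_to_regex("LLL*CC"), "MLLLCC")
```

The second search matches even though nothing lies between the Leucines and the Cysteines, because '*' also accepts zero residues.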

Regular expressions are somewhat similar to the PROSITE format described above.
But they are much more powerful and are used more generally for all kinds of text.
Since regular expressions are a very complex topic, no attempt is made to explain them here. For an introduction, see for example the Wikipedia article on regular expressions.
Internally, all other pattern formats (PROSITE, wildcards) are converted into a regular expression.

In contrast to the original IUPAC ambiguity codes, 'T' and 'U' are not equivalent.
This makes it possible to distinguish Thymine from Uracil within a search.
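
To illustrate the effect of keeping 'T' and 'U' separate, the Python sketch below expands the standard IUPAC nucleotide codes into regular-expression character classes. The table name `IUPAC_NUCLEOTIDES` and the function `nucleotide_to_regex` are hypothetical. How JenaLib expands ambiguity codes that include Thymine (such as 'W' or 'N') is not stated here; this sketch assumes they cover 'T' only and that 'U' matches only itself.

```python
import re

# Standard IUPAC nucleotide codes; 'T' and 'U' are deliberately kept distinct.
# Assumption: ambiguity codes containing Thymine do not also match Uracil.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def nucleotide_to_regex(pattern: str) -> str:
    """Expand each nucleotide code into its regular-expression equivalent."""
    return "".join(IUPAC_NUCLEOTIDES[ch] for ch in pattern.upper())

# Usage: a pattern written with 'T' does not match an RNA sequence, and vice versa.
dna_hit = re.search(nucleotide_to_regex("TATAWR"), "GGTATAAAGG")
rna_miss = re.search(nucleotide_to_regex("TAT"), "GGUAUAAAGG")
rna_hit = re.search(nucleotide_to_regex("UAU"), "GGUAUAAAGG")
```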