Secondary Structure Prediction

Various methods including: Chou & Fasman; GOR; Lim; Neural Networks (e.g. PHD, DSC, SSPRED, PSIPred)






Propensities

The propensity for amino acid i to adopt the helical conformation is calculated as:

\[
P_{\alpha,i} = \frac{\text{fraction of residue } i \text{ in helix}}{\text{fraction of all residues in helix}}
\]

For example, take alanine: if the database contains 20000 amino acids, of which 2000 are alanine, and 5000 amino acids are in a helical conformation, of which 500 are alanine, then

\[
P_{\alpha,\mathrm{Ala}} = \frac{500 / 2000}{5000 / 20000} = \frac{0.25}{0.25} = 1.0
\]
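The same calculation can be sketched in Python. The function name and argument names are illustrative only; the counts are those of the alanine example above.

```python
def helix_propensity(n_res_total, n_res_in_helix, n_all_total, n_all_in_helix):
    """Propensity of one amino acid type for the helical conformation.

    (fraction of this residue type found in helix)
    divided by (fraction of all residues found in helix).
    """
    fraction_of_residue_in_helix = n_res_in_helix / n_res_total
    fraction_of_all_in_helix = n_all_in_helix / n_all_total
    return fraction_of_residue_in_helix / fraction_of_all_in_helix


# Alanine: 2000 in the database, 500 of them helical;
# whole database: 20000 residues, 5000 of them helical.
p_ala = helix_propensity(2000, 500, 20000, 5000)
print(p_ala)  # 1.0
```

A value above 1.0 means the residue occurs in helices more often than an average residue does; a value below 1.0 means it occurs less often.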








Chou & Fasman Method

Biochemistry 13(1974), 211-222

From the propensities, classify the amino acids as:

                   Helix   Strand
Strong formers     Hα      Hβ
Weak formers       hα      hβ
Indifferent        iα      iβ
Weak breakers      bα      bβ
Strong breakers    Bα      Bβ

1. Search for "nucleation sites":

2. Resolve conflicts

If helix and strand nucleation sites are both predicted for the same residues, calculate the average Pα and Pβ over those residues and assign the conformation with the higher average.

3. Extend nucleation sites

Extend each site in both directions until the average propensity of the end tetrapeptide falls below 1.0. End residues that are breakers are not included as part of the secondary structure.

4. Predict Beta turns

For each unassigned residue, calculate a turn probability as the product of the turn propensities of four adjacent residues (note that each amino acid has a different propensity for each of the 4 positions in a turn). If the product is greater than a threshold, predict a turn.
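Steps 3 and 4 can be sketched as follows. This is one possible reading of the extension rule, not the published implementation. All propensity values, the breaker set and the function names are hypothetical, chosen only to keep the example small; they are not the Chou & Fasman tables.

```python
# Toy values for illustration only; NOT the published Chou & Fasman tables.
TOY_HELIX = {"A": 1.4, "E": 1.5, "L": 1.2, "G": 0.6, "P": 0.6, "S": 0.7}
TOY_HELIX_BREAKERS = {"G", "P"}  # hypothetical breaker set


def mean_propensity(seq, start, end, table):
    """Average propensity over seq[start:end]."""
    return sum(table[aa] for aa in seq[start:end]) / (end - start)


def extend_site(seq, start, end, table, breakers, threshold=1.0):
    """Extend the half-open segment [start, end), assumed >= 3 residues long."""
    # Grow leftwards while the tetrapeptide at the new left end averages >= threshold.
    while start > 0 and mean_propensity(seq, start - 1, start + 3, table) >= threshold:
        start -= 1
    # Grow rightwards under the same rule.
    while end < len(seq) and mean_propensity(seq, end - 3, end + 1, table) >= threshold:
        end += 1
    # Breaker residues at either end are not part of the element.
    while start < end and seq[start] in breakers:
        start += 1
    while end > start and seq[end - 1] in breakers:
        end -= 1
    return start, end


# Toy position-specific turn values, one per turn position; also hypothetical.
TOY_TURN = {"N": (0.5, 0.25, 0.5, 0.25), "G": (0.25, 0.5, 0.5, 0.5)}


def turn_probability(seq, i, turn_table):
    """Product of the position-specific turn values for residues i..i+3."""
    p = 1.0
    for position in range(4):
        p *= turn_table[seq[i + position]][position]
    return p


seq = "GPAELAELAGSP"
helix = extend_site(seq, 3, 7, TOY_HELIX, TOY_HELIX_BREAKERS)
print(helix, seq[helix[0]:helix[1]])          # (2, 9) AELAELA
print(turn_probability("NGGN", 0, TOY_TURN))  # 0.03125
```

In the example the nucleation site at residues 3 to 6 (zero-based) grows to cover the whole stretch between the flanking breakers, and the G and P residues picked up at the ends are then trimmed.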





Conceptually elegant method using ideas which are important in real protein folding.

(Variations include introducing a "very weak helix former" class and requiring that nucleation sites contain no breakers.)








Garnier, Osguthorpe, Robson (GOR) Method

J. Mol. Biol. 120(1978), 97-120

The probability of residue i being in structure class s (α-helix, β-strand, turn, coil) depends on:

From a set of known structures, using a 17-residue window, a 20x17 matrix of frequencies is calculated for each of the four structure classes.

To predict the secondary structure of a new sequence, the probability for each of the four structure classes is calculated based on the central residue and its neighbours from the pre-calculated matrices:

The class with the highest value is then selected for each residue.
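A minimal sketch of the prediction step, assuming that the table entries for the 17 window positions are simply added together to score each class. The table contents here are toy values, and the names (`empty_matrices`, `predict`) are hypothetical; a real implementation would fill the tables from known structures.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ("helix", "strand", "turn", "coil")
HALF_WINDOW = 8  # 17-residue window: the central residue plus 8 on each side


def empty_matrices():
    """One 20x17 table per class, indexed as matrices[cls][aa][offset + 8]."""
    return {cls: {aa: [0.0] * 17 for aa in AMINO_ACIDS} for cls in CLASSES}


def predict(seq, matrices):
    """Score every class at every residue and keep the highest-scoring class."""
    prediction = []
    for i in range(len(seq)):
        scores = {}
        for cls in CLASSES:
            total = 0.0
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
                j = i + offset
                if 0 <= j < len(seq):  # window positions beyond the termini are skipped
                    total += matrices[cls][seq[j]][offset + HALF_WINDOW]
            scores[cls] = total
        prediction.append(max(CLASSES, key=scores.get))
    return prediction


# Toy tables: alanine anywhere in the window favours helix,
# a central valine strongly favours strand.
matrices = empty_matrices()
matrices["helix"]["A"] = [0.2] * 17
matrices["strand"]["V"][HALF_WINDOW] = 1.0

result = predict("AAVAA", matrices)
print(result)  # ['helix', 'helix', 'strand', 'helix', 'helix']
```

Ties are resolved in favour of the class listed first in `CLASSES`, which is an arbitrary choice made for this sketch.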

The method was refined by Garnier et al. (Meth. Enz. 266(1996), 540-553) and by Gibrat et al. (J. Mol. Biol. 198(1987), 425-443). The refinements include the introduction of a "decision constant": a correction factor applied to the value calculated for each class to give a normal ratio of alpha:beta:turn:coil.








Neural Network Based Methods

e.g. Rost & Sander, J. Mol. Biol., 232(1993), 584-599

"Neural networks" are computational equivalents of neurones and synapses in the brain. They consist of a number of very simple processing units ("neurodes") which take one or more inputs and generate a single output by producing a weighted sum of the inputs. These neurodes are connected in a network which is "trained" with a number of inputs and required outputs. The network "learns" the required output for a given input by optimizing the weights. Importantly the system can "generalize" rather than just being able to reproduce the training set.

A typical (very simple) three-layer neural network. Inputs are placed into the first (input) layer and the result is read from the final (output) layer. This network contains a single "hidden" layer.
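A forward pass through such a three-layer network can be sketched as below. The sigmoid applied to the weighted sum is an assumption (a common choice that the text does not specify), the weights are arbitrary, and training is not shown.

```python
import math


def neurode(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, squashed by a sigmoid (assumed activation)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))


def layer(inputs, weight_rows):
    """One layer: each row of weights defines one neurode."""
    return [neurode(inputs, row) for row in weight_rows]


def forward(inputs, hidden_weights, output_weights):
    """Input layer -> single hidden layer -> output layer."""
    hidden = layer(inputs, hidden_weights)
    return layer(hidden, output_weights)


# Toy network: 3 inputs, 2 hidden neurodes, 1 output neurode.
hidden_weights = [[0.5, -0.5, 0.5], [0.0, 0.0, 0.0]]
output_weights = [[1.0, -1.0]]
out = forward([1.0, 0.0, 1.0], hidden_weights, output_weights)
print(out)
```

Training would adjust `hidden_weights` and `output_weights` until the outputs match the required outputs for the training inputs.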

Typically the input consists of a sliding window of amino acids, with each of the twenty amino acid types encoded as a 1 or a 0 on one of 20 input neurodes. Thus with a sliding window of 9 residues, an input layer of 9x20 = 180 neurodes is used. In the case of secondary structure prediction, an output layer of 4 neurodes would be used: one each for α-helix, β-strand, turn and coil. (Frequently only 3 states are predicted; turn and coil are treated as one class.)
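The input encoding can be sketched as follows. The function name and the amino acid ordering are arbitrary, and leaving window positions beyond the chain termini as all zeros is an assumption made for this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the ordering is arbitrary but must be fixed


def encode_window(seq, centre, window=9):
    """Return window x 20 input values (0 or 1) for the window centred on `centre`."""
    half = window // 2
    inputs = []
    for j in range(centre - half, centre + half + 1):
        units = [0] * 20
        if 0 <= j < len(seq):  # positions off the end of the chain stay all-zero
            units[AMINO_ACIDS.index(seq[j])] = 1
        inputs.extend(units)
    return inputs


seq = "MKVLAAGIVGLS"
x = encode_window(seq, centre=5)
print(len(x), sum(x))  # 180 9
```

Each block of 20 values contains a single 1 marking the amino acid type at that window position, so a full 9-residue window switches on exactly 9 of the 180 input neurodes.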

The best known secondary structure prediction method using a neural network is PHD. Key to the success of this method is the use of multiple sequences.

When a sequence is entered into the program, the database of known sequence data is searched to find related sequences. These are aligned and used to calculate a conservation-based profile used as input to the neural network.
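One simple way to turn an alignment into a profile is to record the frequency of each amino acid type in each column. The sketch below assumes exactly that; it is not necessarily how PHD derives its profile, and the function name and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment):
    """Per-column amino acid frequencies for equal-length aligned sequences.

    Gap characters ('-') are ignored when counting.
    """
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(s[col] for s in alignment if s[col] != "-")
        total = sum(counts.values())
        profile.append({aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS})
    return profile


alignment = ["ACD", "ACE", "A-D", "GCD"]  # toy alignment
profile = column_profile(alignment)
print(profile[0]["A"], profile[1]["C"], profile[2]["E"])  # 0.75 1.0 0.25
```

With such a profile, each window position feeds 20 frequencies into the network in place of a single 1 among 19 zeros.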

A related method, PSIPred, uses profiles resulting directly from a database search using PSI-BLAST as input to the network rather than deriving the profile from the multiple alignment.

It is now well established that use of multiple sequence data improves the quality of secondary structure prediction.








Assessment of accuracy of secondary structure prediction

Typically Q3 values are quoted:

\[
Q_3 = \frac{q_a + q_b + q_c}{N} \times 100\%
\]

where q_s is the number of residues in secondary structure type s (a=alpha-helix, b=beta-strand, c=coil) that are correctly predicted and N is the total number of residues.
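A minimal sketch of the calculation, assuming the predicted and observed structures are given as equal-length strings over the letters a, b and c. The function name and the example strings are illustrative only.

```python
def q3(predicted, observed, classes="abc"):
    """Return (Q3 as a percentage, per-class counts of correctly predicted residues)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    q = {s: 0 for s in classes}
    for p, o in zip(predicted, observed):
        if p == o:
            q[o] += 1  # residue of class o correctly predicted
    return 100.0 * sum(q.values()) / len(observed), q


observed = "aaaabbbccc"
predicted = "aaabbbbccc"
score, per_class = q3(predicted, observed)
print(score, per_class)  # 90.0 {'a': 3, 'b': 3, 'c': 3}
```

Here one helical residue is mispredicted as strand, so 9 of the 10 residues are correct.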

A random prediction is likely to be around 38% correct.

Single-sequence statistical methods (Chou & Fasman, GOR) have an accuracy of 50-60%.

PHD and PSIPred achieve accuracy of around 70% with multiple sequences.