In summary, we can derive very little useful information directly from a sequence! We can of course make predictions, but little of value can be derived unequivocally from the sequence data alone.
We can calculate profiles based on physico-chemical properties of amino acids, such as hydrophobicity ("hydropathy profiles"). However, these are of little intrinsic use; their value lies in the predictions one can make from them.
In the case of profiles, one frequently uses a sliding window:
That is, one calculates the average of a property over a window containing an odd number of amino acids (say 9) and assigns the value to the middle residue of the window. The window is then slid one position along the sequence and a new value is calculated. For example, for a window size of 9 (N = 4, i.e. a window of 2N + 1 residues), we first calculate an average value for residues 1-9 and assign it to residue 5; then we average residues 2-10 and assign the result to residue 6, and so on.
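The sliding-window averaging described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sliding_window_profile` and `hydropathy_profile` are hypothetical, and the Kyte-Doolittle hydropathy scale is an assumed choice, since no particular scale is named here. Residues closer than N positions to either end of the sequence simply receive no value in this version.

```python
# Kyte-Doolittle hydropathy values (assumed scale; any per-residue
# property table could be substituted).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_profile(values, half_width=4):
    """Average `values` over windows of 2 * half_width + 1 positions.

    Returns a list of (position, mean) pairs, where `position` is the
    1-based index of the middle element of each window.
    """
    window = 2 * half_width + 1
    profile = []
    # Slide the window one position at a time along the sequence.
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        # Assign the average to the middle member of the window.
        profile.append((start + half_width + 1, mean))
    return profile


def hydropathy_profile(sequence, half_width=4):
    """Hydropathy profile of a one-letter amino acid sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return sliding_window_profile(values, half_width)


# Eleven "residues" with property values 1..11 and a window of 9 (N = 4):
# residues 1-9 are averaged and assigned to residue 5, residues 2-10 to
# residue 6, and residues 3-11 to residue 7.
print(sliding_window_profile(list(range(1, 12)), half_width=4))
# [(5, 5.0), (6, 6.0), (7, 7.0)]
```

Changing `half_width` is the simplest way to explore the trade-off between smoothing and loss of local detail.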
| A large window: | removes noise |
| | see major features/trends more easily |
| | lose small/local features/trends |
| A small window: | noisy data |
| | features/trends difficult to see |
All predictions are based on our knowledge of existing protein structures.