Pages 1641-1649
© 2002 Oxford University Press
Modeling the percolation of annotation errors in a database of protein sequences
1 Medical Research Council Biostatistics Unit, Cambridge
2 Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
3 Statistics Unit, Public Health Laboratory Service, London, UK
Received on April 5, 2002; revised on May 30, 2002; accepted on June 6, 2002
Public sequence databases contain information on the sequence,
structure and function of proteins. Genome sequencing projects
have led to a rapid increase in protein sequence information,
but reliable, experimentally verified, information on protein
function lags a long way behind. To address this deficit,
functional annotation in protein databases is often inferred
by sequence similarity to homologous, annotated proteins, with
the attendant possibility of error. The functional
annotation of these homologous proteins may itself have been
acquired through sequence similarity to yet other proteins,
and it is generally not possible to determine how the functional
annotation of any given protein has been acquired. Thus the
possibility of chains of misannotation arises, a process we
term ‘error percolation’. Under some simple
assumptions, we develop a dynamical probabilistic model for these
misannotation chains. Exploring the consequences of the model
for annotation quality shows that this iterative approach
leads to a systematic deterioration of database quality.
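
The percolation process described above can be illustrated with a toy simulation. This is a minimal sketch under assumptions of our own choosing, not the model developed in the paper: the database starts with a set of experimentally verified (correct) entries, each new entry copies its annotation from one existing entry chosen uniformly at random, and each transfer independently introduces an error with a fixed probability. The function name and the parameters n_verified, n_inferred and p_transfer_error are hypothetical.

```python
import random


def simulate_percolation(n_verified=100, n_inferred=1000,
                         p_transfer_error=0.05, seed=0):
    """Toy illustration of error percolation (assumed model, not the paper's).

    Returns the fraction of erroneous annotations in the database
    after each inferred entry has been added.
    """
    rng = random.Random(seed)
    # True means the annotation is wrong; verified entries are assumed correct.
    is_wrong = [False] * n_verified
    n_wrong = 0
    error_fraction = []
    for _ in range(n_inferred):
        # Pick an existing entry uniformly at random as the annotation source.
        source_wrong = is_wrong[rng.randrange(len(is_wrong))]
        # The new annotation is wrong if it inherits an error from its source
        # or if the similarity-based transfer itself is mistaken.
        new_wrong = source_wrong or (rng.random() < p_transfer_error)
        is_wrong.append(new_wrong)
        n_wrong += new_wrong
        error_fraction.append(n_wrong / len(is_wrong))
    return error_fraction


if __name__ == "__main__":
    trace = simulate_percolation()
    print(f"error fraction after {len(trace)} inferred entries: {trace[-1]:.3f}")
```

Because errors in a source entry are inherited by every entry annotated from it, the error fraction in this sketch can exceed the per-transfer error rate as the proportion of inferred entries grows, which is the qualitative effect the abstract calls error percolation.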