GENBANK ISSUES

Background

At the FESIN/UNITE meetings in Copenhagen, there was considerable discussion about the high rate of errors in GenBank.  Martin Bidartondo (Imperial College London and Royal Botanic Gardens, Kew) drafted a letter to Science outlining these concerns and asking GenBank to consider third-party annotation as a potential solution.  The FESIN web site collected over 250 signatures for the letter (www.sciencemag.org/cgi/content/full/319/5870/161a/DC1), which was published in Science 319:1616 (2008).  A copy of the letter was sent to GenBank, which responded by declining to allow third-party annotation.  FESIN will pursue alternative ways of dealing with the problem of incorrectly named sequences.

References

Brenner, S. (1999) Trends in Genetics (TIG) 15:132-133

Harris, D. James (2003) Can you bank on GenBank? Trends in Ecology and Evolution 18(7):317-319

Nilsson, R. Henrik, Ryberg, M., Kristiansson, E., Abarenkov, K., Larsson, K.-H., Kõljalg, U. (2006) Taxonomic reliability of DNA sequences in public sequence databases: a fungal perspective. PLoS ONE 1:e59

Gilks, W. R. et al. (2002) Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics 18:1641-1649 [see abstract below]


Bioinformatics Vol. 18 no. 12 2002
Pages 1641-1649
© 2002 Oxford University Press

Modeling the percolation of annotation errors in a database of protein sequences

Walter R. Gilks 1,*,†, Benjamin Audit 2,†, Daniela De Angelis 1,3, Sophia Tsoka 2 and Christos A. Ouzounis 2

1 Medical Research Council Biostatistics Unit, Cambridge
2 Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
3 Statistics Unit, Public Health Laboratory Service, London, UK

Received on April 5, 2002; revised on May 30, 2002; accepted on June 6, 2002


Public sequence databases contain information on the sequence, structure and function of proteins. Genome sequencing projects have led to a rapid increase in protein sequence information, but reliable, experimentally verified, information on protein function lags a long way behind. To address this deficit, functional annotation in protein databases is often inferred by sequence similarity to homologous, annotated proteins, with the attendant possibility of error. Now, the functional annotation in these homologous proteins may itself have been acquired through sequence similarity to yet other proteins, and it is generally not possible to determine how the functional annotation of any given protein has been acquired. Thus the possibility of chains of misannotation arises, a process we term ‘error percolation’. With some simple assumptions, we develop a dynamical probabilistic model for these misannotation chains. By exploring the consequences of the model for annotation quality it is evident that this iterative approach leads to a systematic deterioration of database quality.
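
The idea of error percolation can be illustrated with a toy simulation. The sketch below is not the model from Gilks et al.; it is a minimal illustration under assumptions made here: the database starts with a set of correctly annotated entries, each new entry copies its annotation from a randomly chosen existing entry, and each copy has a fixed chance of being wrong. The function name and all parameters (n_seed, n_new, transfer_error, verified_fraction) are hypothetical.

```python
import random


def simulate_percolation(n_seed=100, n_new=10_000, transfer_error=0.05,
                         verified_fraction=0.0, seed=0):
    """Toy simulation of annotation-error percolation (illustrative only).

    Assumptions made for this sketch, not taken from the paper:
    - the database starts with n_seed correctly annotated entries;
    - each new entry is experimentally verified (and therefore correct)
      with probability verified_fraction;
    - otherwise it copies the annotation of a uniformly chosen existing
      entry, and the copy is wrong with probability transfer_error;
    - an annotation copied from a wrong entry is always wrong.

    Returns the fraction of wrong entries after each addition.
    """
    rng = random.Random(seed)
    correct = [True] * n_seed   # True means the entry's annotation is right
    n_wrong = 0
    history = []
    for _ in range(n_new):
        if rng.random() < verified_fraction:
            is_correct = True
        else:
            source_ok = correct[rng.randrange(len(correct))]
            is_correct = source_ok and rng.random() >= transfer_error
        correct.append(is_correct)
        if not is_correct:
            n_wrong += 1
        history.append(n_wrong / len(correct))
    return history


if __name__ == "__main__":
    history = simulate_percolation()
    for step in (100, 1_000, 10_000):
        print(f"after {step} additions: {history[step - 1]:.1%} wrong")
```

Because wrong entries can themselves be chosen as sources, the fraction of wrong entries in this toy setting tends to rise well above the per-copy error rate as the database grows. Raising verified_fraction shows how a steady supply of experimentally verified entries slows that drift.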