[1] A biocurator is a professional scientist who curates, collects, annotates, and validates information that is disseminated by biological and model organism databases.
[9][10] In 2011, biocuration was already recognized as a profession, but there were no formal degree courses to prepare curators for biological data in a targeted fashion.
Some notable examples of model organism databases are FlyBase,[19] PomBase,[20] and ZFIN,[21] dedicated to curating information about Drosophila, Schizosaccharomyces pombe, and zebrafish, respectively.
Biocuration is the integration of biological information into on-line databases in a semantically standardized way, using appropriate unique traceable identifiers, and providing necessary metadata including source and provenance.
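As a rough illustration of what such a standardized record might contain, the following Python sketch models a single curated statement with prefixed identifiers and provenance fields. The class name `CuratedAnnotation`, its field names, the identifier pattern, and the placeholder reference and curator are assumptions made for this example; they are not the schema of any particular database.

```python
import re
from dataclasses import dataclass

# Assumed pattern for "prefix:local_id" identifiers such as "GO:0006915".
CURIE_PATTERN = re.compile(r"^[A-Za-z][\w.]*:\S+$")


@dataclass(frozen=True)
class CuratedAnnotation:
    """One curated statement; all field names are illustrative assumptions."""
    subject_id: str   # identifier of the annotated entity, e.g. a protein
    term_id: str      # identifier of the ontology term
    source: str       # where the statement comes from, e.g. a publication
    curated_by: str   # provenance: who or what made the annotation

    def __post_init__(self):
        # Reject identifiers that are not in prefix:local_id form.
        for field_name in ("subject_id", "term_id", "source"):
            value = getattr(self, field_name)
            if not CURIE_PATTERN.match(value):
                raise ValueError(
                    f"{field_name} is not a prefixed identifier: {value!r}"
                )


example = CuratedAnnotation(
    subject_id="UniProtKB:P04637",  # human TP53
    term_id="GO:0006915",           # GO term "apoptotic process"
    source="PMID:0000000",          # placeholder, not a real reference
    curated_by="example-curator",   # placeholder curator name
)
```

Keeping the identifiers and the source as separate, validated fields is what makes a statement traceable: each value can be resolved against the resource named by its prefix.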
These domains include genomics and proteomics, anatomy, animal and plant development, biochemistry, metabolic pathways, taxonomic classification, and mutant phenotypes.
For example, the Gene Ontology (GO) curates terms for biological processes, which are used to describe what is known about specific genes. As of 2021, life sciences communication is still done primarily in free natural languages, such as English or German, which carry a degree of ambiguity and make it hard to connect knowledge.
[59][60] However, most databases offer highly structured data that is searchable in complex combinations, which is usually not possible on Wikipedia, although Wikidata aims at solving this problem to some extent.
A few examples are: Natural-language processing and text mining technologies can help biocurators extract information for manual curation.
[80] Text mining can scale curation efforts, for example by supporting the identification of gene names and the partial inference of ontologies.
[81][82] The conversion of unstructured assertions to structured information makes use of techniques like named entity recognition and parsing of dependencies.
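To make the idea of named entity recognition concrete, the sketch below tags entities by simple dictionary lookup. This is a deliberately simplified stand-in: production systems typically rely on trained statistical or neural models, and dependency parsing is not covered here. The lexicon contents, the label names, and the function `find_entities` are assumptions made for illustration only.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to an entity type.
LEXICON = {
    "TP53": "GENE",
    "BRCA1": "GENE",
    "zebrafish": "SPECIES",
    "apoptosis": "PROCESS",
}


def find_entities(text, lexicon=LEXICON):
    """Return (start, end, surface form, label) tuples for lexicon matches."""
    # Try longer names first so that they win over shorter overlapping ones.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"
    )
    return [
        (m.start(), m.end(), m.group(1), lexicon[m.group(1)])
        for m in pattern.finditer(text)
    ]


sentence = "TP53 regulates apoptosis in zebrafish."
entities = find_entities(sentence)
for start, end, surface, label in entities:
    print(f"{surface} [{label}] at {start}-{end}")
```

The character offsets returned with each match are what allow a curator, or a downstream tool, to link the structured annotation back to the exact span of the source text.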
[83] Text-mining of biomedical concepts faces challenges regarding variations in reporting, and the community is working to increase the machine-readability of articles.
[84] During the COVID-19 pandemic, biomedical text mining was heavily used to cope with the large volume of published scientific research about the topic (over 50,000 articles).
[85] The popular Python NLP package spaCy has an adaptation for biomedical texts, scispaCy, which is maintained by the Allen Institute for AI.
[87] A complementary approach to biocuration via text mining involves applying optical character recognition to biomedical figures, coupled to automatic annotation algorithms.
[88] Suggestions to improve the written text to facilitate annotations range from using controlled natural languages[89] to providing clear association of concepts (such as genes and proteins) with the particular species of interest.
[91] The main goal of the challenge is to foster the development of advanced computational tools that can effectively extract information from the vast amount of biological data available.
The BioCreative Challenge is organized into several subtasks that cover various aspects of text mining and information extraction in the life sciences.
Participants in the challenge are provided with a set of annotated data to develop and test their systems, and their performance is evaluated based on various metrics, such as precision, recall, and F-score.
[91] The BioCreative Challenge has led to the development of many innovative text mining and information extraction systems that have greatly improved the efficiency and accuracy of biocuration efforts.
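The evaluation metrics named above (precision, recall, and F-score) can be illustrated with a short Python sketch. It assumes that system output and gold-standard annotations are compared as sets of exact-match items; the function name `precision_recall_f1` and the example annotations are hypothetical, and the scoring rules of an actual challenge task may differ.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and balanced F-score (F1) for two sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold-standard items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (document, gene mention) pairs, for illustration only.
gold = {("doc1", "TP53"), ("doc1", "BRCA1"), ("doc2", "MYC")}
predicted = {("doc1", "TP53"), ("doc2", "MYC"), ("doc2", "EGFR"), ("doc2", "KRAS")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the four predictions are correct and two of the three gold annotations are recovered, so precision and recall differ; the F-score summarizes both in a single number.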