As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics.
A comprehensive overview of the development and uptake of NLP methods applied to free-text clinical notes related to chronic diseases has been published.
[31] These two tasks are representative of supervised and unsupervised methods, respectively, yet the goal of both is to produce subsets of documents based on their distinguishing features.
IE processes can involve several or all of the above activities, including named entity recognition, relationship discovery, and document classification, with the overall goal of translating text to a more structured form, such as the contents of a template or knowledge base.
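The idea of translating text into a more structured form can be illustrated with a minimal sketch. It assumes a tiny hand-written lexicon and a single trigger-phrase pattern, both hypothetical; the names `LEXICON`, `TREATS_PATTERN`, `recognize_entities`, and `extract_treatments` are invented for this illustration and are not taken from any particular system. Real IE pipelines rely on curated vocabularies and trained models rather than a lookup table.

```python
import re

# Hypothetical mini-lexicon; a real system would draw on a curated vocabulary.
LEXICON = {
    "aspirin": "Drug",
    "ibuprofen": "Drug",
    "headache": "Disease",
    "fever": "Disease",
}

# Hypothetical trigger phrases that signal a "treats" relationship.
TREATS_PATTERN = re.compile(
    r"\b(\w+)\s+(?:treats|relieves|reduces)\s+(\w+)\b", re.IGNORECASE
)


def recognize_entities(sentence):
    """Named entity recognition by dictionary lookup: (surface form, type) pairs."""
    entities = []
    for token in re.findall(r"\w+", sentence):
        entity_type = LEXICON.get(token.lower())
        if entity_type is not None:
            entities.append((token, entity_type))
    return entities


def extract_treatments(sentence):
    """Relationship discovery: fill one template record per matching phrase."""
    records = []
    for subject, obj in TREATS_PATTERN.findall(sentence):
        subject, obj = subject.lower(), obj.lower()
        # Keep the match only if both arguments have the expected entity types.
        if LEXICON.get(subject) == "Drug" and LEXICON.get(obj) == "Disease":
            records.append({"relation": "treats", "drug": subject, "disease": obj})
    return records


text = "Aspirin relieves headache. Ibuprofen reduces fever in most patients."
print(recognize_entities(text))
print(extract_treatments(text))
```

Each returned dictionary plays the role of a filled template, which could then be stored as a row in a knowledge base.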
For more fine-grained results, some applications permit users to search with natural language queries and identify specific biomedical relationships.
[38] On 16 March 2020, the National Library of Medicine and others launched the COVID-19 Open Research Dataset (CORD-19) to enable text mining of the current literature on the novel virus.
These items include annotated corpora, sources of biomedical research literature, and resources frequently used as vocabulary and/or ontology references, such as MeSH.
[95] These methods provide a foundation for systematic searches of overlooked scientific and biomedical literature, which may reveal significant associations between research findings.
Methods for determining the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature have been developed.
[citation needed] The search engine PIE was developed to identify and return protein-protein interaction mentions from MEDLINE-indexed articles.
To establish a benchmark, they investigated different domain vocabularies, text representation schemes, and ranking algorithms in order to find the best approach for identifying disease-causing genes.
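One possible combination of text representation and ranking algorithm is TF-IDF weighting with cosine similarity, sketched below. This is only an illustration of the general approach, not the configuration the benchmark study selected. The gene names and their associated text are placeholders, and the helper names (`build_idf`, `vectorize`, `cosine`, `rank_genes`) are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; real systems map text to a domain vocabulary."""
    return text.lower().split()


def build_idf(token_lists):
    """Smoothed inverse document frequency for every term in the gene profiles."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def vectorize(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the profiles are ignored."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_genes(query, gene_texts):
    """Rank genes by similarity between the query and each gene's text profile."""
    token_lists = {gene: tokenize(text) for gene, text in gene_texts.items()}
    idf = build_idf(list(token_lists.values()))
    query_vec = vectorize(tokenize(query), idf)
    scores = {
        gene: cosine(query_vec, vectorize(tokens, idf))
        for gene, tokens in token_lists.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder gene names and invented text profiles, for illustration only.
gene_texts = {
    "GENE_A": "cardiac muscle contraction heart failure",
    "GENE_B": "retina photoreceptor vision loss",
    "GENE_C": "heart rhythm ion channel",
}
ranking = rank_genes("heart failure", gene_texts)
for gene, score in ranking:
    print(gene, round(score, 3))
```

Swapping in a different tokenizer, weighting scheme, or similarity function changes the ranking, which is why such choices are compared against a benchmark.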
This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools.
The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases.
The text mining study validated existing relationships and pointed to previously unrecognized biological processes in cardiovascular pathophysiology.
[94] Search engines designed to retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches.
[107] Some search engines, such as Essie,[108] OncoSearch,[109] PubGene,[110][111] and GoPubMed,[112] were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.
Though these records generally include structured components with predictable formats and data types, the remainder of the reports are often free-text and difficult to search, leading to challenges with patient care.
SwellShark[118] is a framework for biomedical NER that requires no human-labeled data but does make use of resources for weak supervision (e.g., UMLS semantic types).
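The general idea of weak supervision can be sketched as a set of labeling functions whose votes are combined into training labels without any hand-annotated text. The example below is a simplified illustration and does not reproduce SwellShark's implementation or API: the term lists stand in for vocabulary entries grouped by semantic type, the suffix rule is a made-up heuristic, and majority voting is used purely for simplicity in place of a more elaborate label model.

```python
from collections import Counter

ABSTAIN = None  # A labeling function returns this when it has no opinion.

# Hypothetical term lists standing in for entries grouped by semantic type.
DISEASE_TERMS = {"asthma", "diabetes", "hypertension"}
NON_DISEASE_TERMS = {"patient", "study", "treatment"}


def lf_disease_dictionary(token):
    """Vote DISEASE when the token appears in the disease term list."""
    return "DISEASE" if token.lower() in DISEASE_TERMS else ABSTAIN


def lf_non_disease_dictionary(token):
    """Vote O (outside any entity) for common non-disease terms."""
    return "O" if token.lower() in NON_DISEASE_TERMS else ABSTAIN


def lf_disease_suffix(token):
    """Made-up heuristic: some suffixes often mark disease names."""
    return "DISEASE" if token.lower().endswith(("itis", "emia", "osis")) else ABSTAIN


LABELING_FUNCTIONS = [
    lf_disease_dictionary,
    lf_non_disease_dictionary,
    lf_disease_suffix,
]


def weak_label(token):
    """Combine labeling-function votes by simple majority; ABSTAIN if none vote."""
    votes = [lf(token) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    # On a tie, Counter keeps the label that was voted first.
    return Counter(votes).most_common(1)[0][0]


tokens = "The patient presented with asthma and bronchitis".split()
print([(token, weak_label(token)) for token in tokens])
```

The resulting noisy labels would then serve as training data for a conventional NER model, which is where the saving in manual annotation effort comes from.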
The SparkText framework[119] uses Apache Spark data streaming, a NoSQL database, and basic machine learning methods to build predictive models from scientific articles.
A variety of academic journals publishing manuscripts on biology and medicine include topics in text mining and natural language processing software.