Text Mining for Biomedical Knowledge Discovery

3 Bioinformatics Background

3.3. Biomedical Knowledge Representation and Discovery

3.3.2. Text Mining for Biomedical Knowledge Discovery

Medical and biological research has advanced to an unprecedented level in recent decades, opening the way to understanding the mechanisms that underlie health and disease in living organisms. Incorporating these insights into a unified framework that improves understanding and decision-making is a major challenge for the research community. Effective drug discovery and development demands methods that integrate patient data with clinical data, as well as efficient literature mining, to estimate the effectiveness and safety of new molecules and treatment strategies. Text mining enables users to keep up with the latest literature, review a wider range of publications, and search for contextual factors that may have become important after a database was created. Crucial evidence on the clinical use of genomic abnormalities is largely reported in the biomedical literature. Biocurators, clinicians, and oncologists are increasingly unable to keep up with the rapidly growing amount of information, particularly on the therapeutic implications of biomarkers, which are relevant for treatment selection. Access to the most in-depth information on gene-disease and genotype-phenotype associations is a key factor in precision medicine.

Nevertheless, many of the sources required to understand the relationships between genotype and phenotype consist of unstructured text, which is difficult to analyze.

Text mining, also known as text data mining, is the process of analyzing vast amounts of unstructured text in order to convert it into structured data, derive valuable insights, and mine knowledge. In short, it means searching the text for something valuable and bringing the text into a form that can be analyzed.
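As a minimal sketch of what converting unstructured text into structured data can look like, the following Python example pulls mentions of the form "GENE X123Y" out of a sentence and returns them as records. The regular expression, the function name `extract_mutations`, and the example sentence are assumptions made for illustration; real mutation mentions are far more varied than this pattern allows.

```python
import re

# Hypothetical pattern for illustration: a gene-like symbol followed by a
# protein-level substitution such as "V600E" (one-letter amino acid codes).
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MUTATION = re.compile(
    rf"\b([A-Z][A-Z0-9]{{1,7}})\s+([{AMINO}])(\d+)([{AMINO}])\b"
)

def extract_mutations(text):
    """Return one structured record per 'GENE X123Y' mention in the text."""
    return [
        {"gene": gene, "ref": ref, "position": int(pos), "alt": alt}
        for gene, ref, pos, alt in MUTATION.findall(text)
    ]

# Invented example sentence, not taken from the literature.
sentence = (
    "Patients carrying BRAF V600E responded, "
    "whereas EGFR T790M was linked to resistance."
)
records = extract_mutations(sentence)
for record in records:
    print(record)
```

Each record can then be stored in a table or database and queried like any other structured data, which is the step that free text does not allow directly.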

A common definition of text mining is provided by Hearst [64]:

“Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.”

Text mining uses various computational techniques, including machine learning, natural language processing (NLP) [65], named entity recognition, relationship extraction, and hypothesis generation, to find information hidden in unstructured text. Artificial intelligence (AI) is a field of computer science that has highlighted the need for machine learning (ML) methods in daily life [66]. Machine learning is the most common approach in text analytics.

Machine learning is based on a set of statistical and mathematical techniques for identifying different aspects of text. Machine learning approaches can analyze data automatically and recognize hidden patterns in large data sets that a human reader could not find [67].
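One simple statistical technique of this kind is TF-IDF weighting, which scores a term highly when it is frequent in one document but rare across the collection. The sketch below is a plain-Python illustration on an invented toy corpus; it is not the method used in the cited works, and the function names `tokenize` and `tf_idf` are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus of invented sentences standing in for abstracts.
abstracts = [
    "BRCA1 mutation increases breast cancer risk",
    "TP53 mutation is frequent in lung cancer",
    "Aspirin reduces fever",
]
weights = tf_idf(abstracts)
# "brca1" occurs in one document, "mutation" in two, so it is weighted higher.
print(weights[0]["brca1"] > weights[0]["mutation"])
```

Weights like these are commonly used as input features for classifiers or for ranking documents against a query.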

Biomedical research focuses on studying biological processes and investigating the underlying causes of diseases. It is an interdisciplinary domain that bridges molecular biology, genetics, and biochemistry with medicine. Studying how various chemical substances control biological processes helps to develop treatments and find new ways to diagnose diseases.

Literature is an essential way to report and publish experimental results, which makes it a rich source of valuable knowledge. With the enormous amount of biomedical literature and the rapid growth in the number of new publications, a wealth of scientific knowledge is scattered across multiple biomedical repositories of textual data and research articles. Biomedical knowledge is typically represented in text using natural language. Extracting relevant information and analyzing text data helps to discover relationships between biological entities, such as gene-to-disease and disease-to-drug associations, and to generate new hypotheses.
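A common baseline for suggesting such relationships is to count how often two entities are mentioned in the same sentence. The sketch below assumes two tiny hand-made lexicons (`GENES`, `DISEASES`) and invented sentences; real systems rely on curated vocabularies and trained entity recognizers, and co-occurrence alone does not establish that a relationship exists.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-lexicons, for illustration only.
GENES = {"brca1", "tp53", "egfr"}
DISEASES = {"breast cancer", "lung cancer"}

def find_terms(sentence, lexicon):
    """Return the lexicon entries that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return {
        term for term in lexicon
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }

def cooccurrences(sentences):
    """Count gene-disease pairs that are mentioned in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        genes = find_terms(sentence, GENES)
        diseases = find_terms(sentence, DISEASES)
        pairs.update(product(sorted(genes), sorted(diseases)))
    return pairs

# Invented example sentences.
sentences = [
    "BRCA1 variants are associated with breast cancer.",
    "Loss of BRCA1 function was observed in breast cancer samples.",
    "EGFR inhibitors are used in lung cancer; TP53 status was also recorded.",
]
pair_counts = cooccurrences(sentences)
for (gene, disease), count in pair_counts.most_common():
    print(gene, disease, count)
```

Pairs with high counts are candidates for closer inspection. The third sentence shows the limit of the approach: TP53 and lung cancer are counted together although the sentence states no association between them.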

Searching the literature is also a core component of finding relevant information in any new scientific discovery process. Biomedical literature is unique because of its exponentially growing scale and interdisciplinary nature. It is not only a tool for finding the newest answers but also a record of what researchers have concerned themselves with over the decades [68].

Access to biomedical literature is crucial for multiple user types, particularly clinicians, bioinformaticians, and biomedical researchers. The biomedical field has been greatly supported by the efforts of the US National Library of Medicine to provide bibliographic material, particularly abstracts, for most journal papers [69]. PubMed [70] has been the most widely used search tool devoted to life sciences and biomedical literature [71]. PubMed covers this vast literature thoroughly, as it contains citations from all around the world.

PubMed is managed by the National Library of Medicine (NLM), which belongs to the US National Institutes of Health (NIH) [72].

The volume of biomedical literature in electronic format is growing with the ease of Internet access [73]. PubMed/MEDLINE® includes millions of citations from MEDLINE®, life science journals, and online books for biomedical literature (Figure 7) [74]. MEDLINE® [75] is the National Library of Medicine (NLM) journal citation database. PubMed has been the key resource for searching and retrieving biomedical literature electronically since its foundation [76][77].

A large number of biomedical text-mining studies have focused solely on processing abstracts from PubMed. Working with abstracts is advantageous because they are publicly and freely available and summarize the main points of their associated articles, which makes them rich in information content. MeSH® (Medical Subject Headings) [78] is the controlled vocabulary system used to index PubMed articles. The MeSH® terms provide consistency in biomedical literature indexing, which facilitates the retrieval of relevant articles. MeSH® contains distinct types of terms that help to improve search results. Additional biomedical search tools that index biomedical articles are available, such as Embase (Excerpta Medica dataBASE) [73], a multipurpose database of biomedical citations, and others that are less commonly used.
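To illustrate how MeSH® indexing supports retrieval, the following sketch builds a PubMed query that restricts each term to the MeSH index using the `[MeSH Terms]` search field tag, and wraps it in a request URL for the NCBI E-utilities ESearch service. The helper names `mesh_query` and `esearch_url` and the chosen headings are assumptions for illustration. The code only constructs the URL and sends no request.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def mesh_query(*mesh_terms):
    """Combine MeSH headings with AND, tagging each with [MeSH Terms]."""
    return " AND ".join(f'"{term}"[MeSH Terms]' for term in mesh_terms)

def esearch_url(query, max_results=20):
    """Build (but do not send) an ESearch request for the PubMed database."""
    params = {
        "db": "pubmed",         # search PubMed
        "term": query,          # the query string, URL-encoded below
        "retmax": max_results,  # maximum number of PMIDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"

# Example headings chosen for illustration.
query = mesh_query("Breast Neoplasms", "Mutation")
url = esearch_url(query, max_results=5)
print(query)
print(url)
```

Because every indexed citation carries the same controlled headings regardless of the wording its authors used, a query of this form retrieves articles that a free-text search for one particular synonym might miss.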

Figure 7. Number of indexed¹ citations added to MEDLINE per fiscal year since 1995 [77].

1 “Indexed citations are those citations selected for MEDLINE that have completed processing and indexing with current MeSH® (Medical Subject Headings®). Indexed citations have a status of MEDLINE” [77].

From the document Knowledge Integration and Representation for Biomedical Analysis (pages 35-38)