

6.2.4. Biomedical Embeddings Generated from PubMed/MEDLINE® Abstracts

We used the word2vec implementation to generate a biomedical embedding from MEDLINE® abstracts that are linked to the HumanPSD database components. Generating embeddings that represent biomedical concepts in a low-dimensional space involves several steps.

We started with a preprocessing phase to clean and normalize the text before feeding it into the training model, and we generated two embedding versions for comparison purposes. The overall workflow is shown in Figure 44. We developed a pipeline to process the text corpus and to generate the word2vec embedding. The implementation is based on Gensim [183], a Python library for unsupervised topic modeling and natural language processing. Moreover, we developed functions to query and explore biomedical concepts and their relations in the resulting embedding and, based on these functions, a web service that facilitates this exploration through an interactive user interface. We describe the steps of this work in more detail in the following sections.

Figure 44. Embedding development workflow. Text processing starts by reading the abstracts as sentences. Cleaning and normalization strategies are applied during the preprocessing phase. The preprocessed corpus is used for training. The output of the training model consists of the word vectors and the vocabulary of unique words in the corpus.
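To make the end point of this workflow concrete, the training step can be previewed with a minimal Gensim call; the toy corpus and the hyperparameter values below are illustrative rather than the settings used in this work:

from gensim.models import Word2Vec

# A toy corpus stands in for the streamed MEDLINE abstracts; in the
# real pipeline this is the preprocessed, tokenized corpus.
corpus = [
    ['cell', 'biology', 'studies', 'the', 'cell'],
    ['lung_cancer', 'is', 'a', 'malignant', 'tumor'],
    ['vitamin_a', 'supports', 'vision', 'and', 'immunity'],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep all words in this toy example
    workers=4,        # parallel training threads
)

# The trained model holds the word vectors and the vocabulary of
# unique words seen in the corpus.
model.save('medline_embedding.model')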


6.2.4.1. Text Preprocessing

Text data needs to be cleaned and normalized before being fed into machine learning models. This process is known as ‘text preprocessing’. The preprocessing phase of our work consists of multiple steps. It starts by reading text files from a directory that is given as an input argument.

We used 166 files containing in total 16,558,093 abstracts. An abstract is a summary that provides readers with a quick and direct overview of an article. It is distilled from the key statements of the introduction, methods, results, and discussion sections, so it should contain the important information that helps readers learn about a topic of interest and infer relevant information.

Each file consists of several abstracts, one abstract per line. The directory containing all the files is the input of the preprocessing phase; it may only contain .bz2, .gz, and plain text files. The word2vec model in Gensim requires that the input provide sentences sequentially. This means that nothing needs to be kept in memory: we can provide one sentence, process it, forget it, and load the next. Instead of loading everything into an in-memory list, the input is therefore processed file by file and line by line through an iterator, inside which any further preprocessing can be done. Words must already be preprocessed and separated by whitespace, so that the corpus consists of lists of words to be fed into the word2vec model, as sketched below.
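A minimal sketch of such a streaming corpus reader is shown below. The class name and the plain whitespace splitting are illustrative; in our pipeline, the preprocessing steps described in the following paragraphs are applied inside the loop:

import bz2
import gzip
import os

class AbstractCorpus:
    """Stream abstracts one at a time, so the full corpus never
    has to fit into memory."""

    def __init__(self, directory):
        self.directory = directory

    def _open(self, path):
        # Choose the reader that matches the file extension.
        if path.endswith('.bz2'):
            return bz2.open(path, 'rt', encoding='utf-8')
        if path.endswith('.gz'):
            return gzip.open(path, 'rt', encoding='utf-8')
        return open(path, 'rt', encoding='utf-8')

    def __iter__(self):
        for name in sorted(os.listdir(self.directory)):
            with self._open(os.path.join(self.directory, name)) as handle:
                for line in handle:  # one abstract per line
                    # Placeholder preprocessing: lowercase and split at
                    # whitespace; tokenization, lemmatization, and
                    # filtering would be plugged in here.
                    yield line.lower().split()

Because the iterator can be restarted, Gensim can stream over the corpus several times, once to build the vocabulary and once per training pass.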

Each abstract is considered one sentence. A sentence is divided into a list of tokens/words, a process known as ‘tokenization’. Tokenization is a crucial part of converting text into the numerical data that machine learning models need for training and prediction. The output of word tokenization is a list of words for each sentence, a format that machine learning applications can understand and process more easily. The word lists are then provided as input for further cleaning steps. We used the module ‘word_tokenize’ from the NLTK library [184] to split the sentences. NLTK stands for Natural Language Toolkit; it is one of the most powerful Python packages and covers the most common natural language processing tasks such as part-of-speech tagging, tokenization, named entity recognition, and sentiment analysis.

The ‘word_tokenize’ module splits a text fragment at whitespace and also treats each punctuation mark as a separate token, which facilitates the removal of punctuation later if desired.


Example (sentence source: https://www.tocris.com/cell-biology#:~:text=Cell%20biology%20is%20the%20study,to%20understanding%20many%20disease%20states):

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('Cell biology is the study of the formation, structure, function, communication and death of a cell')
['Cell', 'biology', 'is', 'the', 'study', 'of', 'the', 'formation', ',', 'structure', ',', 'function', ',', 'communication', 'and', 'death', 'of', 'a', 'cell']

After this step, the normalization procedures start; they mainly consist of lowercase transformation and lemmatization. Converting words to lowercase prevents the same word from being treated as two different words during training. Lemmatization reduces a word to its morphological root, known as the lemma, by removing inflectional endings while taking the context into account.
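As an illustration, lowercasing and lemmatization can be combined as follows. We use NLTK’s WordNetLemmatizer here as one common choice (it requires the ‘wordnet’ corpus to be downloaded); by default it treats every token as a noun, and passing part-of-speech tags would improve the results:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

tokens = ['Cells', 'Formation', 'studies']
# Lowercase first so that 'Cells' and 'cells' share one vocabulary
# entry, then reduce each token to its lemma.
normalized = [lemmatizer.lemmatize(token.lower()) for token in tokens]
print(normalized)  # ['cell', 'formation', 'study']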

In biomedicine, key terms often consist of several words, like ‘zinc finger protein’. To create a distributed representation of words that captures their meanings, it is therefore important to identify not just single words but also multi-word phrases.

Generating phrases from sentences is thus a fundamental step. In Gensim [183], the ‘Phrases’ module uses a simple statistical model in which phrases are created based on relative n-gram counts using the following formula:

score(wᵢ, wⱼ) = (count(wᵢ, wⱼ) − δ) / (count(wᵢ) ⋅ count(wⱼ))

▪ δ is a discounting coefficient that prevents the formation of phrases composed of very rare words.


This scoring function detects words that frequently appear together, using tunable thresholds to decide which token pairs to merge (Figure 45). Bigrams (two-word phrases) whose score exceeds a chosen threshold are used as phrases. Example:

Figure 45. Bigram identification example. “Bigram” is a function in Gensim to create phrases. “sentences” is a list of 3 sentences. “min_count” is the minimum total collected count of a bigram. “threshold” is the minimum score for a bigram to be taken into account. “lung cancer” is the identified bigram/phrase, which appeared in all 3 sentences.
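A minimal sketch of this bigram detection with Gensim’s ‘Phrases’ module follows; the toy sentences and the min_count and threshold values are illustrative:

from gensim.models.phrases import Phrases

# Three toy sentences in which 'lung' and 'cancer' co-occur consistently.
sentences = [
    ['lung', 'cancer', 'is', 'a', 'common', 'cancer'],
    ['smoking', 'can', 'cause', 'lung', 'cancer'],
    ['lung', 'cancer', 'screening', 'reduces', 'mortality'],
]

# min_count: minimum total collected count of a bigram.
# threshold: minimum score for a bigram to be accepted.
bigram = Phrases(sentences, min_count=1, threshold=1)

# Frequently co-occurring token pairs are joined by an underscore.
print(bigram[sentences[0]])
# ['lung_cancer', 'is', 'a', 'common', 'cancer']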

Subsequently, further cleaning is done by filtering out uninformative forms and words such as stop words, punctuation, and numerical tokens (Figure 46). All these filtering steps are optional and depend on the purpose of the embedding.

Stop words like “am, a, is, are, this, an, the, etc.” are commonly used words that appear very frequently in text without adding much meaning to the context of a sentence. Keeping them can introduce noise and cost memory and time. They can easily be removed by checking tokens against a stored list of stop words; the NLTK (Natural Language Toolkit) library in Python already provides such lists for several languages, as shown below.
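For example, using NLTK’s English stop word list (which requires the ‘stopwords’ corpus to be downloaded):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tokens = ['this', 'is', 'the', 'study', 'of', 'a', 'cell']
# Keep only the tokens that are not in the stop word list.
filtered = [token for token in tokens if token not in stop_words]
print(filtered)  # ['study', 'cell']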

Figure 46. Preprocessing procedures.

Usually, stop words are removed before identifying phrases. In biomedical text, however, some stop words are challenging to handle because they can form part of compound terms. For example, “Vitamin A” is an open compound term in which the stop word ‘A’ is an essential part of the meaning. Vitamin A is the name of a group of fat-soluble retinoids that are stored in the liver, including retinol, retinal, and retinyl esters [185][186], and it has specific roles in maintaining certain body functions. Removing ‘A’ leaves a generic organic molecule of unspecified type and makes the model classify the context according to the word ‘Vitamin’ alone. In our processing, we treated this as a sensitive case and dealt with it by changing the conventional order of the preprocessing steps: to preserve the meaning of such compositions, we identified them as phrases before removing stop words, as illustrated below.
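The following toy example illustrates the effect of this reordering; the token lists are fabricated for illustration:

# Phrase detection runs BEFORE stop-word removal, so the stop word
# 'a' survives inside the compound term 'vitamin_a'.
tokens = ['vitamin', 'a', 'deficiency', 'is', 'a', 'risk', 'factor']

# After phrase detection (e.g. with Gensim's Phrases module):
tokens = ['vitamin_a', 'deficiency', 'is', 'a', 'risk', 'factor']

# Stop-word removal now only strips the free-standing 'a' and 'is'.
filtered = [token for token in tokens if token not in {'a', 'is'}]
print(filtered)  # ['vitamin_a', 'deficiency', 'risk', 'factor']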

Another case we tackled, which likewise depends on the intended purpose, is synonymy. Since one of our aims is to find the biological entities that are similar to other entities, in order to derive biological meanings from the relationships between them based on context and to uncover hidden relationships, we substituted the synonyms of biological entities of multiple types (genes, diseases, drugs, and pathways) with their preferred (main) terms using external resources. This procedure was done during the preprocessing phase, before training, to reduce spurious similarity between words that share a similar context but are not synonyms. For genes we used HUGO [187], for diseases we used MeSH® [137] preferred terms, and for drugs we used DrugBank [188][189] terms. This was done for one of the versions. Hence, we generated an embedding version that covers synonyms of biomedical concepts (Embedding_v1) and another version in which synonymous terms of biological entities were excluded (Embedding_v2), as sketched below.
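A minimal sketch of this substitution step follows; the mapping entries are hypothetical examples rather than actual HUGO, MeSH®, or DrugBank records:

# Hypothetical synonym-to-preferred-term mapping compiled from external
# resources (HUGO for genes, MeSH for diseases, DrugBank for drugs).
preferred = {
    'p53': 'tp53',                            # gene synonym -> approved symbol
    'heart_attack': 'myocardial_infarction',  # disease synonym -> MeSH term
    'acetylsalicylic_acid': 'aspirin',        # drug synonym -> DrugBank name
}

def normalize(tokens):
    # Replace each token by its preferred term, if one is known.
    return [preferred.get(token, token) for token in tokens]

print(normalize(['p53', 'mutations', 'drive', 'tumor', 'growth']))
# ['tp53', 'mutations', 'drive', 'tumor', 'growth']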