While waiting on several word-counting scripts to finish counting, I picked up my cancerCounter script to count something else. This time, I wanted to see which organism is more popular and more frequently mentioned in biomedical studies: the ever-present Drosophila melanogaster, aka the common fruit fly, or the aptly named Caenorhabditis elegans (one cannot deny that the 1 mm-long worm has quite the elegant wiggle). Both are popular model organisms in biomedical research.
Both have a lot going for them:
– Elegans was the first multicellular organism to have its entire genome sequenced (go worm!)
– The worm reproduces and mutates quickly and easily
The fruit fly, on the other hand, makes quite a suitable lab rat as well:
– Drosophila breeds easily
– Does not need much space or care
– Has to pay for invading my kitchen each year during summer
I started counting the occurrence of ‘drosophila melanogaster’ or ‘d. melanogaster’, and of ‘caenorhabditis elegans’ or ‘c. elegans’, in the lowercased article body of my 99,000-and-something BioMedCentral article corpus, and took a looksy. First comes the total number of articles published per year, with the number of articles mentioning the fruit fly/worm:
As we can see, worryingly, scientists hardly spend enough time performing research with worms and fruit flies. Since 2003, they do consistently play more with the worms than with fruit flies, though. But that’s hard to see, so let’s ditch the total articles:
When we subtract the Drosophila articles from the Elegans articles, we can see how much the worm has on the fruit fly. The red bars represent by how many articles Elegans wins over Drosophila, and the blue bars indicate by how many articles Drosophila wins over Elegans.
But absolute numbers are not what we’re looking for. As we have seen in the first graph, the number of articles is far from evenly distributed over the years. So let’s look at the difference between both organisms as a ratio:
This evens out some of the bigger differences in the previous graph; Drosophila had ‘only’ a +5 win over Elegans in 2001, but relatively this is a bigger victory than Elegans’ +34 win in 2006, and even its +79 victory in 2009.
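The counting behind these graphs boils down to a few lines. Below is a minimal sketch, assuming the corpus is available as (year, body) pairs. The function names `count_mentions` and `worm_lead`, the toy articles, and the choice to divide the difference by the number of articles published that year are assumptions for illustration, not necessarily what the cancerCounter script does.

```python
from collections import defaultdict

FLY_TERMS = ('drosophila melanogaster', 'd. melanogaster')
WORM_TERMS = ('caenorhabditis elegans', 'c. elegans')


def count_mentions(articles):
    """Count, per year, all articles and those mentioning the fly or the worm.

    `articles` is an iterable of (year, body) pairs (hypothetical format).
    """
    counts = defaultdict(lambda: {'total': 0, 'fly': 0, 'worm': 0})
    for year, body in articles:
        text = body.lower()
        counts[year]['total'] += 1
        if any(term in text for term in FLY_TERMS):
            counts[year]['fly'] += 1
        if any(term in text for term in WORM_TERMS):
            counts[year]['worm'] += 1
    return dict(counts)


def worm_lead(counts):
    """Per year: (worm minus fly articles, that difference relative to all articles)."""
    return {
        year: (c['worm'] - c['fly'], (c['worm'] - c['fly']) / float(c['total']))
        for year, c in counts.items()
    }


# Toy input, invented for illustration only (not the BioMedCentral figures).
articles = [
    (2001, 'We studied D. melanogaster wing development.'),
    (2001, 'A study of human cell lines.'),
    (2006, 'Caenorhabditis elegans as a model for ageing.'),
    (2006, 'RNAi screens in C. elegans and Drosophila melanogaster.'),
]
counts = count_mentions(articles)
lead = worm_lead(counts)
```

A positive first value means the worm wins that year, a negative one means the fly wins. The second value puts that win in proportion to how much was published in that year.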
“The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.”[wikipedia]
It is also the weight I use to measure similarity between texts, for these two tasks of my thesis project:
– Step 3: measure the similarity of a cyttron-db entry to a concept-description from an ontology. This will allow me to find concepts in the text that do not appear literally.
– Step 5: relate concepts that come from different ontologies, by measuring how similar the text surrounding one concept found in the text is to the text surrounding another found concept.
As mentioned before, I am using the excellent Gensim “vector space modelling for humans” package, which takes all the complicated mathematics off my hands (like the scary and intimidating formula up top!). Perfect for me, as I’m not a mathematician, nor a computational linguist, nor a statistician, but I AM a human, who wants to work with a solid and proven method of similarity measures and feature extraction for texts. Since I am what I am, I won’t attempt to explain any of the inner workings of bag-of-words models, vector space, and TF-IDF measures. Sorry, there are much better places for that. I’ll simply show how I made Gensim work for me (assuming it does).
The first step is to create a training corpus. The training corpus defines the features of the text – the words that will be considered ‘important’ when looking at a text. The training corpus needs to be from the same domain as the target application: in my case the biomedical domain.
I wrote a simple script using lxml2 to parse the individual files: extracting all plaintext from the article body, cleaning it and storing it in a new text file (1 article per line) for later processing. The cleaning process consists of 3 steps: tokenizing articles (aka breaking an article up into words), filtering out common stopwords, and finally stemming the remaining words. I chose to include stemming in order to unify such words as ‘hippocampal’ and ‘hippocampus’ (stemming returns the ‘root’ of a word). As I stem both the training corpus and the strings that need to be compared, it is not a disaster if words get stemmed incorrectly: in the end I don’t need to make sense of the stemmed words, I only need them for counting. The plaintext file my script created is 650 MB (vs 8.8 GB for the uncompressed XML files)!
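The extraction step could look roughly like the sketch below. It uses the standard library's `xml.etree.ElementTree` rather than the parser mentioned above, so that it runs without extra dependencies. The tag name `body`, the function names and the sample document are assumptions for illustration; the real BioMedCentral files may use different element names.

```python
import xml.etree.ElementTree as ET


def body_text(xml_string, body_tag='body'):
    """Return the plain text inside the first body element, on a single line."""
    root = ET.fromstring(xml_string)
    body = root if root.tag == body_tag else root.find('.//' + body_tag)
    if body is None:
        return ''
    # itertext() yields the text of the element and all its descendants.
    # Joining with spaces avoids gluing adjacent paragraphs together;
    # split()/join() then collapses newlines and repeated whitespace.
    return ' '.join(' '.join(body.itertext()).split())


def write_corpus(xml_strings, out_path):
    """Write one article per line to out_path."""
    with open(out_path, 'w') as out:
        for xml_string in xml_strings:
            out.write(body_text(xml_string) + '\n')


# Invented sample document, only to show the behaviour.
sample = """<art><fm><title>Ignored</title></fm>
<body><sec><p>The hippocampus
is <i>important</i>.</p></sec></body></art>"""
line = body_text(sample)
```

The stray space this leaves before punctuation is harmless here, because the text is tokenized in the next step anyway.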
The cleaning of the article is pretty straightforward, using pre-cooked NLTK modules: the WordPunct tokenizer, the set of English stopwords and NLTK’s implementation of the Porter stemmer. For the quality of the similarity measurement it is important to follow the exact same cleaning procedure with the strings I want to compare, so I use the same function for both the corpus preparation and the comparison strings:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final
Creating a training corpus in Gensim
Gensim’s documentation is very extensive, and I can recommend going through the tutorials if you want to get an idea of the possibilities. But I couldn’t find much documentation on how to do simple string-to-string comparisons, so I wrote down what I did (and errrm yes, it’s pretty much exactly the same as the string-to-index querying you can find in the Gensim tutorials :p):
1. Create a ‘dictionary’ of the training corpus’ raw text:
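The dictionary maps each word to an integer id (and keeps track of how many documents each word occurs in), and will be used to convert texts to vector space at a later stage:
>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
>>> print(dictionary)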
Dictionary(1049403 unique tokens)
2. Convert the training corpus to vector space:
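>>> class MyCorpus(object):
...     def __iter__(self):
...         for line in open('corpus.txt'):
...             # one article per line, tokens separated by whitespace
...             yield dictionary.doc2bow(line.lower().split())
...
>>> corpus = MyCorpus()
>>> corpora.MmCorpus.serialize('corpus.mm', corpus) # Save corpus to disk
>>> corpus = corpora.MmCorpus('corpus.mm') # Load corpus
>>> print(corpus)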
MmCorpus(99432 documents, 1049403 features, 39172124 non-zero entries)
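For readers wondering what "converting to vector space" means here: each document becomes a sparse list of (word id, count) pairs. The sketch below shows the idea in plain Python. It illustrates the concept only and is not Gensim's implementation; the helper names and the toy tokens are made up.

```python
def build_vocabulary(documents):
    """Map every distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are dropped."""
    counts = {}
    for token in tokens:
        if token in token2id:
            token_id = token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


# Toy, already-cleaned documents (invented for illustration).
docs = [['worm', 'genom', 'worm'], ['fli', 'genom']]
token2id = build_vocabulary(docs)
bow = to_bow(['worm', 'worm', 'fli', 'kitchen'], token2id)
```

Words that never occurred in the training corpus (here `kitchen`) simply do not show up in the vector, which is why the training corpus should come from the same domain as the texts being compared.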
Thankfully it’s possible to store the generated corpus, dictionary and tfidf to disk: parsing all these documents takes quite a while on my computer. That’s it for the preparation of the training corpus!
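The TF-IDF model that gets stored and reused later is not shown in the steps above. In Gensim it is typically initialized from the bag-of-words corpus with `models.TfidfModel`. The sketch below shows one way to do that; the function name `build_tfidf` and the three file names are placeholders, and it assumes the dictionary was saved earlier with `dictionary.save('corpus.dict')`.

```python
def build_tfidf(corpus_path='corpus.mm', dict_path='corpus.dict',
                tfidf_path='corpus.tfidf'):
    """Initialize a TF-IDF model from the serialized corpus and save it.

    The three file names are placeholders, not the author's actual paths.
    """
    # Imported here so the sketch can be loaded without Gensim installed.
    from gensim import corpora, models

    corpus = corpora.MmCorpus(corpus_path)           # bag-of-words corpus from step 2
    dictionary = corpora.Dictionary.load(dict_path)  # assumes dictionary.save(dict_path)
    tfidf = models.TfidfModel(corpus)                # learns document frequencies
    tfidf.save(tfidf_path)                           # reload with models.TfidfModel.load()
    return dictionary, tfidf
```

After this, `tfidf[bow_vector]` converts any bag-of-words vector to TF-IDF space.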
Comparing two strings
Now whenever I want to compare two strings, using features gathered from the training corpus, I need to:
Clean both strings in the same way I cleaned the articles in the corpus (NLTK tokenization, stopword filter and stemming) » cleanDoc(string)
Convert both strings to vector-space using the dictionary generated from the training corpus » dictionary.doc2bow(string)
Convert both vector-space representations of the strings to TF-IDF space, using the TF-IDF model initialized earlier » tfidf[string]
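When both strings are prepared, all that is left is to compare them, by creating an ‘index’ (the reference string) and a ‘query’ (the other string). Order doesn’t matter.
def compareDoc(doc1, doc2):
    tfidf1 = tfidf[dictionary.doc2bow(cleanDoc(doc1))]
    tfidf2 = tfidf[dictionary.doc2bow(cleanDoc(doc2))]
    index = similarities.MatrixSimilarity([tfidf1], num_features=len(dictionary))
    sim = index[tfidf2][0]  # only one document in the index, so take its score
    print(str(round(sim * 100, 2)) + '% similar')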
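The score that `MatrixSimilarity` returns is the cosine similarity between the query vector and each indexed vector, which is why the order of the two strings does not matter. The sketch below computes the same measure in plain Python for two sparse vectors. The function name and the toy vectors are invented for illustration, and this is not Gensim's own code.

```python
import math


def cosine(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (id, weight) lists."""
    w1, w2 = dict(vec1), dict(vec2)
    dot = sum(weight * w2[i] for i, weight in w1.items() if i in w2)
    norm1 = math.sqrt(sum(w * w for w in w1.values()))
    norm2 = math.sqrt(sum(w * w for w in w2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm1 * norm2)


# Toy TF-IDF vectors, invented for illustration: they share one of two words.
a = [(0, 1.0), (2, 1.0)]
b = [(0, 1.0), (1, 1.0)]
score_ab = cosine(a, b)
score_ba = cosine(b, a)
```

Two texts with no words in common score 0, identical texts score 1. The `compareDoc` call prints this score as a percentage.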
Resulting in, for example, the comparison of the description of “Alzheimer’s disease” and “Cognitive disease” in the Human Disease (DOID) ontology:
>>> compareDoc("""A dementia that results in progressive memory loss, impaired thinking,
disorientation, and changes in personality and mood starting in late middle age and leads
in advanced cases to a profound decline in cognitive and physical functioning and is marked
histologically by the degeneration of brain neurons especially in the cerebral cortex and
by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is
characterized by memory lapses, confusion, emotional instability and progressive loss of
mental ability.""","""A disease of mental health that affects cognitive functions including
memory processing, perception and problem solving.""")
Or another example: the Wikipedia article of “Alzheimer’s disease” compared to the ontology description of “Alzheimer’s disease”:
alzheimer in wikiTxt
>>> compareDoc(wikiTxt,"""A dementia that results in progressive memory loss, impaired thinking,
disorientation, and changes in personality and mood starting in late middle age and leads in
advanced cases to a profound decline in cognitive and physical functioning and is marked
histologically by the degeneration of brain neurons especially in the cerebral cortex and by
the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized
by memory lapses, confusion, emotional instability and progressive loss of mental ability.""")
Final example: the top 5 most similar ontology concepts to the Wikipedia page of “Alzheimer’s disease”:
Now the second task (matching a string to all the descriptions from my ontologies) is much the same process, with the only difference that I need to use the similarities.Similarity object when creating the index (of the descriptions): the MatrixSimilarity object resides fully in RAM, the Similarity object on disk.
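A sketch of what such a disk-backed query could look like with Gensim's `similarities.Similarity` class. The function name `top_concepts`, the index prefix and the argument names are placeholders; `description_vectors` is assumed to be a list of TF-IDF vectors, one per ontology description, prepared in the same way as the comparison strings.

```python
def top_concepts(query_tfidf, description_vectors, num_features,
                 index_prefix='descriptions.index', n=5):
    """Return the n best (description position, similarity) pairs for a query.

    `index_prefix` is a placeholder path prefix for the index shards
    that the Similarity object writes to disk.
    """
    # Imported here so the sketch can be loaded without Gensim installed.
    from gensim import similarities

    index = similarities.Similarity(index_prefix, description_vectors,
                                    num_features=num_features, num_best=n)
    # With num_best set, a query returns the best matches, most similar first.
    return index[query_tfidf]
```

The positions in the result refer to the order of `description_vectors`, so a separate list of concept names in that same order is needed to turn them back into ontology concepts.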
I am pretty confident about these preliminary results. It all seems to work as it should, and should be much more robust than my earlier attempts at similarity measurement using difflib and some crummy homegrown keyword-extraction and comparison (which I will still use for generating synonyms, crumminess works for that).