Wordle of the 200 most frequent words in tweets with the hashtag #OccupyAmsterdam. Made from 5,239 tweets posted between Saturday 8 October, 09:55, and 16 October, 15:50.
Manually filtered for nicknames and meaningless words. Here is the list of the 1000 most frequent words: OccupyAmsterdam-woorden.
Computing string similarity with TF-IDF and Python
“The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.”[wikipedia]
It is also the weight I use to measure similarity between texts, for these two tasks of my thesis project (click for pic!):
- Step 3: measure the similarity of a cyttron-db entry to a concept-description from an ontology. This will allow me to find concepts in the text that do not appear literally.
- Step 5: to be able to relate concepts which come from different ontologies. By measuring how similar the text surrounding a concept found in the text is compared to another found concept.
As mentioned before, I am using the excellent Gensim “vector space modelling for humans” package, which takes all the complicated mathematics off my hands (like the scary and intimidating formula up top!). Perfect for me, as I’m not a mathematician, nor a computational linguist, nor a statistician, but I AM a human, who wants to work with a solid and proven method of similarity measures and feature extraction for texts. Since I am what I am, I won’t attempt to explain any of the inner workings of bag-of-words models, vector space, and TF-IDF measures, sorry, there are much better places for that. I’ll simply show how I made Gensim work for me (assuming it does).
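That said, a toy version of the weighting fits in a few lines and may make the rest easier to follow. The sketch below implements one common tf-idf variant (raw term count times log2 of the number of documents divided by the document frequency) on a made-up, pre-tokenized corpus. The function name `tf_idf` and the example documents are my own inventions for illustration; this is not what Gensim runs internally, and its numbers will differ, for instance because it can normalize the resulting vectors.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in document) * log2(N / document frequency)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log2(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (made up), already tokenized
docs = [
    ["cell", "death", "cell"],
    ["cell", "membrane"],
    ["cell", "protein", "membrane"],
    ["cell", "protein", "death", "apoptosis"],
]
weights = tf_idf(docs)
print(weights[3])
```

A word that occurs in every document (‘cell’) ends up with weight 0, while the rarest word (‘apoptosis’) gets the highest weight: frequent-everywhere words say little about any single document.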
The first step is to create a training corpus. The training corpus defines the features of the text – the words that will be considered ‘important’ when looking at a text. The training corpus needs to be from the same domain as the target application: in my case the biomedical domain.
At first I was looking at extracting a bunch of relevant Wikipedia articles (all articles from Wikipedia’s Biology category) to use as a training corpus. But then I came across something better: the Open Access BioMed Central full-text corpus. The corpus consists of over 100,000 articles, weighing in at more than 8GB of XML documents.
I wrote a simple script using lxml2 to parse the individual files: extracting all plaintext from the article body, cleaning it and storing it in a new text file (1 article per line) for later processing. The cleaning process consists of 3 steps: tokenizing articles (aka breaking an article up into words), filtering out common stopwords, and finally stemming the remaining words. I chose to include stemming in order to unify such words as ‘hippocampal’ and ‘hippocampus’ (stemming returns the ‘root’ of a word). As I stem both the training corpus and the strings that need to be compared, it is not a disaster if words get stemmed incorrectly: in the end I don’t need to make sense of the stemmed words, I only need them for counting. The plaintext file my script created is 650MB (vs 8.8GB for the uncompressed XML files)!
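To give an idea of the extraction step, here is a minimal sketch that pulls the plain text out of an article body and squeezes it onto one line. It uses the standard library’s xml.etree.ElementTree rather than the parser from my script, and both the `body` tag name and the sample XML are assumptions made for illustration; the real corpus files may be structured differently.

```python
import xml.etree.ElementTree as ET

def article_text(xml_string):
    """Return all text inside the <body> element as one line of plain text."""
    root = ET.fromstring(xml_string)
    body = root.find(".//body")
    if body is None:
        return ""
    # itertext() walks through nested tags and yields only the text parts
    text = "".join(body.itertext())
    # Collapse newlines and repeated spaces so one article fits on one line
    return " ".join(text.split())

# Made-up sample article
sample = """<art>
  <fm><title>Ignored title</title></fm>
  <body>
    <sec><p>Apoptosis is <it>programmed</it> cell death.</p></sec>
    <sec><p>It occurs in multicellular organisms.</p></sec>
  </body>
</art>"""

line = article_text(sample)
print(line)
```

Writing one such line per article to a text file gives the ‘1 article per line’ layout that the later steps read back in.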
The cleaning of the article is pretty straightforward, using pre-cooked NLTK modules: the WordPunct tokenizer, the set of English stopwords and NLTK’s implementation of the Porter stemmer. For the quality of the similarity measurement it is important to follow the exact same cleaning procedure with the strings I want to compare, so I use the same function for both the corpus preparation and the comparison strings:
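import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = WordPunctTokenizer().tokenize(doc)
    # Lowercase, drop stopwords and very short tokens
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    # Stem what is left
    final = [stemmer.stem(word) for word in clean]
    return final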
Creating a training corpus in Gensim
Gensim’s documentation is very extensive, and I can recommend going through the tutorials if you want to get an idea of the possibilities. But I couldn’t find much documentation on how to do simple string-to-string comparisons, so I wrote down what I did (and errrm yes, it’s pretty much exactly the same as string-to-index querying you can find in the Gensim tutorials :p):
1. Create a ‘dictionary’ of the training corpus’ raw text:
The dictionary maps each unique word to an integer id (and keeps track of the document frequency of each word) and will be used to convert texts to vector space at a later stage:
>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
>>> print(dictionary)
Dictionary(1049403 unique tokens)
2. Convert the training corpus to vector space:
class MyCorpus(object):
    def __iter__(self):
        # Stream the corpus one article (line) at a time instead of loading it into RAM
        for line in open('corpus.txt'):
            yield dictionary.doc2bow(line.lower().split())

>>> corpus = MyCorpus()
>>> corpora.MmCorpus.serialize('corpus.mm', corpus) # Save corpus to disk
>>> corpus = corpora.MmCorpus('corpus.mm') # Load corpus
>>> print(corpus)
MmCorpus(99432 documents, 1049403 features, 39172124 non-zero entries)
3. Initialize the TF-IDF model:
>>> tfidf = models.TfidfModel(corpus)
>>> print(tfidf)
TfidfModel(num_docs=99432, num_nnz=39172124)
Thankfully it’s possible to store the generated corpus, dictionary and tfidf to disk: parsing all these documents takes quite a while on my computer. That’s it for the preparation of the training corpus!
Comparing two strings
Now whenever I want to compare two strings, using features gathered from the training corpus, I need to:
- Clean both strings in the same way I cleaned the articles in the corpus (NLTK tokenization + stopword filter + stemming) » cleanDoc(string)
- Convert both strings to vector-space using the dictionary generated from the training corpus » dictionary.doc2bow(string)
- Convert both vector-space representations of the strings to TF-IDF space, using the TF-IDF model initialized earlier » tfidf[string]
When both strings are prepared, all that is left is to compare them, by creating an ‘index’ (the reference string) and a ‘query’ (the other string). Order doesn’t matter.
index = similarities.MatrixSimilarity([tfidf1], num_features=len(dictionary))
sim = index[tfidf2][0]  # the index holds a single document, so take its one score
print(str(round(sim*100, 2)) + '% similar')
Resulting in, for example, the comparison of the description of “Alzheimer’s disease” and “Cognitive disease” in the Human Disease (DOID) ontology:
>>> compareDoc("""A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability.""",
...     """A disease of mental health that affects cognitive functions including memory processing, perception and problem solving.""")
23.29% similar
Or another example: the Wikipedia article of “Alzheimer’s disease” compared to the ontology description of “Alzheimer’s disease”:
>>> wikiGet('alzheimer')
alzheimer in wikiTxt
>>> compareDoc(wikiTxt,"""A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability.""")
31.95% similar
Final example: the top 5 most similar ontology concepts to the Wikipedia page of “Alzheimer’s disease”:
>>> descMatch(wikiAlz)
Label: Alzheimer's disease
Similarity: 31.9990843534
Label: vascular dementia
Similarity: 28.0893445015
Label: amyloid deposition
Similarity: 25.6860613823
Label: cognitive disease
Similarity: 18.7662974
Label: dementia
Similarity: 18.0801317096
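Under the hood, the similarity Gensim reports here is the cosine similarity between the two vectors. Below is a minimal pure-Python sketch of that measure, assuming sparse vectors in the same list-of-(id, weight) format that doc2bow and the TF-IDF model produce. The function name `cosine` and the two example vectors are made up for illustration.

```python
import math

def cosine(vec1, vec2):
    """Cosine similarity of two sparse vectors given as [(feature_id, weight), ...]."""
    d1, d2 = dict(vec1), dict(vec2)
    # Only features present in both vectors contribute to the dot product
    dot = sum(w * d2[f] for f, w in d1.items() if f in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm1 * norm2)

# Two made-up vectors that share only feature 1
a = [(0, 3.0), (1, 4.0)]
b = [(1, 4.0), (2, 3.0)]
print(str(round(cosine(a, b) * 100, 2)) + '% similar')
```

Because the formula is symmetric, swapping the two arguments gives the same score, which is why it doesn’t matter which string ends up as the index and which as the query.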
Now the second task (matching a string to all the descriptions from my ontologies) is much the same process, with the only difference that I need to use the similarities.Similarity object when creating the index (of the descriptions): the MatrixSimilarity object resides fully in RAM, the Similarity object on disk.
I am pretty confident about these preliminary results. It all seems to work as it should, and should be much more robust than my earlier attempts at similarity measurement using difflib and some crummy homegrown keyword-extraction and comparison (which I will still use for generating synonyms, crumminess works for that).
Simple keyword extraction in Python: choices, choices.
As explained in an earlier post, I am working on a simple method of extracting ‘important words’ from a text-entry. The methods I am using at the moment are frequency distributions and word collocations. I’ve bumped into some issues regarding finetuning my methods. Read on for a short explanation of my approaches, and some issues regarding them.
Frequency Distribution: POS-tagging y/n?
Extracting keywords by frequency distribution is nothing more than counting words and sorting the list of words by occurrence. Before doing this, I filter stopwords from the text entry. The short explanation on how I’m doing this (sourcecode available at github):
» Tokenize the text (using NLTK’s WordPunctTokenizer)
» Lowercase all the words
» ‘Clean’ the list by removing common stopwords from the list (using NLTK’s English stopwords-list)
This is straightforward enough, an example of the results (from the Wikipedia page of ‘Apoptosis’):
>>> cyttron.wikiGet('Apoptosis')
Apoptosis in wikiTxt
>>> freqWords(cyttron.wikiTxt,15)
['apoptosis', 'cell', '160', 'apoptotic', 'cells', 'caspase', 'death', '.&#', 'proteins', 'tnf', 'bcl', 'protein', 'also', 'caspases']
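The tokenize / lowercase / filter steps can be sketched in a few lines. This is a simplified stand-in, not my actual freqWords function: it uses a regular expression instead of NLTK’s WordPunctTokenizer and a tiny hand-picked stopword set instead of NLTK’s English list, and the name `freq_words` and the sample text are made up.

```python
import re
from collections import Counter

# Tiny stand-in for NLTK's English stopword list (assumption, far from complete)
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "by", "that"}

def freq_words(text, n):
    """Return the n most frequent non-stopword tokens of text."""
    # Lowercase, then split into runs of word characters
    tokens = re.findall(r"\w+", text.lower())
    clean = [t for t in tokens if t not in STOPWORDS]
    return [word for word, count in Counter(clean).most_common(n)]

text = ("Apoptosis is the process of programmed cell death. "
        "In apoptosis, a cell is dismantled by caspases, and the cell dies.")
top = freq_words(text, 2)
print(top)
```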
Earlier I was thinking about using POS-tagging (Part-Of-Speech tagging to identify word-types) in order to only extract frequently occurring nouns. I figured losing relevant adjectives (such as ‘red’ in red blood cell) could be compensated by the word collocations extraction. POS-tagging the tokenized text, and retrieving only the most frequent nouns results in:
>>> freqNouns(cyttron.wikiTxt,15)
['apoptosis', 'cell', 'caspase', 'death', 'protein', 'tnf', 'pathway', 'activation', 'membrane', 'p53', 'response', 'family', 'gene', 'greek']
My problem here is I’m not sure which is ‘better’ (if either of the two), or if I should maybe use a combination of both. Also, I haven’t decided yet how to handle non-alphabetic words. Initially I planned on using regular expressions to filter non-alphabetic strings, but I figured later that it wouldn’t make sense in my case. In the above example, this would omit retrieving ‘p53’: a tumor suppressor protein (P53), which is very relevant.
While playing around with POS-tagging earlier, I noticed the precision was not quite high enough to perform chunk extractions (by looking for specific phrases / grammatical constructions). Extracting only nouns does seem to do the job quite well, even if I still miss some and get some false positives.
Word Collocations: Stopword filtering y/n?
A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. I generate bi- and trigram word collocations, meaning ‘2-word strings’ and ‘3-word strings’. My issue here is whether or not to use stopword filtering. Here are the results of the word collocation function on the same Wikipedia page, the 1st list being the bigram collocations, the 2nd being the trigrams. Example without stopword filtering:
>>> wordCollo(cyttron.wikiTxt,10,clean=False)
['such as', 'cell death', 'of the', 'due to', 'leads to', 'programmed cell', 'has been', 'bone marrow', 'have been', 'an increase']
['adp ribose polymerase', 'amino acid composition', 'anatomist walther flemming', 'boston biologist robert', 'break itself down', 'combining forms preceded', 'count falls below', 'german scientist carl', 'homologous antagonist killer', 'mdm2 complexes displaces']
Example with stopword filtering:
>>> wordCollo(cyttron.wikiTxt,10,clean=True)
['cell death', 'programmed cell', 'bone marrow', 'university aberdeen', 'calcium concentration', 'adenovirus e1b', 'british journal', 'citation needed', 'highly conserved', 'nitric oxide']
['adp ribose polymerase', 'agar gel electrophoresis', 'amino acid composition', 'anatomist walther flemming', 'appearance agar gel', 'awarded sydney brenner', 'boston biologist robert', 'carl vogt first', 'ceases respire aerobically', 'closely enough warrant']
As you can see, there is lots of garbage in the first example, but still some collocations that do not appear in the cleaned version. Similar to the noun-extraction issue with the previous approach, I wonder if I should choose one of the two, or combine them.
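To illustrate the effect of the clean switch, here is a simplified sketch that ranks bigrams by raw frequency. Note that this is a swapped-in technique: a real collocation finder scores pairs with an association measure rather than a plain count, and the function name `bigrams_by_frequency`, the small stopword set and the sample text are all made up.

```python
import re
from collections import Counter

# Tiny stand-in stopword set (assumption, far from complete)
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "by", "such", "as"}

def bigrams_by_frequency(text, n, clean=True):
    """Return the n most frequent adjacent word pairs as 'w1 w2' strings."""
    tokens = re.findall(r"\w+", text.lower())
    if clean:
        # Filtering before pairing lets the words on either side of a stopword meet
        tokens = [t for t in tokens if t not in STOPWORDS]
    pairs = Counter(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, count in pairs.most_common(n)]

text = ("Programmed cell death is a form of cell death. "
        "Defects in cell death lead to diseases such as cancer, "
        "and to diseases such as atrophy.")
raw = bigrams_by_frequency(text, 10, clean=False)
filtered = bigrams_by_frequency(text, 10, clean=True)
print(raw)
print(filtered)
```

Removing stopwords before forming pairs is presumably also how a pair like ‘university aberdeen’ ends up in the cleaned list: the stopword in between is gone, so the two neighbours become adjacent.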
In other news
Finding Gensim has been a life-saver! Instead of using Difflib to compare two strings, I now use a proper text-similarity metric, namely cosine similarity measurement. I do so by creating a TF-IDF weighted corpus out of the (stopwords-cleaned) descriptions of ontology-terms I use, and calculating the cosine similarity between an input string and each entry in the corpus. Gensim makes this all a breeze to do. An example of the output:
>>> wikiGet('alzheimer')
alzheimer in wikiTxt
>>> descMatch(wikiTxt,5)
Label: Alzheimer's disease
Similarity: 0.236387
Description: A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability.

Label: vascular dementia
Similarity: 0.192565
Description: A dementia that involves impairments in cognitive function caused by problems in blood vessels that feed the brain.

Label: dementia
Similarity: 0.157553
Description: A cognitive disease resulting from a loss of brain function affecting memory, thinking, language, judgement and behavior.

Label: cognitive disease
Similarity: 0.13909
Description: A disease of mental health that affects cognitive functions including memory processing, perception and problem solving.

Label: encephalitis
Similarity: 0.138719
Description: Encephalitis is a nervous system infectious disease characterized as an acute inflammation of the brain. The usual cause is a viral infection, but bacteria can also cause it. Cases can range from mild to severe. For mild cases, you could have flu-like symptoms. Serious cases can cause severe headache, sudden fever, drowsiness, vomiting, confusion and seizures.
I’m not sure if the similarity numbers it produces indicate I’m doing something wrong (there’s no high similarity), but intuitively I would say the results do make sense.
Graduation pt. 4: What’s next
Just a quick update to let myself know what’s going to happen next: It’s time to produce some results! While I was getting quite stuck in figuring out the best – or rather, most practical – way to extract keywords from a text (and not just any text, mind you, but notes of biologists), my supervisor told me it’s time to get some results. Hard figures. I decided to scrap POS-tagging the notes to extract valuable phrases, after I noticed the accuracy of the default NLTK POS-tagger was way below practical usage. Not too surprising, considering the default NLTK tagger is probably not trained on biologists’ notes.
Anyway, we came up with the following tests:
I use two sources:
- The first being the biologist’s notes (the Cyttron DB).
- The second being specific Wikipedia pages on recurring topics of the Cyttron DB:
Alzheimer, Apoptosis, Tau protein & Zebrafish.
From these two sources, I will use five different methods of representing the information:
- Literal representation (using each word, no edit)
- Simple keyword extraction (using word frequency after removing English stopwords)
- Bigram collocations
- Trigram collocations
- Keyword combo (word frequency + bigrams + trigrams)
Each of these ways of representing the source information can then be ‘boosted’ by using WordNet to generate synonyms, doubling the ways of representing the data (2×5=10!).
With these 10 different representations of each of the two datasources (2×10), I will use 3 ways to try to determine the subject:
- Literal label matching using two different sets of ontologies:
  - Cyttron-set: Gene Ontology, Human Disease Ontology, Mouse Pathology & National Cancer Institute Thesaurus
  - DBPedia ontology
- Matching the sources to descriptions of ontology terms, using the same two sets of ontologies.
- If I manage: Matching the datasources to the ‘context’ of ontology terms.
I started working on a method to take a term in an ontology and explore its surrounding nodes. I will collect all ‘literals’ attached to a node, and throw them in a big pile of text. I will then use this pile of text as a bag of words, to match to the datasources.
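A rough sketch of what I have in mind, on a toy graph. Everything here is hypothetical: the triples, the `node:` prefix convention for telling nodes from literals, and the function name `context_text` are invented for illustration, and a real version would read the triples from the ontology files instead.

```python
# Toy graph as (subject, predicate, object) triples. Objects starting with
# "node:" are other nodes, anything else is a literal (made-up convention).
triples = [
    ("node:alzheimer", "label", "Alzheimer's disease"),
    ("node:alzheimer", "is_a", "node:dementia"),
    ("node:dementia", "label", "dementia"),
    ("node:dementia", "definition",
     "A cognitive disease resulting from a loss of brain function"),
    ("node:dementia", "is_a", "node:cognitive_disease"),
    ("node:cognitive_disease", "label", "cognitive disease"),
]

def is_node(value):
    return value.startswith("node:")

def context_text(term, triples):
    """Collect the literals of a term and of its directly connected nodes."""
    neighbours = {term}
    for s, p, o in triples:
        if s == term and is_node(o):
            neighbours.add(o)   # outgoing link from the term
        elif o == term:
            neighbours.add(s)   # incoming link to the term
    literals = [o for s, p, o in triples if s in neighbours and not is_node(o)]
    return " ".join(literals)

pile = context_text("node:alzheimer", triples)
print(pile)
```

The resulting pile of text can then go through the same cleaning and comparison steps as any other string. This version only looks one step away from the term; how far to explore is something I still have to figure out.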
This will bring the total amount of tests to be done to 120:
 2 sources (wiki/cyttron)
10 representations of these sources
 3 methods (literal/desc/context)
 2 ontologies (cyttron-set/dbpedia)
-------------------------------------
2 * 10 * 3 * 2 = 120
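The same count can be checked by simply enumerating every combination. The short labels below are made-up shorthand for the sources, representations, methods and ontology sets described above.

```python
from itertools import product

sources = ["wiki", "cyttron"]
representations = ["literal", "keywords", "bigrams", "trigrams", "combo"]
boosted = [False, True]  # without / with WordNet synonyms: 5 * 2 = 10 representations
methods = ["literal", "desc", "context"]
ontologies = ["cyttron-set", "dbpedia"]

# Every test is one combination of the five choices
tests = list(product(sources, representations, boosted, methods, ontologies))
print(len(tests))
```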
Currently I am working on my final project of the Media Technology MSc. Programme of Leiden University. With the goal of structuring my thoughts and process so far, and because I’ve promised on Twitter, I decided to write a small and simple summary of what my project is about, how I got here and what I’m expecting to do in the next 2-3 months. If you want to jump ahead to what my project is about, jump to here.
A short history of my Media Technology graduation project
The idea of a graduation project for this particular master’s programme is to come up with your own inspiration to conduct a small autonomous research project. As Media Technology resides under the Leiden Institute of Advanced Computer Science faculty, using ‘computer science’ as a tool in your research is not very uncommon.
After finalizing the last couple of courses, I started out looking for inspiration for a research project. From a previous course I came into contact with (low-level) text analysis tasks, using the Python programming language and NLTK (Natural Language ToolKit, a very cool, free and open-source text analysis ‘swiss army knife’). I became interested in the possibilities of (statistical) text analysis. I liked the idea of using simple tools to perform research on the web, so I started looking at the features of NLTK and different Natural Language Processing techniques to include semantics in “web-research”. Having found these starting points, it was time to formulate research proposals.
My initial proposal was not very well fleshed out, more of a way to let the Media Technology board know what I was looking at, and basically to receive a go for the actual work (which to me still was to define my actual project). The proposal involved crawling lots of blogs to perform small scale analyses on, using low-level NLP techniques to go beyond simple statistics and wordfrequency-type research – to include meaning and semantics. The board decided my proposals were concrete enough to approve.
Another part of sending in my proposals and going ahead with the project was finding a supervisor. From a course on AI I took last year I remembered a PhD Student at Leiden University, who was involved/interested in semantics and the semantic web, so I figured he would be the right person to talk to. Soon after contacting him I understood he was only allowed to supervise me if my research contributed to what the Bio-Imaging Group was working on. This worried me at first, but after talking with Joris, I figured my project could actually be close enough to what I wanted to do, with the added advantages that:
- My research would actually contribute to something
- My domain would be comfortably restricted
The short explanation: Automatically analyzing and categorizing a large number of texts to be able to define their subjects. In my specific case the texts will be ‘free-form’, natural language descriptions of microscopic imagery, from the Cyttron database. This database contains a large number of images, accompanied by a small description (written by scientists) and a small list of tagwords. That is, if either of these fields is filled in at all. Because of the inconsistent style and method of writing these descriptions, an automated system to properly categorize the images would be very useful.
To perform this analysis, the idea is to use biological ontologies. Ontologies are basically large ‘dictionaries’ containing very specific (biological) terms with their definitions. The ontologies do not only contain their definitions, they also contain how these terms relate to each other. It basically provides me with a hierarchy of terms that says what is part of what, equal to what, etc.
Using these ontologies to analyze the texts not only lets me define the subject of a text, but also lets me use the data in the ontology to say more about the subject than can be found in the text itself.
When I run into problems, I could at some point determine whether existing (biological) ontologies are either missing data, or whether there are more fundamental issues with the matching of the human-produced data with the ontologies.
How am I going to do this?
This part is very much subject to change, as I am taking my first steps in the entire semantic web/OWL/RDF-world, but also in the Python/NLTK-programming world. My current idea is:
- Python for text-processing
- RDFLib to read ontologies
- NLTK for the ‘language tasks’: stemming words, filtering for keywords, etc.
- Scanning the database for occurring ontology-terms (literal matches)
- Generating a list of keywords from both the free-form text and the ontology-term descriptions, to try to match those if no literal matches are found. I could try this using a bag-of-words-model, to remove all ‘common’ words, and keep the more specific/interesting ones. Another approach is to remove all stopwords from the texts and count the frequency of the remaining words.
- Possibly looking at keyphrase extraction instead of simple keywords [or maybe looking at word collocations/chunk extraction?].
- Apply fuzzy word matching to allow for typos in the texts.
- Performing a statistical analysis on the likelihood of the subject. My thought is that ‘more specific’ (aka deeper nested) ontology terms should weigh heavier than more general terms. I might also find clusters of terms (a number of terms that are more related to each other than to the other terms found) to further specify the likelihood of the subject matter. But I can imagine that when I actually get to this point, new ideas might emerge.
- The idea is to acquire some (humanly-checked) training data so I can optimize the system and see what approaches work best.
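As a first idea of what the fuzzy matching step in the list above could look like, here is a minimal sketch using difflib from Python’s standard library. The list of labels, the function name `fuzzy_match` and the cutoff of 0.8 are assumptions for illustration; the right cutoff would have to be tuned on real data.

```python
import difflib

# Hypothetical ontology labels, for illustration only
labels = ["apoptosis", "zebrafish", "tau protein", "dementia"]

def fuzzy_match(word, candidates, cutoff=0.8):
    """Return the closest candidate to word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_match("Apoptis", labels))     # a typo that should still find its label
print(fuzzy_match("microscope", labels))  # nothing close enough
```

A higher cutoff means fewer false matches but also less tolerance for typos, so this is a trade-off between precision and recall.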
And what if it works?
Even though this sounds far away currently, I will have to take this scenario into account :p. My idea is to use the software I have written in other domains. Maybe even the domain I was thinking about earlier (using the web as a source for research, blogs, social media, news sites, wiki/dbpedia, etc.). I already came across the OpenCYC Ontology – “hundreds of thousands of terms, along with millions of assertions relating the terms to each other, forming an ontology whose domain is all of human consensus reality”. Which sounds pretty damn awesome.
Some quick ideas I had were using this ontology to create some sort of ‘semantic recommender system’ (on what domain? News sites? Blogs?), or find some other way to extract meaning from large corpora of texts. Anyway, those are ideas for the future, but I hope that I’ll be able to play around a bit with different applications by the time I’ve finished what I’m supposed to do :).