It’s a wrap!

Photo by Thijs Niks

This category can soon be archived ;)! Earlier this week I handed in my final paper, and yesterday was the day of my final presentation. It was a great day and I’m really excited about embarking on my next adventure. I will soon start as a PhD candidate at the University of Amsterdam, on a very exciting project in ‘Semantic Search in e-Discovery’ at the Information and Language Processing Systems group. Naturally, this blog will keep the world informed of my work and projects ;). Exciting times!

Paper

Download my paper: Automatic Annotation of Cyttron Entries using the NCIthesaurus [PDF – 328 KB]
Download the supplementary data (graphs, tables and viz): Supplementary Data [PDF – 2.27 MB]

Demo

Similarity graph demo

Check out the D3.js-powered demo of a similarity graph (comparing expert & computer-generated annotations)


Measure and Visualize Semantic Similarity Between Subgraphs

As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’. It ignores any semantic information and simply treats the concepts as words, and counts hits and misses between two sets of concepts.

To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will measure the algorithm’s performance by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm) are treated as subgraphs within the main graph: the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, since in the literature it consistently outperforms similar path-based metrics in the biomedical domain. Speaking of domains: this measure was originally designed for WordNet (as were many of the other metrics), but it has also been used and validated in the biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-).
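For reference, Leacock & Chodorow’s measure boils down to -log(path_length / (2 * D)), where D is the maximum depth of the taxonomy and the path length is counted in nodes. A minimal sketch of how it could be computed over a concept graph with NetworkX (the graph and variable names here are illustrative, not the ones in my code):

import math
import networkx as nx

def lch_similarity(G, c1, c2, max_depth):
    # G: undirected graph of is-a relations; max_depth: maximum depth of the taxonomy
    path_length = nx.shortest_path_length(G, c1, c2)  # shortest path length in edges
    # count nodes rather than edges (+1), so identical concepts don't give log(0)
    return -math.log((path_length + 1) / (2.0 * max_depth))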

Abstract

Below is the first draft of the abstract of my paper. It doesn’t yet include the results or conclusion. Word count: 127

Semantic annotation uses human knowledge formalized in ontologies to enrich texts, by providing structured and machine-understandable information about their content. This paper proposes an approach for automatically annotating texts of the Cyttron Scientific Image Database, using the NCI Thesaurus ontology. Several frequency-based keyword extraction algorithms, aiming to extract core concepts and exclude less relevant concepts, were implemented and evaluated. Furthermore, text classification algorithms were applied to identify important concepts which do not occur in the text. The algorithms were evaluated by comparing them to annotations provided by experts. Semantic networks were generated from these annotations and an ontology-based similarity metric was used to cross-compare them. Finally, the networks were visualized to provide further insights into the differences between the semantic structures generated by humans and the algorithms.

Tags: Semantic annotation, ontology-based semantic similarity, semantic networks, keyword extraction, text classification, network visualization, text mining

Algorithm performance measurement: Confusion Matrix

As I am starting to gather testing data, I figured it’d be a good time to determine how to measure the performance of the different results of my 24 text representation algorithms. What I want to measure per algorithm: how many keywords are predicted the same as the experts’, and how many aren’t. After some research and valuable advice, I came across confusion matrices, which seemed appropriate. For each algorithm I measure the number of:

                  Predicted
                  Negative   Positive
Actual  Negative     A          B
        Positive     C          D

A. True Negatives (excluded by algorithm & excluded by experts)
B. False Positives (included by algorithm & excluded by experts)
C. False Negatives (excluded by algorithm & included by experts)
D. True Positives (included by algorithm & included by experts)

I found this page from the University of Regina explaining confusion matrices, and decided to implement it. My implementation in pseudocode:

# expertList and algoList map each text to its list of URIs;
# NCIthesaurus is the set of all URIs in the ontology
A = B = C = D = 0
for text in texts:
    algoPOS = set(algoList[text])          # URIs included by the algorithm
    algoNEG = NCIthesaurus - algoPOS       # URIs excluded by the algorithm (ontology - algoPOS)
    expertPOS = set(expertList[text])      # URIs included by the experts
    expertNEG = NCIthesaurus - expertPOS   # URIs excluded by the experts (ontology - expertPOS)

    A += len(algoNEG & expertNEG)   # True Negatives (excluded by both)
    B += len(algoPOS & expertNEG)   # False Positives (included by algorithm, excluded by experts)
    C += len(algoNEG & expertPOS)   # False Negatives (excluded by algorithm, included by experts)
    D += len(algoPOS & expertPOS)   # True Positives (included by both)

matrix = [[A, B], [C, D]]   # the confusion matrix for this algorithm

With this information I can calculate a set of standard terms:
Accuracy (AC), Recall or True Positive Rate (TP), False Positive Rate (FP), True Negative Rate (TN), False Negative Rate (FN), Precision (P).

  AC = (A + D) / (A + B + C + D)
  TP = D / (C + D)
  FP = B / (A + B)
  TN = A / (A + B)
  FN = C / (C + D)
  P  = D / (B + D)

The only catch with these rates is that the proportions are heavily skewed by the size of the ontology (around 90,000 URIs): negative cases will always be far more frequent than positive ones. This means accuracy is always around 99.9%, and so is the true negative rate. I still have to figure out exactly what information I want to use and how (visualize it?).

Force-Directed Graphs: Playing around with D3.js

Update: Newer example of Force-Directed d3.js Graph here: Measure and Visualize Semantic Similarity Between Subgraphs

I recently replaced python-graph in my code with NetworkX, a slightly more sophisticated graph library for Python. Besides some more advanced algorithms for graph analysis (comparison, union, etc.), which can prove useful when analyzing data (comparing human data with mine, for example), I can also easily export my graphs to all kinds of formats, for example JSON. As I was getting a bit tired of GraphViz’s stubborn methods and its far-from-dynamic approach, I decided to start playing around with the excellent Data-Driven Documents JavaScript library, better known as D3.js, the successor to Protovis. Actually I had planned this quite a while ago, simply because I was impressed with the Force-Directed Graph example on their website. I figured that for coolness’ sake I should implement them, instead of using the crummy GraphViz graphs.
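To give an idea of what that export looks like in practice, here is a minimal sketch (illustrative names, not my actual script) of dumping a NetworkX graph to the node-link JSON structure the D3 force-layout example consumes:

import json
import networkx as nx
from networkx.readwrite import json_graph

# a tiny stand-in graph; in practice this is the subgraph my script builds
G = nx.Graph()
G.add_edge("http://example.org/ConceptA", "http://example.org/ConceptB", label="subClassOf")

# node_link_data returns the {"nodes": [...], "links": [...]} dict that
# D3's force layout example expects
with open("graph.json", "w") as f:
    json.dump(json_graph.node_link_data(G), f)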

So after a night and day of tinkering with the D3 code (starting from the Graph example included in the release, modifying stuff as I went) I came to this:

Click to play!

The red nodes are the concepts taken from the texts (either literal: filled red circles, or resulting from text classification: red donuts). The orange nodes are LCS nodes (Lowest Common Subsumers), aka shared ‘parent’ nodes, and the grey ones are simply in-between nodes (lying on shortest paths between concepts, or intermediate parents).

I added the labels, implemented zoom and panning (mousewheel to zoom, click and drag to pan), and included some metadata (hover over a node to see its URI, over an edge to see the relation). I am really impressed with the flexibility of D3; it’s amazing that I can now load up any random graph produced by my script and instantly see results.

The bigger plan is to make a fully interactive graph: start from the ‘semantic similarity’ graph (where only the red nodes are displayed) and let clicking on an edge expand the graph by showing the relationship between the two connected nodes. Semantic expansion at the click of a mouse ;)!

In other news

I’ve got a date for my graduation! If everything goes right, March 23rd is the day I’ll present my finished project. I’ll let you know once it’s final.

Not dead (yet)

While I haven’t been as active and hard-working on my graduation project as I would have liked, I am not dead (and neither is the project). Earlier this week I presented my project to the Bio-imaging group of Leiden University, which helped me a lot. I was able to present my project pretty much as-is, since I’m mostly done with the technical parts. I received valuable feedback and got good insights into what I should explain more thoroughly in the presentation.

Pathfinder finds paths!

First results are in! My computer spent about 20 hours retrieving and storing the neighboring concepts of the over 10,000 concepts my breadth-first search algorithm passed through to find the shortest paths between 6 nodes. But here is the result: first the ‘before’ graph, which I showed earlier, with all retrieved concepts and their parent relations. Below that is the new graph, which relates all concepts by finding their shortest paths (so far only for the orange concepts, from the Gene Ontology).

Before

After

So?

What were two separate clusters in the ‘before’ graph is now one big fat cluster… which is cool!

Less cool is the time it took… But oh well, it looks like I’m going to have to prepare some examples as proofs of concept; this is nowhere near realistic realtime performance so far (however, I got a big speed increase by moving my Sesame triple store from my ancient EeePC 900 to my desktop computer… goodbye supercomputer). The good news is that all neighboring nodes I processed so far are cached in a local SQLite database, so those 20 hours were not a waste! (The fact that my total ontology database consists of over 800,000 concepts, while these 10,000 took 20 hours, is something I choose not to take into consideration, however :p)
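For the curious, the caching idea is roughly the following (a sketch with made-up helper and table names, not the actual code): neighbours are fetched from the triple store once, stored in SQLite, and reused by the breadth-first search on later runs.

import sqlite3
from collections import deque

def get_neighbors(uri, db, fetch_from_sparql):
    # db: sqlite3 connection with a table cache(uri TEXT PRIMARY KEY, neighbors TEXT)
    row = db.execute("SELECT neighbors FROM cache WHERE uri = ?", (uri,)).fetchone()
    if row is not None:
        return row[0].split()                    # cache hit: neighbours stored space-separated
    neighbors = fetch_from_sparql(uri)           # cache miss: expensive SPARQL query
    db.execute("INSERT INTO cache (uri, neighbors) VALUES (?, ?)", (uri, " ".join(neighbors)))
    db.commit()
    return neighbors

def bfs_shortest_path(start, goal, db, fetch_from_sparql):
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in get_neighbors(path[-1], db, fetch_from_sparql):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None    # no path found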

It is important to note that the meaning or interpretation of the resulting graph (and particularly of the relations between concepts) is not the primary concern here: the paths (their lengths, the directions of the edges and the nodes’ ‘depths’) will primarily be used for the ontology-based semantic similarity measure I wrote about in this post.

Ontology-based Text Visualization: Towards a Visual Language

The first step towards creating graph-based text visualizations is thinking about how I would like to visually represent the structure of the information I extract from texts. I have included a preliminary example of the output at the bottom of this page. I use three different methods of extracting concepts from a piece of text:

  1. Counting literal occurrences of concepts (from ontologies)
  2. Finding related concepts by textual comparison of the text to the concepts’ descriptions
  3. Finding related concepts by exploring the ontological structure (aka relating concepts within one ontology by finding paths and parents, and possibly relevant neighbours)

The primary distinction I want to make is that of relevance (aka ‘likeliness of topic’). In the case of 1 this would be the frequency of the word (more occurrences = more relevant concept); in the case of 2 it is the calculated similarity of the source text to the concept’s description (a number between 0 and 1). In the case of 3, it would be the number of ‘connections’: the more concepts a concept found by exploring the ontology’s structure links together, the more relevant that concept is considered. I want to model this distinction by node size: the more relevant a concept is, the bigger I want to draw it in the graph.

The second distinction is that of literal to non-literal (1 being literal, 2 and 3 being non-literal). I want to model this distinction by style: literal concepts will be drawn as a filled circle, non-literal concepts as outlined circles.

The third distinction is that of the concept’s source: from which ontology does a concept originate? This distinction will be modeled by colour: each ontology (of the six I use) will have its own distinct colour. Explored concepts (from step 3) such as parents and shared parents will be coloured distinctly as well; since they are connected in the graph to the coloured nodes, their source will implicitly be clear.

Since Gephi doesn’t fully support Graphviz’s DOT language, and the graph library I use in Python conveniently exports graphs in DOT, I use Graphviz directly to render the results.

An issue I came across with the scaling (to represent relevance) is that I’m working with multiple measures: the frequency of literal words (1), the percentage of text similarity (2), and the degree count (3). In an effort to roughly equalize the scaling factors, I decided to use static counters. Each node gets an initial size of 0.5 (0.4 for shared parents, and 0.3 for parents). For each occurrence of a literal word, I add 0.05. For the text similarity, I add that percentage of 0.5 (26% similarity = 0.5 + (0.5 * 0.26)). For the degree, I add 0.1 for each in-link the node receives. This is an initial attempt at unifying the results; anyway, these are just settings I’m playing around with.
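Put as code, those sizing rules come down to something like this (the function and its arguments are made up for illustration; the numbers are the ones from the text):

def node_size(kind, literal_count=0, similarity=0.0, in_links=0):
    # base size: 0.5 for concepts, 0.4 for shared parents, 0.3 for parents
    size = {'concept': 0.5, 'shared_parent': 0.4, 'parent': 0.3}[kind]
    size += 0.05 * literal_count   # each literal occurrence adds 0.05
    size += 0.5 * similarity       # e.g. 26% similarity: 0.5 + (0.5 * 0.26)
    size += 0.1 * in_links         # each in-link adds 0.1
    return size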

Examples


These are very rough examples, because:
  • I use the literal representation of the input text (as I have not yet determined the most suitable keyword extraction method)
  • I haven’t determined a proper ‘cut-off’ for the text similarity measure; currently it includes every concept it finds with similarity ≥ 25%.
  • It doesn’t yet fully incorporate step 3 (it includes parents, but not yet paths between nodes)
  • It doesn’t scale nodes according to in-links
  • There is no filtering applied yet (removing obsolete classes, for example).

Add some #dataviz in the mix

All this time I have been working on my ontology-powered topic identification system: combining semantic web technologies with natural language processing and text-mining techniques to extract (a) topic(s) from a text. But I never really decided what my program should output: a topic? A list of potential topics with their ‘likeliness’? That would mean the relationships between topics that my SPARQL-powered ‘pathfinder’ finds would disappear into an algorithm that uses those paths to calculate semantic similarity, which would be a pity. This weekend it finally all came together:

Visualize!

I already played around with drawing graphs in Gephi, as a way to check the results other than processing huge lists in my Python interpreter. But now I realize it could be the perfect ‘final product’. What I want to create is a ‘semantically augmented tag-cloud graph’. As opposed to a standard tag cloud, my augmented tag graph will be a visualization of the text, composed of both interlinked concepts and solitary concepts, concepts that can be found literally in the text and concepts that aren’t mentioned anywhere in it. The tag graph will benefit from the linked-data nature of ontologies to:

1. Show relationships between concepts
2. Show extra concepts which do not occur literally in the text: nodes that occur in a path between two nodes, and maybe ‘superClass’ and ‘subClass’ nodes.

In this way, it will be similar to a tag cloud in that it conveys the content of a text, but augmented in that it could convey the meaning of a text and the relationships between tags. It will also be similar to the DBpedia RelFinder, but augmented in that it will not only show links but also contain a hierarchy of more to less important concepts. I hope that in the end it could be a viable alternative for graphical text representation.

Visualize what?

As I see it, my semantic graph-cloud should communicate at least these three things:
1. The most likely topic of the text
2. The concepts which occur literally in the text versus concepts that do not occur in the text
3. Clusters of similar concepts

My first idea is to model these three properties by size, alpha channel and colour respectively: the bigger the node, the more likely the topic; transparent nodes are the ones that do not occur in the text; and coloured nodes group semantically similar nodes. But that’s just my initial plan, I might have to give it some more thought and experiment with it.

Next, I should think about a method to ‘measure’ whether my semantic tag cloud conveys more information, or has any added value. In the end, I am totally unsure whether the resulting tag graphs will make sense, as that depends on multiple factors such as the efficiency of my keyword extraction, the efficiency of the vector-space-based string comparison, the quality of the ontologies, etc.

The good news is that most of the technical work is in a close-to-finished state. I have various methods of extracting keywords from large texts (which I will be validating soon); a functional method to find ‘new’ concepts by comparing a text to all the descriptions of ontology concepts; and a functional shortest-path finder to explore how two concepts are related (and produce a graph of it). It’s just a matter of putting it all together and selecting a suitable tool to draw the graphs. I don’t think I’ll use Gephi, as I want to fully integrate the graph drawing into my script, so who knows, maybe it’s back to NetworkX, igraph, or maybe Protovis?

In the meantime I’m looking at validating the data I generated with the various keyword extraction algorithms. By using human experts and cross-validating with existing keywords from certain entries, I’ll be able to evaluate which method of keyword extraction is the most effective for my purposes. Once that’s done, and I’ve picked my graph-drawing method, I can start showing some preliminary results!

Computing string similarity with TF-IDF and Python

“The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.”[wikipedia]

It is also the weight I use to measure similarity between texts, for these two tasks of my thesis project (click for pic!):

Step 3: measure the similarity of a Cyttron-DB entry to a concept description from an ontology. This will allow me to find concepts that do not appear literally in the text.
Step 5: relate concepts which come from different ontologies, by measuring how similar the text surrounding one concept found in the text is to the text surrounding another found concept.

As mentioned before, I am using the excellent Gensim “vector space modelling for humans” package, which takes all the complicated mathematics off my hands (like the scary and intimidating formula up top!). Perfect for me, as I’m not a mathematician, nor a computational linguist, nor a statistician, but I AM a human, who wants to work with a solid and proven method of similarity measurement and feature extraction for texts. Since I am what I am, I won’t attempt to explain any of the inner workings of bag-of-words models, vector space, or TF-IDF measures, sorry, there are much better places for that. I’ll simply show how I made Gensim work for me (assuming it does).

The first step is to create a training corpus. The training corpus defines the features of the text – the words that will be considered ‘important’ when looking at a text. The training corpus needs to be from the same domain as the target application: in my case the biomedical domain.

At first I was looking at extracting a bunch of relevant Wikipedia articles (all articles from Wikipedia’s Biology category) to use as a training corpus. But then I came across something better: the Open Access BioMed Central full-text corpus. The corpus consists of over 100,000 articles, weighing in at 8 GB of XML documents.

I wrote a simple script using lxml to parse the individual files: extracting all plain text from the article body, cleaning it and storing it in a new text file (one article per line) for later processing. The cleaning process consists of three steps: tokenizing the articles (breaking them up into words), filtering out common stopwords, and finally stemming the remaining words. I chose to include stemming in order to unify words such as ‘hippocampal’ and ‘hippocampus’ (stemming returns the ‘root’ of a word). As I stem both the training corpus and the strings that need to be compared, it is not a disaster if words get stemmed incorrectly: in the end I don’t need to make sense of the stemmed words, I only need them for counting. The plain-text file my script created is 650 MB (vs 8.8 GB for the uncompressed XML files)!
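Roughly, the extraction script does something like this (a sketch: the element names and file handling are assumptions, and cleanDoc is the cleaning function shown below):

from lxml import etree

def extract_body_text(xml_path):
    # pull all plain text out of the article's <body> element
    tree = etree.parse(xml_path)
    body = tree.find('.//body')
    return ' '.join(body.itertext()) if body is not None else ''

# one cleaned article per line in the output corpus file
with open('corpus.txt', 'w') as out:
    for path in article_paths:   # article_paths: the list of BMC XML files (hypothetical)
        out.write(' '.join(cleanDoc(extract_body_text(path))) + '\n')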

The cleaning of the articles is pretty straightforward, using pre-cooked NLTK modules: the WordPunct tokenizer, the set of English stopwords and NLTK’s implementation of the Porter stemmer. For the quality of the similarity measurement it is important to follow the exact same cleaning procedure for the strings I want to compare – I use the same function for both the corpus preparation and the comparison strings:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = WordPunctTokenizer().tokenize(doc)
    # lowercase, drop stopwords and very short tokens
    clean = [token.lower() for token in tokens
             if token.lower() not in stopset and len(token) > 2]
    # stem what remains
    final = [stemmer.stem(word) for word in clean]
    return final

Creating a training corpus in Gensim

Gensim‘s documentation is very extensive, and I can recommend going through the tutorials if you want to get an idea of the possibilities. But I couldn’t find much documentation on how to do simple string-to-string comparisons, so I wrote down what I did (and errrm yes, it’s pretty much exactly the same as string-to-index querying you can find in the Gensim tutorials :p):

1. Create a ‘dictionary’ of the training corpus’ raw text:

The dictionary maps each word to an integer id (and keeps track of document frequencies); it will be used to convert texts to vector space at a later stage:

>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
>>> print dictionary
Dictionary(1049403 unique tokens)

2. Convert the training corpus to vector space:

class MyCorpus(object):
    def __iter__(self):
        for line in open('corpus.txt'):
            yield dictionary.doc2bow(line.lower().split())
>>> corpus = MyCorpus()
>>> corpora.MmCorpus.serialize('corpus.mm', corpus) # Save corpus to disk
>>> corpus = corpora.MmCorpus('corpus.mm') # Load corpus
>>> print corpus
MmCorpus(99432 documents, 1049403 features, 39172124 non-zero entries)

3. Initialize the TF-IDF model:

>>> tfidf = models.TfidfModel(corpus)
>>> print tfidf
TfidfModel(num_docs=99432, num_nnz=39172124)

Thankfully it’s possible to store the generated corpus, dictionary and tfidf to disk: parsing all these documents takes quite a while on my computer. That’s it for the preparation of the training corpus!
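For completeness, saving and reloading looks roughly like this (standard Gensim save/load calls; the filenames are mine):

dictionary.save('corpus.dict')   # word <-> id mappings
tfidf.save('corpus.tfidf')       # the TF-IDF model
# the corpus itself was already serialized above with corpora.MmCorpus.serialize

# ...and in a later session:
dictionary = corpora.Dictionary.load('corpus.dict')
tfidf = models.TfidfModel.load('corpus.tfidf')
corpus = corpora.MmCorpus('corpus.mm')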

Comparing two strings

Now whenever I want to compare two strings, using features gathered from the training corpus, I need to:

  1. Clean both strings in the same way I cleaned the articles in the corpus (NLTK stopword-filter + tokenization) » cleanDoc(string)
  2. Convert both strings to vector-space using the dictionary generated from the training corpus » dictionary.doc2bow(string)
  3. Convert both vector-space representations of the strings to TF-IDF space, using the TF-IDF model initialized earlier » tfidf[string]

When both strings are prepared, all that is left is to compare them, by creating an ‘index’ (the reference string) and a ‘query’ (the other string). Order doesn’t matter.

index = similarities.MatrixSimilarity([tfidf1], num_features=len(dictionary))
sim = index[tfidf2][0]  # index[tfidf2] is an array holding one similarity value
print str(round(sim * 100, 2)) + '% similar'
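Tied together, a compareDoc helper like the one used in the examples below could look something like this (a sketch, not my exact code; cleanDoc, dictionary and tfidf come from the preparation steps above):

from gensim import similarities

def compareDoc(doc1, doc2):
    vec1 = tfidf[dictionary.doc2bow(cleanDoc(doc1))]   # clean -> bag-of-words -> TF-IDF space
    vec2 = tfidf[dictionary.doc2bow(cleanDoc(doc2))]
    index = similarities.MatrixSimilarity([vec1], num_features=len(dictionary))
    sim = index[vec2][0]                               # cosine similarity between the two vectors
    return str(round(sim * 100, 2)) + '% similar'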

Resulting in, for example, the comparison of the description of “Alzheimer’s disease” and “Cognitive disease” in the Human Disease (DOID) ontology:

>>> compareDoc("""A dementia that results in progressive memory loss, impaired thinking,
disorientation, and changes in personality and mood starting in late middle age and leads
in advanced cases to a profound decline in cognitive and physical functioning and is marked
histologically by the degeneration of brain neurons especially in the cerebral cortex and
by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is
characterized by memory lapses, confusion, emotional instability and progressive loss of
mental ability.""","""A disease of mental health that affects cognitive functions including
memory processing, perception and problem solving.""")

23.29% similar

Or another example: the Wikipedia article of “Alzheimer’s disease” compared to the ontology description of “Alzheimer’s disease”:

>>> wikiGet('alzheimer')
alzheimer in wikiTxt
>>> compareDoc(wikiTxt,"""A dementia that results in progressive memory loss, impaired thinking,
disorientation, and changes in personality and mood starting in late middle age and leads in
advanced cases to a profound decline in cognitive and physical functioning and is marked
histologically by the degeneration of brain neurons especially in the cerebral cortex and by
the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized
by memory lapses, confusion, emotional instability and progressive loss of mental ability.""")

31.95% similar

Final example: the top 5 most similar ontology concepts to the Wikipedia page of “Alzheimer’s disease”:

 >>> descMatch(wikiAlz)
Label: Alzheimer's disease
Similarity: 31.9990843534

Label: vascular dementia
Similarity: 28.0893445015

Label: amyloid deposition
Similarity: 25.6860613823

Label: cognitive disease
Similarity: 18.7662974

Label: dementia
Similarity: 18.0801317096

Now the second task (matching a string to all the descriptions from my ontologies) is much the same process, the only difference being that I need to use the similarities.Similarity object when creating the index (of the descriptions): the MatrixSimilarity object resides fully in RAM, while the Similarity object lives on disk.
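A sketch of that disk-backed variant (the index prefix and variable names are mine; desc_bow stands for the bag-of-words corpus of all ontology descriptions):

from gensim import similarities

index = similarities.Similarity('descindex', tfidf[desc_bow], num_features=len(dictionary))
query = tfidf[dictionary.doc2bow(cleanDoc(some_text))]
sims = index[query]   # similarity of some_text to every description, served from disk-backed shards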

I am pretty confident about these preliminary results. It all seems to work as it should, and it should be much more robust than my earlier attempts at similarity measurement using difflib and some crummy homegrown keyword extraction and comparison (which I will still use for generating synonyms; crumminess works for that).

Python graphs and visualizations

To my right is a visualization of the output of my SPARQL-powered shortest-path algorithm, finding a link between ‘intracellular and extracellular accumulation’ & ‘developmental and adult structural defect’, two concepts in the Mouse Pathology ontology. Click it! It shows the two ‘source’ concepts in white, and the shortest path (of 3 nodes: 4 hops) in grey. It looks like this in Python:

[[u'http://purl.obolibrary.org/obo/MPATH_56', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_55'], [u'http://purl.obolibrary.org/obo/MPATH_55', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_603'], [u'http://purl.obolibrary.org/obo/MPATH_1', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_603'], [u'http://purl.obolibrary.org/obo/MPATH_33', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_1']]

Which is clearly less awesome.

My algorithm generates a directed subgraph of all the nodes and edges it encounters during its search for a path between two ontology concepts. I figured generating this subgraph would make it easier to get some of the variables I need for the semantic similarity calculation (such as the number of changes of direction in the path, a node’s depth from the root, etc.). Furthermore, I can use the subgraph to more easily assign weights to the textual data surrounding the nodes when assembling my bag-of-words model of a node’s direct context, as I explained in my previous post.

What to use

There are heaps of libraries for managing graphs in Python, and loads of programs to visualize and manipulate graphs. Here’s my stuff of choice.

After looking at several Python graph libraries (NetworkX, igraph, graph-tool, etc.) I chose to use python-graph, which in my opinion is the most straightforward, compact, simple yet powerful graph library for Python. I use python-graph to generate a file describing the subgraph in the DOT language (“plain text graph description language”). This file can then be imported into a wide array of programs to visualize and manipulate the graph.
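Hand-rolled, that DOT export boils down to something like this (a sketch; my actual code goes through python-graph’s DOT writer, but the output is essentially the same):

def triples_to_dot(triples, path):
    # triples: a list of [subject, predicate, object] URIs like the one shown above
    with open(path, 'w') as f:
        f.write('digraph subgraph {\n')
        for s, p, o in triples:
            # use the URI fragments as node names and the predicate as edge label
            f.write('  "%s" -> "%s" [label="%s"];\n' %
                    (s.rsplit('/', 1)[-1], o.rsplit('/', 1)[-1], p.rsplit('#', 1)[-1]))
        f.write('}\n')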

Visualizing the subgraph containing the shortest path between two nodes would allow me to get a better picture of what my algorithm fetches, and also to double-check the results (seeing is believing ;)). To visualize graphs, there are plenty of options again. After sifting through Tulip, Graphviz and some other obscure programs I stumbled upon Gephi, a very complete, pretty and simple open-source graph visualization & manipulation program. It has extensive documentation, and several advanced features to manipulate the graph and fetch some values. Ideally though, I will manage all those ‘advanced value-fetching tasks’ in my python script. Gephi still provides a nice and quick way to double-check some of the output and get a more concrete idea about what’s happening, as things can get pretty complex, pretty fast:

1136 nodes of the DOID-ontology, subgraph produced when finding a link for DOID_4 (disease) and DOID_10652 (Alzheimer's Disease) - red nodes in the graph. Linked by the three yellow nodes.

Simple keyword extraction in Python: choices, choices.

As explained in an earlier post, I am working on a simple method of extracting ‘important words’ from a text entry. The methods I am using at the moment are frequency distributions and word collocations. I’ve bumped into some issues regarding the fine-tuning of these methods. Read on for a short explanation of my approaches, and the issues with each.

Frequency Distribution: POS-tagging y/n?

Extracting keywords by frequency distribution is nothing more than counting words and sorting the list of words by occurrence. Before doing this, I filter stopwords from the text entry. The short explanation of how I’m doing this (source code available on GitHub):

» Tokenize the text (using NLTK’s WordPunctTokenizer)
» Lowercase all the words
» ‘Clean’ the list by removing common stopwords from the list (using NLTK’s English stopwords-list)

This is straightforward enough; an example of the results (from the Wikipedia page on ‘Apoptosis’):

>>> cyttron.wikiGet('Apoptosis')
Apoptosis in wikiTxt
>>> freqWords(cyttron.wikiTxt,15)
['apoptosis', 'cell', '160', 'apoptotic', 'cells', 'caspase', 'death',
'.&#', 'proteins', 'tnf', 'bcl', 'protein', 'also', 'caspases']
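For reference, a freqWords-style helper might look roughly like this (the name and arguments match the example above, but the body is a sketch rather than the actual code from the repository; it assumes a recent NLTK where FreqDist supports most_common):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

def freqWords(text, n):
    stopset = set(stopwords.words('english'))
    tokens = [t.lower() for t in WordPunctTokenizer().tokenize(text)]
    words = [t for t in tokens if t not in stopset and len(t) > 2]
    return [word for word, count in nltk.FreqDist(words).most_common(n)]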

Earlier I was thinking about using POS-tagging (Part-Of-Speech tagging, to identify word types) in order to extract only frequently occurring nouns. I figured losing relevant adjectives (such as ‘red’ in red blood cell) could be compensated for by the word collocation extraction. POS-tagging the tokenized text and retrieving only the most frequent nouns results in:

>>> freqNouns(cyttron.wikiTxt,15)
['apoptosis', 'cell', 'caspase', 'death', 'protein', 'tnf', 'pathway',
'activation', 'membrane', 'p53', 'response', 'family', 'gene', 'greek']

My problem here is that I’m not sure which of the two is ‘better’ (if either), or whether I should use a combination of both. Also, I haven’t decided yet how to handle non-alphabetic words. Initially I planned on using regular expressions to filter non-alphabetic strings, but I later figured that it wouldn’t make sense in my case: in the above example, this would omit ‘p53’, a tumor suppressor protein, which is very relevant.

From earlier playing around with POS-tagging I noticed the precision was not quite high enough to perform chunk extraction (looking for specific phrases / grammatical constructions). Extracting only nouns does seem to do the job quite well, even if I still miss some and get some false positives.
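The noun-only variant adds a POS-tagging step; again a sketch rather than the real code, reusing the imports from the previous snippet (nltk.pos_tag needs its tagger model installed):

def freqNouns(text, n):
    stopset = set(stopwords.words('english'))
    tokens = [t.lower() for t in WordPunctTokenizer().tokenize(text)]
    words = [t for t in tokens if t not in stopset and len(t) > 2]
    # keep only tokens tagged as nouns (NN, NNS, NNP, NNPS)
    nouns = [word for word, tag in nltk.pos_tag(words) if tag.startswith('NN')]
    return [word for word, count in nltk.FreqDist(nouns).most_common(n)]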

Word Collocations: Stopword filtering y/n?

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. I generate bigram and trigram word collocations, i.e. ‘2-word strings’ and ‘3-word strings’. My issue here is whether or not to use stopword filtering. Here are the results of the word collocation function on the same Wikipedia page, the first list being the bigram collocations, the second the trigrams. Example without stopword filtering:

>>> wordCollo(cyttron.wikiTxt,10,clean=False)
['such as', 'cell death', 'of the', 'due to', 'leads to', 'programmed cell',
'has been', 'bone marrow', 'have been', 'an increase']
['adp ribose polymerase', 'amino acid composition', 'anatomist walther flemming',
'boston biologist robert', 'break itself down', 'combining forms preceded',
'count falls below', 'german scientist carl', 'homologous antagonist killer',
'mdm2 complexes displaces']

Example with stopword filtering:

>>> wordCollo(cyttron.wikiTxt,10,clean=True)
['cell death', 'programmed cell', 'bone marrow', 'university aberdeen',
'calcium concentration', 'adenovirus e1b', 'british journal', 'citation needed',
'highly conserved', 'nitric oxide']
['adp ribose polymerase', 'agar gel electrophoresis', 'amino acid composition',
'anatomist walther flemming', 'appearance agar gel', 'awarded sydney brenner',
'boston biologist robert', 'carl vogt first', 'ceases respire aerobically',
'closely enough warrant']

As you can see, there is lots of garbage in the first example, but also some collocations that do not appear in the cleaned version. Similar to the noun-extraction issue with the previous approach, I wonder whether I should choose one of the two, or combine them.
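For reference, a wordCollo-style helper could be put together with NLTK’s collocation finders; this is a sketch with assumed names and an arbitrary choice of association measure, and the clean flag controls the stopword filtering discussed above:

from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

def wordCollo(text, n, clean=False):
    tokens = [t.lower() for t in WordPunctTokenizer().tokenize(text)]
    if clean:
        stopset = set(stopwords.words('english'))
        tokens = [t for t in tokens if t not in stopset and len(t) > 2]
    bigrams = BigramCollocationFinder.from_words(tokens)
    trigrams = TrigramCollocationFinder.from_words(tokens)
    # return the n best bi- and trigrams, joined back into strings
    return ([' '.join(b) for b in bigrams.nbest(BigramAssocMeasures.likelihood_ratio, n)],
            [' '.join(t) for t in trigrams.nbest(TrigramAssocMeasures.likelihood_ratio, n)])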

In other news

Finding Gensim has been a life-saver! Instead of using difflib to compare two strings, I now use a proper text-similarity metric, namely cosine similarity. I do so by creating a TF-IDF-weighted corpus out of the (stopword-cleaned) descriptions of the ontology terms I use, and calculating the cosine similarity between an input string and each entry in the corpus. Gensim makes this all a breeze to do. An example of the output:

>>> wikiGet('alzheimer')
alzheimer in wikiTxt
>>> descMatch(wikiTxt,5)
Label: Alzheimer's disease
Similarity: 0.236387
Description: A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability.

Label: vascular dementia
Similarity: 0.192565
Description: A dementia that involves impairments in cognitive function caused by problems in blood vessels that feed the brain.

Label: dementia
Similarity: 0.157553
Description: A cognitive disease resulting from a loss of brain function affecting memory, thinking, language, judgement and behavior.

Label: cognitive disease
Similarity: 0.13909
Description: A disease of mental health that affects cognitive functions including memory processing, perception and problem solving.

Label: encephalitis
Similarity: 0.138719
Description: Encephalitis is a nervous system infectious disease characterized as an acute inflammation of the brain. The usual cause is a viral infection, but bacteria can also cause it. Cases can range from mild to severe. For mild cases, you could have flu-like symptoms. Serious cases can cause severe headache, sudden fever, drowsiness, vomiting, confusion and seizures.

I’m not sure whether the similarity numbers it produces indicate I’m doing something wrong (none of the similarities are particularly high), but intuitively I would say the results do make sense.