As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’. It ignores any semantic information and simply treats the concepts as words, and counts hits and misses between two sets of concepts.
To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will measure the algorithm’s performance by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm) are treated as subgraphs within the main graph: the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, as in the literature I found it consistently outperforms similar path-based metrics in the Biomedical domain. Speaking of domain; this measure has originally been designed for WordNet (as many of the other metrics), but has also been used and validated in the Biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-). (more…)
First results are in! My computer spent about 20 hours to retrieve and store neighboring concepts of over 10.000 concepts which my Breadth-First-Search algorithm passed through to find the shortest paths between 6 nodes. But here is the result, first the ‘before’ graph, which I showed earlier: all retrieved concepts with their parent relations. Below that is the new graph, which relates all concepts by finding their shortest paths (so far only the orange concepts – from the Gene Ontology).
What were two separate clusters in the first to the left is now one big fat cluster… Which is cool!
Less cool is the time it took… But oh well, looks like I’m going to have to prepare some examples as proof of concepts. Nowhere near realistic realtime performance so far… (however I got a big speed increase by moving my Sesame triple store from my ancient EeePC900 to my desktop computer… Goodbye supercomputer). The good news is that all neighboring nodes I processed so far are cached in a local SQLite database, so those 20 hours were not a waste! (considering my total ontology database consists out of over 800.000 concepts, and 10.000 concepts took 20 hours, is something I choose not to take into consideration however :p).
It is important to note that the meaning or interpretation of the resulting graph (and particularly the relations between concepts) is not the primary concern here: the paths (their lengths, the directions of the edges and the node’s ‘depths’) will be primarily used for the ontology-based semantic similarity measure I wrote about in this post.
All this time I have been working on my ontology-powered topic identification system: combining semantic web technologies with natural language processing and text-mining techniques to extract (a) topic(s) from a text. But I never really decided what my program should output: a topic? A list of potential topics with their ‘likeliness’? That would mean the relationships between topics my SPARQL-powered ‘pathfinder’ finds would disappear in an algorithm that used those paths to calculate semantic similarity, which would be a pitty. This weekend it finally all came together:
I already played around with drawing graphs with Gephi, as a method to check the results in a way other than processing huge lists in my python interpreter. But now I realize it could be the perfect ‘final product’. What I want to create is a ‘semantically-augmented tag-cloud-graph‘. As opposed to a standard tagcloud, my augmented tag graph will be a visualization of the text, composed of both interlinked concepts and solitary concepts, concepts that can be found literally in the text and concepts that aren’t mentioned anywhere in the text. The tag graph will benefit from the linked data nature of ontologies to:
1. Show relationships between concepts
2. Show extra concepts which do not occur literally in the text: nodes that occur in a path between two nodes, and maybe ‘superClass’ and ‘subClass’ nodes.
In this way, it will be similar to a tag cloud as it conveys the content of a text, but augmented as it could convey the meaning of a text and relationships between tags. It will also be similar to the DBPedia RelFinder, but augmented as it will show links but also contain a hierarchy of more to less important concepts. I hope that in the end it could be a viable alternative of graphic text-representation.
As I see it, my semantic graph-cloud should communicate at least these three things:
1. The most likely topic of the text
2. The concepts which occur literally in the text versus concepts that do not occur in the text
3. Clusters of similar concepts
My first idea is to model these three properties by size, alpha-channel and colour respectively, the bigger node, the more likely the topic, transparant nodes are the ones that do not occur in the text, and coloured nodes to group semantically similar nodes. But that’s just my initial plan, I might have to give it some more thoughts and experiment with it.
Next, I should think about a method to ‘measure’ whether my semantic tagcloud conveys more information, or has ajy added value. In the end, I am totally unsure whether the resulting tag graphs will make sense as it depends on multiple factors such as the efficiency of my keyword extraction, the efficiency of the vector-space based string comparison, the quality of the ontologies, etc.
The good news is, most of the technical work is in a close-to-finished state. I have various methods of extracting keywords from large texts (which I will be validating soon), I have a functional method to find ‘new’ concepts by comparing a text to all the descriptions of ontology concepts, I have a functional shortest-path finder to explore how two concepts are related (and produce a graph of it). It’s just a matter of putting it all together and selecting a suitable tool to draw the graphs. I don’t think I’ll use Gephi, as I want to fully integrate the graph drawing in my script, so who knows, maybe it’s back to NetworkX, igraph, or maybe Protovis?
In the mean time I’m looking at validating the data I generated with the various keyword extraction algorithms. By using human experts and cross-validating with existing keywords from certain entries, I’ll be able to evaluate which method of keyword extraction is the most efficient for my purposes. Once that’s done, and I’ve picked my graph-drawing method, I can start showing some preliminary results!
“The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.”[wikipedia]
It is also the weight I use to measure similarity between texts, for these two tasks of my thesis project (click for pic!):
– Step 3: measure the similarity of a cyttron-db entry to a concept-description from an ontology. This will allow me to find concepts in the text that do not appear literally.
– Step 5: to be able to relate concepts which come from different ontologies. By measuring how similar the text surrounding a concept found in the text is compared to another found concept.
As mentioned before, I am using the excellent Gensim “vector space modelling for humans” package, which takes all the complicated mathematics off my hands (like the scary and intimidating formula up top!). Perfect for me, as I’m not mathematician, nor a computational linguist, nor a statistician, but I AM a human, who wants to work with a solid and proven method of similarity measures and feature extraction for texts. Since I am what I am, I won’t attempt to explain any of the inner workings of Bag-of-word models, vector space, and TF-IDF measures, sorry, there are much better places for that. I’ll simply show how I made Gensim work for me (assuming it does).
The first step is to create a training corpus. The training corpus defines the features of the text – the words that will be considered ‘important’ when looking at a text. The training corpus needs to be from the same domain as the target application: in my case the biomedical domain.
I wrote a simple script using lxml2 to parse the individual files: extracting all plaintext from the article body, cleaning them and storing them in a new text-file (1 article per line) for later processing. The cleaning process consists out of 3 steps: tokenizing articles (aka breaking an article up in words), filtering for common stopwords, and finally stemming the remaining words. I chose to include stemming, in order to unify such words as ‘hippocampal’ and ‘hippocampus’ (stemming returns the ‘root’ of a word). As I stem both the training corpus and the strings that need to be compared, it is not a disaster if words get stemmed incorrectly: in the end I don’t need to make sense out of the stemmed words, I only need them for counting. The plaintext file my script created is 650MB (vs 8,8GB for the uncompressed XML-files)!
The cleaning of the article is pretty straightforward, using pre-cooked NLTK modules: the WordPunct tokenizer, set of English stopwords and NLTK’s implementation of the Porter stemmer. For the quality of the similarity measurement it is important to follow the exact same cleaning procedure with the strings I want to compare – I use the same function for both the corpus-preparation as that of the comparison strings:
stopset = set(stopwords.words('english'))
stemmer = nltk.PorterStemmer()
tokens = WordPunctTokenizer().tokenize(doc)
clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
final = [stemmer.stem(word) for word in clean]
Creating a training corpus in Gensim
Gensim‘s documentation is very extensive, and I can recommend going through the tutorials if you want to get an idea of the possibilities. But I couldn’t find much documentation on how to do simple string-to-string comparisons, so I wrote down what I did (and errrm yes, it’s pretty much exactly the same as string-to-index querying you can find in the Gensim tutorials :p):
1. Create a ‘dictionary’ of the training corpus’ raw text:
The dictionary contains words:frequency mappings and will be used to convert texts to vector space at a later stage:
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
>>> print dictionary
Dictionary(1049403 unique tokens)
2. Convert the training corpus to vector space:
for line in open('corpus.txt'):
>>> corpus = MyCorpus()
>>> corpora.MmCorpus.serialize('corpus.mm', corpus) # Save corpus to disk
>>> corpus = corpora.MmCorpus('corpus.mm') # Load corpus
>>> print corpus
MmCorpus(99432 documents, 1049403 features, 39172124 non-zero entries)
Thankfully it’s possible to store the generated corpus, dictionary and tfidf to disk: parsing all these documents takes quite a while on my computer. That’s it for the preparation of the training corpus!
Comparing two strings
Now whenever I want to compare two strings, using features gathered from the training corpus, I need to:
Clean both strings in the same way I cleaned the articles in the corpus (NLTK stopword-filter + tokenization) » cleanDoc(string)
Convert both strings to vector-space using the dictionary generated from the training corpus » dictionary.doc2bow(string)
Convert both vector-space representations of the strings to TF-IDF space, using the TF-IDF model initialized earlier » tfidf[string]
When both strings are prepared, all is left to compare them, by creating an ‘index’ (the reference string) and a ‘query’ (the other string). Order doesn’t matter.
index = similarities.MatrixSimilarity([tfidf1],num_features=len(dictionary))
sim = index[tfidf2]
print str(round(sim*100,2))+'% similar'
Resulting in, for example, the comparison of the description of “Alzheimer’s disease” and “Cognitive disease” in the Human Disease (DOID) ontology:
>>> compareDoc("""A dementia that results in progressive memory loss, impaired thinking,
disorientation, and changes in personality and mood starting in late middle age and leads
in advanced cases to a profound decline in cognitive and physical functioning and is marked
histologically by the degeneration of brain neurons especially in the cerebral cortex and
by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is
characterized by memory lapses, confusion, emotional instability and progressive loss of
mental ability.""","""A disease of mental health that affects cognitive functions including
memory processing, perception and problem solving.""")
Or another example: the Wikipedia article of “Alzheimer’s disease” compared to the ontology description of “Alzheimer’s disease”:
alzheimer in wikiTxt
>>> compareDoc(wikiTxt,"""A dementia that results in progressive memory loss, impaired thinking,
disorientation, and changes in personality and mood starting in late middle age and leads in
advanced cases to a profound decline in cognitive and physical functioning and is marked
histologically by the degeneration of brain neurons especially in the cerebral cortex and by
the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized
by memory lapses, confusion, emotional instability and progressive loss of mental ability.""")
Final example: the top 5 most similar ontology concepts to the Wikipedia page of “Alzheimer’s disease”:
Now the second task (of matching a string to all the descriptions from my ontologies is much the same process, with the only difference that I need to use the similarities.Similarity object when creating the index (of the descriptions): the MatrixSimilarity object resides fully in RAM, the Similarity object on disk.
I am pretty confident about these preliminary results. It all seems to work as it should, and should be much more robust than my earlier attempts at similarity measurement using difflib and some crummy homegrown keyword-extraction and comparison (which I will still use for generating synonyms, crumminess works for that).
My thesis is about keyword extraction of biological notes, using semantic ‘dictionaries’ called ontologies. These ontologies are large networks, where each node stands for a concept, and each connection between nodes for relations. See the picture on the right for a visual representation of an ontology.
To identify the subject of a text, I need to see what terms that are described in an ontology appear in a text. This leaves me with multiple concepts, of which I need to find the ‘common denominator’. To do this, I have to measure the similarity (or inversely: the distance) between two concepts: if I find a bunch of very similar concepts in one text, I can be more confident of the subject.
Luckily, a lot of people have dealt with this ‘ontology-based semantic similarity measurement’. I gathered and studied a couple of papers, and provide a quick overview of my findings. See my literature list for a more complete overview.
DISCLAIMER: This is by no means intended to be an exhaustive overview. It’s short. I’m sure I’ve not read every relevant paper. Due to time constraints my priority is finding a suitable method to carry on and checking to see if the direction I’m heading is OK (it seems that way). My overview deals with global approaches only, nothing too specific. If I’ve missed anything really obvious, I’d be grateful if you could leave a comment :).
There are two main approaches in ontology-based semantic similarity measurement: edge-based (also called structural or hierarchical approach) and node-based (also called information-content approach).
Edge-based approaches take the structure of the network as a base, focussing on the connections between nodes and their implications/meanings. In edge-based semantic similarity measurement, there are three main principles (which are fortunately pretty much globally agreed-upon – at least in the papers I found):
Shortest-path length between nodes
The most direct approach: the closer two nodes are in the network, the more similar they are. Important detail: path-length is measured by counting (only!) the nodes which have a ‘is_a’ or ‘part_of’ relation. The most primitive semantic similarity measures use only path lengths. However, this shortest-path measure can be extended with:
Node ‘depth’ (aka specifity)
The deeper a node is (farther away from the root), the more specific it is. In most papers this does not revolve around individual node’s depths, but around the depth of their Least Common Subsumer (LCS). The LCS is the deepest ‘shared parent’ of two nodes. The depth is defined as the amount of nodes the LCS is separated from the root concept. The deeper the LCS is, the more similar the concepts are. Also, the granularity of an individual concept has to be considered in calculating its specifity (more granular means more ‘subdivisions’, means a more specific concept). This is usually modeled as an extra variable that influences the concept’s specifity. This means that a highly granular node will be less similar to a less granular node.
Link’s direction Ontologies are directed graphs: a connection between two nodes has a direction (chair is_a furniture, does not work the other way around). The more changes in direction the path has between two nodes, the less similar the nodes are.
Node-based measures do not take the connections in the network as a main resource, but rather the information inside and surrounding the nodes. Text-mining and textual analysis techniques can be applied here. For example by comparing both concepts’ textual data, the concept’s contexts, or by comparing the similarity of the concepts’ LCS to the individual concepts. In these cases, a node or node’s context is frequently represented as a ‘bag of words‘, disregarding any form of grammar or semantics. Cosine similarity is a measure which is often used for textually comparing two texts. A common approach to weigh the importance of words in a text is the TF-IDF (term frequency–inverse document frequency) measure.
Other techniques involve counting the amount of surrounding nodes (in a way similar to checking a node’s granularity), the depth of a node (counting the amount of hops from the root-node), etc. The way I see it, a node-based approach is a useful extension on an edge-based approach.
Both these approaches (edge & node-based) are applied in a multitude of algorithms combining edge & node-based, or dealing with either one of the two. For an overview of some common algorithms and their use of edge- and/or node-based approaches, I highly recommend .
What’s left for me is to formulate an approach: picking an edge-based similarity measurement algorithm and implementing it, and finding a node-based approach to extend the edge-based approach with.
For the edge-based algorithm, the bare essentials are already in place:
A breadth-first search algorithm to determine paths between nodes
A method of finding the common parent (LCS) of two nodes
A method of counting the depth of a node
A method of exploring the context of a node
Then, I am looking to extend the structure-based approach by including a similarity-comparison of the linguistic data ‘surrounding’ each node. By retrieving all surrounding nodes of both nodes that I’m comparing, throwing all textual data of the surrounding nodes in a ‘bag of words’, and comparing this bag of words to the second node’s bag of words. This is also called a node-based similarity measure (as opposed to the previously described ‘edge-based’ measure). I will also look into combining this text-comparison system with keyword extraction.
An extremely useful Python framework for all text-comparison tasks at hand (BOW model, cosine similarity measures, TF-IDF) I came across is Gensim. It has all the features I could want to use, plus excellent documentation.
“Gensim is a Python framework designed to automatically extract semantic topics from documents, as naturally and painlessly as possible.”
Pesquita C, Faria D, Falcão AO, Lord P, Couto FM (2009) Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol 5(7): e1000443. doi:10.1371/journal.pcbi.1000443
Al-Mubaid H, Nguyen HA. “A cluster-based approach for semantic similarity in the biomedical domain”, Conf Proc IEEE Eng Med Biol Soc. 2006;1:2713-7.
A. Bramantoro, S. Krishnaswamy, and M. Indrawan, “A Semantic Distance Measure for Matching Web Services”, in Proceedings of WISE Workshops 2005. pp.217~226
E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proc. IJCAI, 2007
M. Andrea Rodríguez, Max J. Egenhofer, “Determining Semantic Similarity among Entity Classes from Different Ontologies,” IEEE Transactions on Knowledge and Data Engineering, pp. 442-456, March/April, 2003
Lee, W N et al. “Comparison of ontology-based semantic-similarity measures.” AMIA Annu Symp Proc 2008 (2008) : 384-388.
I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar, “Text mining and ontologies in biomedicine: Making sense of raw text”, presented at Briefings in Bioinformatics, 2005, pp.239-251.
P. Resnik. “Using information content to evaluate semantic similarity in a taxonomy”, In Proceedings of the 14th international joint conference on Artificial intelligence – Volume 1 (IJCAI’95), pp. 448-453. 1995.
R. Thiagarajan, G. Manjunath, and M. Stumptner. “Computing Semantic Similarity Using Ontologies”, HP Labs, 2008