First results are in! My computer spent about 20 hours to retrieve and store neighboring concepts of over 10.000 concepts which my Breadth-First-Search algorithm passed through to find the shortest paths between 6 nodes. But here is the result, first the ‘before’ graph, which I showed earlier: all retrieved concepts with their parent relations. Below that is the new graph, which relates all concepts by finding their shortest paths (so far only the orange concepts – from the Gene Ontology).
What were two separate clusters in the first to the left is now one big fat cluster… Which is cool!
Less cool is the time it took… But oh well, looks like I’m going to have to prepare some examples as proof of concepts. Nowhere near realistic realtime performance so far… (however I got a big speed increase by moving my Sesame triple store from my ancient EeePC900 to my desktop computer… Goodbye supercomputer). The good news is that all neighboring nodes I processed so far are cached in a local SQLite database, so those 20 hours were not a waste! (considering my total ontology database consists out of over 800.000 concepts, and 10.000 concepts took 20 hours, is something I choose not to take into consideration however :p).
It is important to note that the meaning or interpretation of the resulting graph (and particularly the relations between concepts) is not the primary concern here: the paths (their lengths, the directions of the edges and the node’s ‘depths’) will be primarily used for the ontology-based semantic similarity measure I wrote about in this post.
My thesis is about keyword extraction of biological notes, using semantic ‘dictionaries’ called ontologies. These ontologies are large networks, where each node stands for a concept, and each connection between nodes for relations. See the picture on the right for a visual representation of an ontology.
To identify the subject of a text, I need to see what terms that are described in an ontology appear in a text. This leaves me with multiple concepts, of which I need to find the ‘common denominator’. To do this, I have to measure the similarity (or inversely: the distance) between two concepts: if I find a bunch of very similar concepts in one text, I can be more confident of the subject.
Luckily, a lot of people have dealt with this ‘ontology-based semantic similarity measurement’. I gathered and studied a couple of papers, and provide a quick overview of my findings. See my literature list for a more complete overview.
DISCLAIMER: This is by no means intended to be an exhaustive overview. It’s short. I’m sure I’ve not read every relevant paper. Due to time constraints my priority is finding a suitable method to carry on and checking to see if the direction I’m heading is OK (it seems that way). My overview deals with global approaches only, nothing too specific. If I’ve missed anything really obvious, I’d be grateful if you could leave a comment :).
There are two main approaches in ontology-based semantic similarity measurement: edge-based (also called structural or hierarchical approach) and node-based (also called information-content approach).
Edge-based approaches take the structure of the network as a base, focussing on the connections between nodes and their implications/meanings. In edge-based semantic similarity measurement, there are three main principles (which are fortunately pretty much globally agreed-upon – at least in the papers I found):
Shortest-path length between nodes
The most direct approach: the closer two nodes are in the network, the more similar they are. Important detail: path-length is measured by counting (only!) the nodes which have a ‘is_a’ or ‘part_of’ relation. The most primitive semantic similarity measures use only path lengths. However, this shortest-path measure can be extended with:
Node ‘depth’ (aka specifity)
The deeper a node is (farther away from the root), the more specific it is. In most papers this does not revolve around individual node’s depths, but around the depth of their Least Common Subsumer (LCS). The LCS is the deepest ‘shared parent’ of two nodes. The depth is defined as the amount of nodes the LCS is separated from the root concept. The deeper the LCS is, the more similar the concepts are. Also, the granularity of an individual concept has to be considered in calculating its specifity (more granular means more ‘subdivisions’, means a more specific concept). This is usually modeled as an extra variable that influences the concept’s specifity. This means that a highly granular node will be less similar to a less granular node.
Link’s direction Ontologies are directed graphs: a connection between two nodes has a direction (chair is_a furniture, does not work the other way around). The more changes in direction the path has between two nodes, the less similar the nodes are.
Node-based measures do not take the connections in the network as a main resource, but rather the information inside and surrounding the nodes. Text-mining and textual analysis techniques can be applied here. For example by comparing both concepts’ textual data, the concept’s contexts, or by comparing the similarity of the concepts’ LCS to the individual concepts. In these cases, a node or node’s context is frequently represented as a ‘bag of words‘, disregarding any form of grammar or semantics. Cosine similarity is a measure which is often used for textually comparing two texts. A common approach to weigh the importance of words in a text is the TF-IDF (term frequency–inverse document frequency) measure.
Other techniques involve counting the amount of surrounding nodes (in a way similar to checking a node’s granularity), the depth of a node (counting the amount of hops from the root-node), etc. The way I see it, a node-based approach is a useful extension on an edge-based approach.
Both these approaches (edge & node-based) are applied in a multitude of algorithms combining edge & node-based, or dealing with either one of the two. For an overview of some common algorithms and their use of edge- and/or node-based approaches, I highly recommend .
What’s left for me is to formulate an approach: picking an edge-based similarity measurement algorithm and implementing it, and finding a node-based approach to extend the edge-based approach with.
For the edge-based algorithm, the bare essentials are already in place:
A breadth-first search algorithm to determine paths between nodes
A method of finding the common parent (LCS) of two nodes
A method of counting the depth of a node
A method of exploring the context of a node
Then, I am looking to extend the structure-based approach by including a similarity-comparison of the linguistic data ‘surrounding’ each node. By retrieving all surrounding nodes of both nodes that I’m comparing, throwing all textual data of the surrounding nodes in a ‘bag of words’, and comparing this bag of words to the second node’s bag of words. This is also called a node-based similarity measure (as opposed to the previously described ‘edge-based’ measure). I will also look into combining this text-comparison system with keyword extraction.
An extremely useful Python framework for all text-comparison tasks at hand (BOW model, cosine similarity measures, TF-IDF) I came across is Gensim. It has all the features I could want to use, plus excellent documentation.
“Gensim is a Python framework designed to automatically extract semantic topics from documents, as naturally and painlessly as possible.”
Pesquita C, Faria D, Falcão AO, Lord P, Couto FM (2009) Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol 5(7): e1000443. doi:10.1371/journal.pcbi.1000443
Al-Mubaid H, Nguyen HA. “A cluster-based approach for semantic similarity in the biomedical domain”, Conf Proc IEEE Eng Med Biol Soc. 2006;1:2713-7.
A. Bramantoro, S. Krishnaswamy, and M. Indrawan, “A Semantic Distance Measure for Matching Web Services”, in Proceedings of WISE Workshops 2005. pp.217~226
E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proc. IJCAI, 2007
M. Andrea Rodríguez, Max J. Egenhofer, “Determining Semantic Similarity among Entity Classes from Different Ontologies,” IEEE Transactions on Knowledge and Data Engineering, pp. 442-456, March/April, 2003
Lee, W N et al. “Comparison of ontology-based semantic-similarity measures.” AMIA Annu Symp Proc 2008 (2008) : 384-388.
I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar, “Text mining and ontologies in biomedicine: Making sense of raw text”, presented at Briefings in Bioinformatics, 2005, pp.239-251.
P. Resnik. “Using information content to evaluate semantic similarity in a taxonomy”, In Proceedings of the 14th international joint conference on Artificial intelligence – Volume 1 (IJCAI’95), pp. 448-453. 1995.
R. Thiagarajan, G. Manjunath, and M. Stumptner. “Computing Semantic Similarity Using Ontologies”, HP Labs, 2008
As promised, I have spent the last two weeks generating a lot (but not quite 120) results. So let’s take a quick look at what I’ve done and found.
First of all, the Cyttron DB. Here I show 4 different methods of representing the Cyttron database, the 1st is as-is (literal), the 2nd by keyword extraction (10 most frequently occurring words, after filtering for stopwords), the 3rd is by generating synonyms with WordNet for each word in the database, the 4th is by generating synonyms with WordNet for each word of the keyword representation.
Next up, the Wikipedia-page for Alzheimer’s disease. Here I have used the literal text, the 10 most frequently occurring bigrams (2-word words), the 10 most frequently occurring trigrams (3-word words), the 10 most frequently occurring keywords (after stopwords filtering) and the WordNet-boosted text (generating synonyms with WordNet for each word).
The other approach, using the ontologies’ term’s descriptions didn’t quite fare as well as I’d hoped. I used Python’s built-in difflib module, which at the time seemed like the right module to use, but after closer inspection did not quite get the results I was looking for. The next plan is to take a more simple approach, by extracting keywords from the description texts to use as a counting measure in much the same way I do the literal matching.
All the results I generated are hard to evaluate, as long as I do not have a method to measure the relations between the found labels. More labels is not necesarily better, more relevant labels is the goal. When I ‘WordNet’-boost a text (aka generate a bunch of synonyms for each word I find), I do get a lot more literal matches: but I will only know if this makes determining the subject easier or harder once I have a method to relate all found labels to each other and maybe find a cluster of terms which occur frequently.
I am now working on a simple breadth-first search algorithm, which takes a ‘start’-node and a ‘goal’-node, queries for direct neighbours of the start-node one ‘hop’ at a time, until it reaches the goal-node. It will then be possible to determine the relation between two nodes. Note that this will only work within one ontology, if the most frequent terms come from different ontologies, I am forced to use simple linguistic matching (as I am doing now), to determine ‘relatedness’. But as the ontologies all have a distinct field, I imagine the most frequent terms will most likely come from one ontology.
So, after I’ve finished the BFS algorithm, I will have to determine the final keyword-extraction methods, and ways of representing the source data. My current keyword-extraction methods (word frequency and bi/trigrams) rely on a large body of reference material, the longer the DB entry the more effective these methods are (or at least, the more ‘right’ the extracted keywords are, most frequent trigrams from a 10-word entry makes no sense).
Matching terms from ontologies is much more suited for smaller texts. And because of the specificity of the ontologies’ domain, there is an automatic filter of non-relevant words. Bio-ontologies contain biological terms: matching a text to those automatically keeps only the words I’m looking for. The only problem is that you potentially miss out on words which are missing from the ontology, which is an important part of my thesis.
Ideally, the final implementation will use both approaches; ontology matching to quickly find the relevant words, calculate relations, and then keyword extraction to double-check if no important or relevant words have been skipped.
To generate the next bunch of results, I am going to limit the size of both the reference-ontologies as the source data. As WordNet-boosted literal term-matching took well over 20 hours on my laptop, I will limit the ontologies to 1 or 2, and will select around 10 represenatitive Cyttron DB-entries.
I am now running my Sesame RDF Store on my eee-pc (which also hosts @sem_web), which is running 24/7 and accessible from both my computers (desktop and laptop)! Also, I am now on GitHub. There’s more results there, check me out » http://github.com/dvdgrs/thesis.