As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’. It ignores any semantic information and simply treats the concepts as words, and counts hits and misses between two sets of concepts.
To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will measure the algorithm’s performance by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm) are treated as subgraphs within the main graph: the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, as in the literature I found it consistently outperforms similar path-based metrics in the Biomedical domain. Speaking of domain; this measure has originally been designed for WordNet (as many of the other metrics), but has also been used and validated in the Biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-). (more…)
While I haven’t been as active and hard working on my graduation project as I would have liked to be, I am not dead (nor the project). Earlier this week I presented my project to the Bio-imaging group of Leiden University, which helped me a lot. I was able to present my project pretty much as-is, since I’m mostly done with the technical parts. I received valuable feedback and got good insights into what I should explain more thoroughly in the presentation. (more…)
First results are in! My computer spent about 20 hours to retrieve and store neighboring concepts of over 10.000 concepts which my Breadth-First-Search algorithm passed through to find the shortest paths between 6 nodes. But here is the result, first the ‘before’ graph, which I showed earlier: all retrieved concepts with their parent relations. Below that is the new graph, which relates all concepts by finding their shortest paths (so far only the orange concepts – from the Gene Ontology).
What were two separate clusters in the first to the left is now one big fat cluster… Which is cool!
Less cool is the time it took… But oh well, looks like I’m going to have to prepare some examples as proof of concepts. Nowhere near realistic realtime performance so far… (however I got a big speed increase by moving my Sesame triple store from my ancient EeePC900 to my desktop computer… Goodbye supercomputer). The good news is that all neighboring nodes I processed so far are cached in a local SQLite database, so those 20 hours were not a waste! (considering my total ontology database consists out of over 800.000 concepts, and 10.000 concepts took 20 hours, is something I choose not to take into consideration however :p).
It is important to note that the meaning or interpretation of the resulting graph (and particularly the relations between concepts) is not the primary concern here: the paths (their lengths, the directions of the edges and the node’s ‘depths’) will be primarily used for the ontology-based semantic similarity measure I wrote about in this post.
The first steps towards creating graph-based text visualizations is thinking about how I would like to visually represent the structure of the information I extract from texts. I have included a preliminary example of the output at the bottom of this page. I use three different methods of extracting concepts from a piece of text:
Counting literal occurrences of concepts (from ontologies)
Finding related concepts by textual comparison of the text to the concepts’ descriptions
Finding related concepts by exploring the ontological structure (aka relating concepts within one ontology by finding paths and parents, and possibly relevant neighbours)
The primary distinction I want to make is that of relevance (aka ‘likeliness of topic’). In the case of 1 this would be the frequency of the word (more occurrences = more relevant concept), in the case of 2 this is the calculated similarity of the source text to the concepts’ descriptions (which is a number between 0 and 1). In the case of 3, this would be by ‘connections’: the more concepts the concepts I find by exploring an ontologies’ structure would link, the more relevant this found concept would be. I want to model this distinction by node’s size: the more relevant a concept is, the bigger I want to draw the concept in the graph.
The second distinction is that of literal to non-literal (1 being literal, 2 and 3 being non-literal). I want to model this distinction by style: literal concepts will be drawn as a filled circle, non-literal concepts as outlined circles.
The third distinction is that of the concepts’ source: from which ontology does a concept originate? This distinction will be modeled by color: each ontology (of the six I use) will have its distinct color. Explored concepts (from step 3) such as parents and shared parents will be colored in distinct colours as well, since they are connected in the graph to the coloured nodes, their source will implicitly be clear.
Since Gephi doesn’t fully support Graphviz’ DOT-language, and the graph library I use in Python conveniently parses graphs in DOT, I use Graphviz to directly render the results.
An issue I came across with the scaling (to represent relevance), is that I’m working with multiple measures: frequency of literal words (1), percentage of text-similarity (2), and degree-count. In an effort to roughly equalize the scaling factor, I decided to use static counters. Each node gets an initial size of 0.5 (0.4 for shared parents, and 0.3 for parents). For each occurrence of a literal word, I add 0.05. For the text-similarity, I add the percentage of 0.5 (26% similarity = 0.5 + (0.5*0.26)). For the degree, I add 0.1 for each in-link the node receives. This is an initial attempt at unifying the results. Anyway, these are just settings I’m playing around with.
These are very rough examples, because:
I use the literal representation of the input text (as I have not yet determined the most suitable keyword extraction method)
I haven’t determined a proper ‘cut-off’ for the text similarity measure; currently it includes every concept it finds with similarity ≥ 25%.
It doesn’t yet fully incorporate step 3 (it includes parents, but not yet paths between nodes)
It doesn’t scale nodes according to in-links
There is no filtering applied yet (removing obsolete classes, for example).
To my right is a visualization of the output of my SPARQL-powered shortest path algorithm, finding a link between ‘intracellular and extracellular accumulation’ & ‘developmental and adult structural defect’, 2 concepts in the Mouse Pathology ontology. Click it! It shows the two ‘source’ concepts in white, and the shortest path (of 3 nodes: 4 hops) in grey. It looks like this in python:
My algorithm generates a directed subgraph of all the nodes and edges it encounters during its search for a path between two ontology concepts. I figured generating this subgraph would make it easier to get some of the variables I need for the semantic similarity measurement calculation (such as amount changes in directions in the path, node’s depth from the root, etc.). Furthermore, I can use the subgraph to more easily assign weights to the textual data surrounding the nodes, when assembling my bag-of-words model of a node’s direct context, as I’ve explained in my previous post.
What to use
There are heaps of libraries for managing graphs in Python, and loads of programs to visualize and manipulate graphs. Here’s my stuff of choice.
After looking at several python-graph libraries ( NetworkX, igraph, graph-tool, etc.) I choose to use python-graph, which in my opinion is the most straightforward, compact, simple yet powerful graph-lib for python. I use python-graph to generate a file describing the subgraph in DOT language (“plain text graph description language”). This file can then be imported in a wide array of programs to visualize and manipulate the graph.
Visualizing the subgraph containing the shortest path between two nodes would allow me to get a better picture of what my algorithm fetches, and also to double-check the results (seeing is believing ;)). To visualize graphs, there are plenty of options again. After sifting through Tulip, Graphviz and some other obscure programs I stumbled upon Gephi, a very complete, pretty and simple open-source graph visualization & manipulation program. It has extensive documentation, and several advanced features to manipulate the graph and fetch some values. Ideally though, I will manage all those ‘advanced value-fetching tasks’ in my python script. Gephi still provides a nice and quick way to double-check some of the output and get a more concrete idea about what’s happening, as things can get pretty complex, pretty fast:
1136 nodes of the DOID-ontology, subgraph produced when finding a link for DOID_4 (disease) and DOID_10652 (Alzheimer's Disease) - red nodes in the graph. Linked by the three yellow nodes.
12/12/12 update: since @sem_web moved to live in my Raspberry Pi, I’ve renamed him @grausPi
The last couple of days I’ve spent working on my graduation project by working on a side-project: @sem_web; a Twitter-bot who queries DBPedia [wikipedia’s ‘linked data’ equivalent] for knowledge.
@sem_web is able to recognize 249 concepts, defined by the DBPedia ontology, and sends SPARQL queries to the DBPedia endpoint to retrieve more specific information about them. Currently, this means that @sem_web can check an incoming tweet (mention) for known concepts, and then return an instance (example) of the concept, along with a property of this instance, and the value for the property. An example of Sam’s output:
[findConcept] findConcept('video game')
[findConcept] Looking for concept: video game
[findInst] Seed: [u'http://dbpedia.org/class/yago/ComputerGame100458890',
[findInst] Has 367 instances.
[findInst] Instance: Fight Night Round 3
[findProp] Has 11 properties.
[findProp] [u'http://dbpedia.org/property/platforms', u'platforms']
[findVal] Property: platforms (has 1 values)
[findVal] Value: Xbox 360, Xbox, PSP, PS2, PS3
[findVal] Domain: [u'Thing', u'work', u'software']
[findVal] We're talking about a thing...
Fight Night Round 3 is a video game. Its platforms is Xbox 360, Xbox,
PSP, PS2, PS3.
This is how it works:
Look for words occurring in the tweet that match a given concept’s label.
If found (concept): send a SPARQL query to retrieve an instance of the concept (an object with rdf:type concept).
If not found: send a SPARQL query to retrieve a subClass of the concept. Go to step 1 with subClass as concept.
If found (instance): send SPARQL queries to retrieve a property, value and domain of the instance. The domain is used to determine whether @sem_web is talking about a human or a thing.
If no property with a value is found after several tries: Go to step 2 to retrieve a new instance.
Compose a sentence (currently @sem_web has 4 different sentences) with the information (concept, instance, property, value).
Next to that, @sem_web posts random tweets once an hour, by picking a random concept from the DBPedia ontology. Working on @sem_web allows me to get to grips with both the SPARQL query language, and programming in Python (which, still, is something I haven’t done before in a larger-than-20-lines-of-code way).
What I’m working on next is a method to compare multiple concepts, when @sem_web detects more than one in a tweet. Currently, this works by taking each concept and querying for all the superClasses of the concept. I then store the path from the seed to the topClass (Entity) in a list, repeat the process for the next concept, and then compare both paths to the top, to identify a common parent-Class.
This is relevant for my graduation project as well, because a large task in determining the right subject for a text will be to determine the ‘proximity’ or similarity of different concepts in the text. Still, that specific task of determining ‘similarity’ or proximity of concepts is a much bigger thing, finding common superClasses is just a tiny step towards it. There are other interesting relationships to explore, for example partOf/sameAs relations. I’m curious to see what kind of information I will gather with this from larger texts.
An example of the concept comparison in action. From the following tweet:
Picked mendicot: @offbeattravel .. FYI, my Twitter bot
@vagabot found you by parsing (and attempting to answer)
travel questions off the Twitter firehose ..
The findCommonParent function takes two URIs and processes them, appending a new list with the superClasses of the initial URI. This way I can track all the ‘hops’ made by counting the list number. As soon as the function processed both URIs, it starts comparing the pathLists to determine the first common parent.
Here you can see the first common parentClass is ‘Event’: 3 hops away from ‘ChangeOfLocation’, and 5 hops away from ‘Locomotion’. If it finds multiple superClasses, it will process multiple URIs at the same time (in one list). Anyway, this is just the basic stuff. There’s plenty more on my to-do list…
While the major part of the functionality I’m building for @sem_web will be directly usable for my thesis project, I haven’t been sitting still with more directly thesis-related things either. I’ve set up a local RDF store (Sesame store) on my laptop with all the needed bio-ontologies. RDFLib’s in-memory stores were clearly not up for the large ontologies I had to load each time. This also means I have to better structure my queries, as all information is not available at any given time. I also – unfortunately – learned that one of my initial plans: finding the shortest path between two nodes in an RDF store to determine ‘proximity’, is actually quite a complicated task. Next I will focus more on improving the concept comparison, taking more properties into account than only rdfs:subClass, and I’ll also work on extracting keywords (which I haven’t, but should have arranged testing data for)… Till next time!
But mostly, the last weeks I’ve been learning SPARQL, improving my Python skills, and getting a better and more concrete idea of the possible approaches for my thesis project by working on sem_web.