First results are in! My computer spent about 20 hours to retrieve and store neighboring concepts of over 10.000 concepts which my Breadth-First-Search algorithm passed through to find the shortest paths between 6 nodes. But here is the result, first the ‘before’ graph, which I showed earlier: all retrieved concepts with their parent relations. Below that is the new graph, which relates all concepts by finding their shortest paths (so far only the orange concepts – from the Gene Ontology).
What were two separate clusters in the first to the left is now one big fat cluster… Which is cool!
Less cool is the time it took… But oh well, looks like I’m going to have to prepare some examples as proof of concepts. Nowhere near realistic realtime performance so far… (however I got a big speed increase by moving my Sesame triple store from my ancient EeePC900 to my desktop computer… Goodbye supercomputer). The good news is that all neighboring nodes I processed so far are cached in a local SQLite database, so those 20 hours were not a waste! (considering my total ontology database consists out of over 800.000 concepts, and 10.000 concepts took 20 hours, is something I choose not to take into consideration however :p).
It is important to note that the meaning or interpretation of the resulting graph (and particularly the relations between concepts) is not the primary concern here: the paths (their lengths, the directions of the edges and the node’s ‘depths’) will be primarily used for the ontology-based semantic similarity measure I wrote about in this post.
As promised, I have spent the last two weeks generating a lot (but not quite 120) results. So let’s take a quick look at what I’ve done and found.
First of all, the Cyttron DB. Here I show 4 different methods of representing the Cyttron database, the 1st is as-is (literal), the 2nd by keyword extraction (10 most frequently occurring words, after filtering for stopwords), the 3rd is by generating synonyms with WordNet for each word in the database, the 4th is by generating synonyms with WordNet for each word of the keyword representation.
Next up, the Wikipedia-page for Alzheimer’s disease. Here I have used the literal text, the 10 most frequently occurring bigrams (2-word words), the 10 most frequently occurring trigrams (3-word words), the 10 most frequently occurring keywords (after stopwords filtering) and the WordNet-boosted text (generating synonyms with WordNet for each word).
The other approach, using the ontologies’ term’s descriptions didn’t quite fare as well as I’d hoped. I used Python’s built-in difflib module, which at the time seemed like the right module to use, but after closer inspection did not quite get the results I was looking for. The next plan is to take a more simple approach, by extracting keywords from the description texts to use as a counting measure in much the same way I do the literal matching.
All the results I generated are hard to evaluate, as long as I do not have a method to measure the relations between the found labels. More labels is not necesarily better, more relevant labels is the goal. When I ‘WordNet’-boost a text (aka generate a bunch of synonyms for each word I find), I do get a lot more literal matches: but I will only know if this makes determining the subject easier or harder once I have a method to relate all found labels to each other and maybe find a cluster of terms which occur frequently.
I am now working on a simple breadth-first search algorithm, which takes a ‘start’-node and a ‘goal’-node, queries for direct neighbours of the start-node one ‘hop’ at a time, until it reaches the goal-node. It will then be possible to determine the relation between two nodes. Note that this will only work within one ontology, if the most frequent terms come from different ontologies, I am forced to use simple linguistic matching (as I am doing now), to determine ‘relatedness’. But as the ontologies all have a distinct field, I imagine the most frequent terms will most likely come from one ontology.
So, after I’ve finished the BFS algorithm, I will have to determine the final keyword-extraction methods, and ways of representing the source data. My current keyword-extraction methods (word frequency and bi/trigrams) rely on a large body of reference material, the longer the DB entry the more effective these methods are (or at least, the more ‘right’ the extracted keywords are, most frequent trigrams from a 10-word entry makes no sense).
Matching terms from ontologies is much more suited for smaller texts. And because of the specificity of the ontologies’ domain, there is an automatic filter of non-relevant words. Bio-ontologies contain biological terms: matching a text to those automatically keeps only the words I’m looking for. The only problem is that you potentially miss out on words which are missing from the ontology, which is an important part of my thesis.
Ideally, the final implementation will use both approaches; ontology matching to quickly find the relevant words, calculate relations, and then keyword extraction to double-check if no important or relevant words have been skipped.
To generate the next bunch of results, I am going to limit the size of both the reference-ontologies as the source data. As WordNet-boosted literal term-matching took well over 20 hours on my laptop, I will limit the ontologies to 1 or 2, and will select around 10 represenatitive Cyttron DB-entries.
I am now running my Sesame RDF Store on my eee-pc (which also hosts @sem_web), which is running 24/7 and accessible from both my computers (desktop and laptop)! Also, I am now on GitHub. There’s more results there, check me out » http://github.com/dvdgrs/thesis.
12/12/12 update: since @sem_web moved to live in my Raspberry Pi, I’ve renamed him @grausPi
The last couple of days I’ve spent working on my graduation project by working on a side-project: @sem_web; a Twitter-bot who queries DBPedia [wikipedia’s ‘linked data’ equivalent] for knowledge.
@sem_web is able to recognize 249 concepts, defined by the DBPedia ontology, and sends SPARQL queries to the DBPedia endpoint to retrieve more specific information about them. Currently, this means that @sem_web can check an incoming tweet (mention) for known concepts, and then return an instance (example) of the concept, along with a property of this instance, and the value for the property. An example of Sam’s output:
[findConcept] findConcept('video game')
[findConcept] Looking for concept: video game
[findInst] Seed: [u'http://dbpedia.org/class/yago/ComputerGame100458890',
[findInst] Has 367 instances.
[findInst] Instance: Fight Night Round 3
[findProp] Has 11 properties.
[findProp] [u'http://dbpedia.org/property/platforms', u'platforms']
[findVal] Property: platforms (has 1 values)
[findVal] Value: Xbox 360, Xbox, PSP, PS2, PS3
[findVal] Domain: [u'Thing', u'work', u'software']
[findVal] We're talking about a thing...
Fight Night Round 3 is a video game. Its platforms is Xbox 360, Xbox,
PSP, PS2, PS3.
This is how it works:
Look for words occurring in the tweet that match a given concept’s label.
If found (concept): send a SPARQL query to retrieve an instance of the concept (an object with rdf:type concept).
If not found: send a SPARQL query to retrieve a subClass of the concept. Go to step 1 with subClass as concept.
If found (instance): send SPARQL queries to retrieve a property, value and domain of the instance. The domain is used to determine whether @sem_web is talking about a human or a thing.
If no property with a value is found after several tries: Go to step 2 to retrieve a new instance.
Compose a sentence (currently @sem_web has 4 different sentences) with the information (concept, instance, property, value).
Next to that, @sem_web posts random tweets once an hour, by picking a random concept from the DBPedia ontology. Working on @sem_web allows me to get to grips with both the SPARQL query language, and programming in Python (which, still, is something I haven’t done before in a larger-than-20-lines-of-code way).
What I’m working on next is a method to compare multiple concepts, when @sem_web detects more than one in a tweet. Currently, this works by taking each concept and querying for all the superClasses of the concept. I then store the path from the seed to the topClass (Entity) in a list, repeat the process for the next concept, and then compare both paths to the top, to identify a common parent-Class.
This is relevant for my graduation project as well, because a large task in determining the right subject for a text will be to determine the ‘proximity’ or similarity of different concepts in the text. Still, that specific task of determining ‘similarity’ or proximity of concepts is a much bigger thing, finding common superClasses is just a tiny step towards it. There are other interesting relationships to explore, for example partOf/sameAs relations. I’m curious to see what kind of information I will gather with this from larger texts.
An example of the concept comparison in action. From the following tweet:
Picked mendicot: @offbeattravel .. FYI, my Twitter bot
@vagabot found you by parsing (and attempting to answer)
travel questions off the Twitter firehose ..
The findCommonParent function takes two URIs and processes them, appending a new list with the superClasses of the initial URI. This way I can track all the ‘hops’ made by counting the list number. As soon as the function processed both URIs, it starts comparing the pathLists to determine the first common parent.
Here you can see the first common parentClass is ‘Event’: 3 hops away from ‘ChangeOfLocation’, and 5 hops away from ‘Locomotion’. If it finds multiple superClasses, it will process multiple URIs at the same time (in one list). Anyway, this is just the basic stuff. There’s plenty more on my to-do list…
While the major part of the functionality I’m building for @sem_web will be directly usable for my thesis project, I haven’t been sitting still with more directly thesis-related things either. I’ve set up a local RDF store (Sesame store) on my laptop with all the needed bio-ontologies. RDFLib’s in-memory stores were clearly not up for the large ontologies I had to load each time. This also means I have to better structure my queries, as all information is not available at any given time. I also – unfortunately – learned that one of my initial plans: finding the shortest path between two nodes in an RDF store to determine ‘proximity’, is actually quite a complicated task. Next I will focus more on improving the concept comparison, taking more properties into account than only rdfs:subClass, and I’ll also work on extracting keywords (which I haven’t, but should have arranged testing data for)… Till next time!
But mostly, the last weeks I’ve been learning SPARQL, improving my Python skills, and getting a better and more concrete idea of the possible approaches for my thesis project by working on sem_web.