While I haven’t been as active and hard working on my graduation project as I would have liked to be, I am not dead (nor the project). Earlier this week I presented my project to the Bio-imaging group of Leiden University, which helped me a lot. I was able to present my project pretty much as-is, since I’m mostly done with the technical parts. I received valuable feedback and got good insights into what I should explain more thoroughly in the presentation. » Read this post
To my right is a visualization of the output of my SPARQL-powered shortest path algorithm, finding a link between ‘intracellular and extracellular accumulation’ & ‘developmental and adult structural defect’, 2 concepts in the Mouse Pathology ontology. Click it! It shows the two ‘source’ concepts in white, and the shortest path (of 3 nodes: 4 hops) in grey. It looks like this in python:
[[u'http://purl.obolibrary.org/obo/MPATH_56', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_55'], [u'http://purl.obolibrary.org/obo/MPATH_55', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_603'], [u'http://purl.obolibrary.org/obo/MPATH_1', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_603'], ['http://purl.obolibrary.org/obo/MPATH_33', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_1']]
Which is clearly less awesome.
My algorithm generates a directed subgraph of all the nodes and edges it encounters during its search for a path between two ontology concepts. I figured generating this subgraph would make it easier to get some of the variables I need for the semantic similarity measurement calculation (such as amount changes in directions in the path, node’s depth from the root, etc.). Furthermore, I can use the subgraph to more easily assign weights to the textual data surrounding the nodes, when assembling my bag-of-words model of a node’s direct context, as I’ve explained in my previous post.
What to use
There are heaps of libraries for managing graphs in Python, and loads of programs to visualize and manipulate graphs. Here’s my stuff of choice.
After looking at several python-graph libraries ( NetworkX, igraph, graph-tool, etc.) I choose to use python-graph, which in my opinion is the most straightforward, compact, simple yet powerful graph-lib for python. I use python-graph to generate a file describing the subgraph in DOT language (“plain text graph description language”). This file can then be imported in a wide array of programs to visualize and manipulate the graph.
Visualizing the subgraph containing the shortest path between two nodes would allow me to get a better picture of what my algorithm fetches, and also to double-check the results (seeing is believing ;)). To visualize graphs, there are plenty of options again. After sifting through Tulip, Graphviz and some other obscure programs I stumbled upon Gephi, a very complete, pretty and simple open-source graph visualization & manipulation program. It has extensive documentation, and several advanced features to manipulate the graph and fetch some values. Ideally though, I will manage all those ‘advanced value-fetching tasks’ in my python script. Gephi still provides a nice and quick way to double-check some of the output and get a more concrete idea about what’s happening, as things can get pretty complex, pretty fast:
As explained in an earlier post, I am working on a simple method of extracting ‘important words’ from a text-entry. The methods I am using at the moment are frequency distributions and word collocations. I’ve bumped into some issues regarding finetuning my methods. Read on for a short explanation of my approaches, and some issues regarding them.
Frequency Distribution: POS-tagging y/n?
Extracting keywords by frequency distribution is nothing more than counting words and sorting the list of words by occurrence. Before doing this, I filter stopwords from the text entry. The short explanation on how I’m doing this (sourcecode available at github):
» Tokenize the text (using NLTK’s WordPunctTokenizer)
» Lowercase all the words
» ‘Clean’ the list by removing common stopwords from the list (using NLTK’s English stopwords-list)
This is straightforward enough, an example of the results (from the WikiPedia page of ‘Apoptosis‘):
>>> cyttron.wikiGet('Apoptosis') Apoptosis in wikiTxt >>> freqWords(cyttron.wikiTxt,15) ['apoptosis', 'cell', '160', 'apoptotic', 'cells', 'caspase', 'death', '.&#', 'proteins', 'tnf', 'bcl', 'protein', 'also', 'caspases']
Earlier I was thinking about using POS-tagging (Part-Of-Speech tagging to identify word-types) in order to only extract frequently occurring nouns. I figured losing relevant adjectives (such as ‘red’ in red blood cell) could be compensated by the word collocations extraction. POS-tagging the tokenized text, and retrieving only the most frequent nouns results in:
>>> freqNouns(cyttron.wikiTxt,15) ['apoptosis', 'cell', 'caspase', 'death', 'protein', 'tnf', 'pathway', 'activation', 'membrane', 'p53', 'response', 'family', 'gene', 'greek']
My problem here is I’m not sure which is ‘better’ (if any of those two), or if I should maybe use a combination of both. Also, I haven’t decided yet how to handle non-alphabetic words. Initially I planned on using regular expressions to filter non-alphabetic strings, but I figured later that it wouldn’t make sense in my case. In the above example, this would omit retrieving ‘p53′: a tumor suppressor protein (P53), which is very relevant.
With earlier playing around with POS-tagging I noticed the precision was not quite high enough to perform chunk extractions (by looking for specific phrases / grammatical constructions). Extracting only nouns does seem to do quite the job, even if I still miss some and get some false positives.
Word Collocations: Stopword filtering y/n?
Collocation defines a sequence of words or terms that co-occur more often than would be expected by chance. I generate bi- and trigram word collocations, which mean ’2-word strings’ and ’3-word strings’. My issue here is whether or not to use stopword filtering. Here are the results of the word collocation function on the same WikiPedia page, the 1st list being the bigram collocations, the 2nd being the trigrams. Example without stopword filtering:
>>> wordCollo(cyttron.wikiTxt,10,clean=False) ['such as', 'cell death', 'of the', 'due to', 'leads to', 'programmed cell', 'has been', 'bone marrow', 'have been', 'an increase'] ['adp ribose polymerase', 'amino acid composition', 'anatomist walther flemming', 'boston biologist robert', 'break itself down', 'combining forms preceded', 'count falls below', 'german scientist carl', 'homologous antagonist killer', 'mdm2 complexes displaces']
Example with stopword filtering:
>>> wordCollo(cyttron.wikiTxt,10,clean=True) ['cell death', 'programmed cell', 'bone marrow', 'university aberdeen', 'calcium concentration', 'adenovirus e1b', 'british journal', 'citation needed', 'highly conserved', 'nitric oxide'] ['adp ribose polymerase', 'agar gel electrophoresis', 'amino acid composition', 'anatomist walther flemming', 'appearance agar gel', 'awarded sydney brenner', 'boston biologist robert', 'carl vogt first', 'ceases respire aerobically', 'closely enough warrant']
As you can see, lots of garbage in the first example, but still some collocations that do not appear in the cleaned version. Similar to the noun-extraction issue with the previous approach, I wonder if I should choose for one of the two, or combine them.
In other news
Finding Gensim has been a life-saver! Instead of using Difflib to compare two strings, I now use a proper text-similarity metric, namely cosine similarity measurement. I do so by creating a TF-IDF weighted corpus out of the (stopwords-cleaned) descriptions of ontology-terms I use, and calculating the cosine similarity between an input string and each entry in the corpus. Gensim makes this all a breeze to do. An example of the ouput:
>>> wikiGet('alzheimer') alzheimer in wikiTxt >>> descMatch(wikiTxt,5) Label: Alzheimer's disease Similarity: 0.236387 Description: A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability. Label: vascular dementia Similarity: 0.192565 Description: A dementia that involves impairments in cognitive function caused by problems in blood vessels that feed the brain. Label: dementia Similarity: 0.157553 Description: A cognitive disease resulting from a loss of brain function affecting memory, thinking, language, judgement and behavior. Label: cognitive disease Similarity: 0.13909 Description: A disease of mental health that affects cognitive functions including memory processing, perception and problem solving. Label: encephalitis Similarity: 0.138719 Description: Encephalitis is a nervous system infectious disease characterized as an acute inflammation of the brain. The usual cause is a viral infection, but bacteria can also cause it. Cases can range from mild to severe. For mild cases, you could have flu-like symptoms. Serious cases can cause severe headache, sudden fever, drowsiness, vomiting, confusion and seizures.
I’m not sure if the similarity numbers it produces indicate I’m doing something wrong (there’s no high similarity), but intuitively I would say the results do make sense.
My thesis is about keyword extraction of biological notes, using semantic ‘dictionaries’ called ontologies. These ontologies are large networks, where each node stands for a concept, and each connection between nodes for relations. See the picture on the right for a visual representation of an ontology.
To identify the subject of a text, I need to see what terms that are described in an ontology appear in a text. This leaves me with multiple concepts, of which I need to find the ‘common denominator’. To do this, I have to measure the similarity (or inversely: the distance) between two concepts: if I find a bunch of very similar concepts in one text, I can be more confident of the subject.
Luckily, a lot of people have dealt with this ‘ontology-based semantic similarity measurement’. I gathered and studied a couple of papers, and provide a quick overview of my findings. See my literature list for a more complete overview.
DISCLAIMER: This is by no means intended to be an exhaustive overview. It’s short. I’m sure I’ve not read every relevant paper. Due to time constraints my priority is finding a suitable method to carry on and checking to see if the direction I’m heading is OK (it seems that way). My overview deals with global approaches only, nothing too specific. If I’ve missed anything really obvious, I’d be grateful if you could leave a comment :).
There are two main approaches in ontology-based semantic similarity measurement: edge-based (also called structural or hierarchical approach) and node-based (also called information-content approach).
Edge-based approaches take the structure of the network as a base, focussing on the connections between nodes and their implications/meanings. In edge-based semantic similarity measurement, there are three main principles (which are fortunately pretty much globally agreed-upon – at least in the papers I found):
Shortest-path length between nodes
The most direct approach: the closer two nodes are in the network, the more similar they are. Important detail: path-length is measured by counting (only!) the nodes which have a ‘is_a’ or ‘part_of’ relation. The most primitive semantic similarity measures use only path lengths. However, this shortest-path measure can be extended with:
Node ‘depth’ (aka specifity)
The deeper a node is (farther away from the root), the more specific it is. In most papers this does not revolve around individual node’s depths, but around the depth of their Least Common Subsumer (LCS). The LCS is the deepest ‘shared parent’ of two nodes. The depth is defined as the amount of nodes the LCS is separated from the root concept. The deeper the LCS is, the more similar the concepts are. Also, the granularity of an individual concept has to be considered in calculating its specifity (more granular means more ‘subdivisions’, means a more specific concept). This is usually modeled as an extra variable that influences the concept’s specifity. This means that a highly granular node will be less similar to a less granular node.
Ontologies are directed graphs: a connection between two nodes has a direction (chair is_a furniture, does not work the other way around). The more changes in direction the path has between two nodes, the less similar the nodes are.
Node-based measures do not take the connections in the network as a main resource, but rather the information inside and surrounding the nodes. Text-mining and textual analysis techniques can be applied here. For example by comparing both concepts’ textual data, the concept’s contexts, or by comparing the similarity of the concepts’ LCS to the individual concepts. In these cases, a node or node’s context is frequently represented as a ‘bag of words‘, disregarding any form of grammar or semantics. Cosine similarity is a measure which is often used for textually comparing two texts. A common approach to weigh the importance of words in a text is the TF-IDF (term frequency–inverse document frequency) measure.
Other techniques involve counting the amount of surrounding nodes (in a way similar to checking a node’s granularity), the depth of a node (counting the amount of hops from the root-node), etc. The way I see it, a node-based approach is a useful extension on an edge-based approach.
Both these approaches (edge & node-based) are applied in a multitude of algorithms combining edge & node-based, or dealing with either one of the two. For an overview of some common algorithms and their use of edge- and/or node-based approaches, I highly recommend .
For the edge-based algorithm, the bare essentials are already in place:
- A breadth-first search algorithm to determine paths between nodes
- A method of finding the common parent (LCS) of two nodes
- A method of counting the depth of a node
- A method of exploring the context of a node
Then, I am looking to extend the structure-based approach by including a similarity-comparison of the linguistic data ‘surrounding’ each node. By retrieving all surrounding nodes of both nodes that I’m comparing, throwing all textual data of the surrounding nodes in a ‘bag of words’, and comparing this bag of words to the second node’s bag of words. This is also called a node-based similarity measure (as opposed to the previously described ‘edge-based’ measure). I will also look into combining this text-comparison system with keyword extraction.
An extremely useful Python framework for all text-comparison tasks at hand (BOW model, cosine similarity measures, TF-IDF) I came across is Gensim. It has all the features I could want to use, plus excellent documentation.
“Gensim is a Python framework designed to automatically extract semantic topics from documents, as naturally and painlessly as possible.”
- Pesquita C, Faria D, Falcão AO, Lord P, Couto FM (2009) Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol 5(7): e1000443. doi:10.1371/journal.pcbi.1000443
- Al-Mubaid H, Nguyen HA. “A cluster-based approach for semantic similarity in the biomedical domain”, Conf Proc IEEE Eng Med Biol Soc. 2006;1:2713-7.
- A. Bramantoro, S. Krishnaswamy, and M. Indrawan, “A Semantic Distance Measure for Matching Web Services”, in Proceedings of WISE Workshops 2005. pp.217~226
- E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proc. IJCAI, 2007
- M. Andrea Rodríguez, Max J. Egenhofer, “Determining Semantic Similarity among Entity Classes from Different Ontologies,” IEEE Transactions on Knowledge and Data Engineering, pp. 442-456, March/April, 2003
- Lee, W N et al. “Comparison of ontology-based semantic-similarity measures.” AMIA Annu Symp Proc 2008 (2008) : 384-388.
- I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar, “Text mining and ontologies in biomedicine: Making sense of raw text”, presented at Briefings in Bioinformatics, 2005, pp.239-251.
- P. Resnik. “Using information content to evaluate semantic similarity in a taxonomy”, In Proceedings of the 14th international joint conference on Artificial intelligence – Volume 1 (IJCAI’95), pp. 448-453. 1995.
- R. Thiagarajan, G. Manjunath, and M. Stumptner. “Computing Semantic Similarity Using Ontologies”, HP Labs, 2008
(sorry for the messy list)
As promised, I have spent the last two weeks generating a lot (but not quite 120) results. So let’s take a quick look at what I’ve done and found.
First of all, the Cyttron DB. Here I show 4 different methods of representing the Cyttron database, the 1st is as-is (literal), the 2nd by keyword extraction (10 most frequently occurring words, after filtering for stopwords), the 3rd is by generating synonyms with WordNet for each word in the database, the 4th is by generating synonyms with WordNet for each word of the keyword representation.
Next up, the Wikipedia-page for Alzheimer’s disease. Here I have used the literal text, the 10 most frequently occurring bigrams (2-word words), the 10 most frequently occurring trigrams (3-word words), the 10 most frequently occurring keywords (after stopwords filtering) and the WordNet-boosted text (generating synonyms with WordNet for each word).
The other approach, using the ontologies’ term’s descriptions didn’t quite fare as well as I’d hoped. I used Python’s built-in difflib module, which at the time seemed like the right module to use, but after closer inspection did not quite get the results I was looking for. The next plan is to take a more simple approach, by extracting keywords from the description texts to use as a counting measure in much the same way I do the literal matching.
All the results I generated are hard to evaluate, as long as I do not have a method to measure the relations between the found labels. More labels is not necesarily better, more relevant labels is the goal. When I ‘WordNet’-boost a text (aka generate a bunch of synonyms for each word I find), I do get a lot more literal matches: but I will only know if this makes determining the subject easier or harder once I have a method to relate all found labels to each other and maybe find a cluster of terms which occur frequently.
I am now working on a simple breadth-first search algorithm, which takes a ‘start’-node and a ‘goal’-node, queries for direct neighbours of the start-node one ‘hop’ at a time, until it reaches the goal-node. It will then be possible to determine the relation between two nodes. Note that this will only work within one ontology, if the most frequent terms come from different ontologies, I am forced to use simple linguistic matching (as I am doing now), to determine ‘relatedness’. But as the ontologies all have a distinct field, I imagine the most frequent terms will most likely come from one ontology.
So, after I’ve finished the BFS algorithm, I will have to determine the final keyword-extraction methods, and ways of representing the source data. My current keyword-extraction methods (word frequency and bi/trigrams) rely on a large body of reference material, the longer the DB entry the more effective these methods are (or at least, the more ‘right’ the extracted keywords are, most frequent trigrams from a 10-word entry makes no sense).
Matching terms from ontologies is much more suited for smaller texts. And because of the specificity of the ontologies’ domain, there is an automatic filter of non-relevant words. Bio-ontologies contain biological terms: matching a text to those automatically keeps only the words I’m looking for. The only problem is that you potentially miss out on words which are missing from the ontology, which is an important part of my thesis.
Ideally, the final implementation will use both approaches; ontology matching to quickly find the relevant words, calculate relations, and then keyword extraction to double-check if no important or relevant words have been skipped.
To generate the next bunch of results, I am going to limit the size of both the reference-ontologies as the source data. As WordNet-boosted literal term-matching took well over 20 hours on my laptop, I will limit the ontologies to 1 or 2, and will select around 10 represenatitive Cyttron DB-entries.
I am now running my Sesame RDF Store on my eee-pc (which also hosts @sem_web), which is running 24/7 and accessible from both my computers (desktop and laptop)! Also, I am now on GitHub. There’s more results there, check me out » http://github.com/dvdgrs/thesis.
Just a quick update to let myself know what’s going to happen next: It’s time to produce some results! While I was getting quite stuck in figuring out the best – or rather, most practical – way to extract keywords from a text (and not just any text, mind you, but notes of biologists), my supervisor told me it’s time to get some results. Hard figures. I decided to scrap POS-tagging the notes to extract valuable phrases, after I noticed the accuracy of the default NLTK POS-tagger was way below practical usage. Not too surprising, considering the default NLTK tagger is probably not trained on biologists’ notes.
Anyway, we came up with the following tests:
I use two sources:
- The first being the biologist’s notes (the Cyttron DB).
- The second being specific Wikipedia pages on recurring topics of the Cyttron DB:
Alzheimer, Apoptis, Tau protein & Zebrafish.
From these two sources, I will use five different methods of representing the information:
- Literal representation (using each word, no edit)
- Simple keyword extraction (using word frequency after subtracting english stopwords)
- Bigram collocations
- Trigram collocations
- Keyword combo (word frequency + bigrams + trigrams)
Each of these ways of representing the source information can then be ‘boosted’ by using WordNet to generate synonyms, doubling the ways of representing the data (2×5=10!).
With these 10 different representations of each of the two datasources (2×10), I will use 3 ways to try to determine the subject:
- Literal label matching using two different sets of ontologies:
- Cyttron-set: Gene Ontology, Human Disease Ontology, Mouse Pathology & National Cancer Institute Thesaurus
- DBPedia ontology
- Matching the sources to descriptions of ontologyterms, using the same two sets of ontologies.
- If I manage: Matching the datasources to ‘context‘ of ontology terms.
I started working on a method to take a term in an ontology and explore its surrounding nodes. I will collect all ‘literals’ attached to a node, and throw them in a big pile of text. I will then use this pile of text as a bag of words, to match to the datasources.
This will bring the total amount of tests to be done to 120:
2 sources (wiki/cyttron) 10 representations of these sources 3 methods (literal/desc/context) 2 ontologies (cyttron-set/dbpedia) ------------------------------------- 2 * 10 * 3 * 2 = 120
12/12/12 update: since @sem_web moved to live in my Raspberry Pi, I’ve renamed him @grausPi
@sem_web is able to recognize 249 concepts, defined by the DBPedia ontology, and sends SPARQL queries to the DBPedia endpoint to retrieve more specific information about them. Currently, this means that @sem_web can check an incoming tweet (mention) for known concepts, and then return an instance (example) of the concept, along with a property of this instance, and the value for the property. An example of Sam’s output:
[findConcept] findConcept('video game') [findConcept] Looking for concept: video game [tweet] [u'http://dbpedia.org/class/yago/ComputerGame100458890', 'video game'] [findInst] Seed: [u'http://dbpedia.org/class/yago/ComputerGame100458890', 'video game'] [findInst] Has 367 instances. [findInst] Instance: Fight Night Round 3 [findProp] Has 11 properties. [findProp] [u'http://dbpedia.org/property/platforms', u'platforms'] [findVal] Property: platforms (has 1 values) [findVal] Value: Xbox 360, Xbox, PSP, PS2, PS3 [findVal] Domain: [u'Thing', u'work', u'software'] [findVal] We're talking about a thing... Fight Night Round 3 is a video game. Its platforms is Xbox 360, Xbox, PSP, PS2, PS3.
This is how it works:
- Look for words occurring in the tweet that match a given concept’s label.
- If found (concept): send a SPARQL query to retrieve an instance of the concept (an object with rdf:type concept).
- If not found: send a SPARQL query to retrieve a subClass of the concept. Go to step 1 with subClass as concept.
- If found (instance): send SPARQL queries to retrieve a property, value and domain of the instance. The domain is used to determine whether @sem_web is talking about a human or a thing.
- If no property with a value is found after several tries: Go to step 2 to retrieve a new instance.
- Compose a sentence (currently @sem_web has 4 different sentences) with the information (concept, instance, property, value).
Next to that, @sem_web posts random tweets once an hour, by picking a random concept from the DBPedia ontology. Working on @sem_web allows me to get to grips with both the SPARQL query language, and programming in Python (which, still, is something I haven’t done before in a larger-than-20-lines-of-code way).
What I’m working on next is a method to compare multiple concepts, when @sem_web detects more than one in a tweet. Currently, this works by taking each concept and querying for all the superClasses of the concept. I then store the path from the seed to the topClass (Entity) in a list, repeat the process for the next concept, and then compare both paths to the top, to identify a common parent-Class.
This is relevant for my graduation project as well, because a large task in determining the right subject for a text will be to determine the ‘proximity’ or similarity of different concepts in the text. Still, that specific task of determining ‘similarity’ or proximity of concepts is a much bigger thing, finding common superClasses is just a tiny step towards it. There are other interesting relationships to explore, for example partOf/sameAs relations. I’m curious to see what kind of information I will gather with this from larger texts.
An example of the concept comparison in action. From the following tweet:
>>> randomFriend() Picked mendicot: @offbeattravel .. FYI, my Twitter bot @vagabot found you by parsing (and attempting to answer) travel questions off the Twitter firehose ..
I received the following concepts:
5 concepts found. [u'http://dbpedia.org/class/yago/Bot102311879', u'http://dbpedia.org/class/yago/ChangeOfLocation107311115', u'http://dbpedia.org/class/yago/FYI(TVSeries)', u'http://dbpedia.org/class/yago/Locomotion100283127', u'http://dbpedia.org/class/yago/Travel100295701']
The findCommonParent function takes two URIs and processes them, appending a new list with the superClasses of the initial URI. This way I can track all the ‘hops’ made by counting the list number. As soon as the function processed both URIs, it starts comparing the pathLists to determine the first common parent.
>>> findCommonParents(found,found) [findParents] http://dbpedia.org/class/yago/ChangeOfLocation107311115 [findParents] Hop | Path: [findParents] 0 | [u'http://dbpedia.org/class/yago/ChangeOfLocation107311115'] [findParents] 1 | [u'http://dbpedia.org/class/yago/Movement107309781'] [findParents] 2 | [u'http://dbpedia.org/class/yago/Happening107283608'] [findParents] 3 | [u'http://dbpedia.org/class/yago/Event100029378'] [findParents] 4 | [u'http://dbpedia.org/class/yago/PsychologicalFeature100023100'] [findParents] 5 | [u'http://dbpedia.org/class/yago/Abstraction100002137'] [findParents] 6 | [u'http://dbpedia.org/class/yago/Entity100001740'] [findCommonP] 1st URI processed [findParents] http://dbpedia.org/class/yago/Locomotion100283127 [findParents] Hop | Path: [findParents] 0 | [u'http://dbpedia.org/class/yago/Locomotion100283127'] [findParents] 1 | [u'http://dbpedia.org/class/yago/Motion100279835'] [findParents] 2 | [u'http://dbpedia.org/class/yago/Change100191142'] [findParents] 3 | [u'http://dbpedia.org/class/yago/Action100037396'] [findParents] 4 | [u'http://dbpedia.org/class/yago/Act100030358'] [findParents] 5 | [u'http://dbpedia.org/class/yago/Event100029378'] [findParents] 6 | [u'http://dbpedia.org/class/yago/PsychologicalFeature100023100'] [findParents] 7 | [u'http://dbpedia.org/class/yago/Abstraction100002137'] [findParents] 8 | [u'http://dbpedia.org/class/yago/Entity100001740'] [findCommonP] 2nd URI processed [findCommonP] CommonParent found! [findCommonP] Result1 [findCommonP] matches with result2 [findCommonP] http://dbpedia.org/class/yago/Event100029378 [findCommonP] http://dbpedia.org/class/yago/Event100029378
Here you can see the first common parentClass is ‘Event’: 3 hops away from ‘ChangeOfLocation’, and 5 hops away from ‘Locomotion’. If it finds multiple superClasses, it will process multiple URIs at the same time (in one list). Anyway, this is just the basic stuff. There’s plenty more on my to-do list…
While the major part of the functionality I’m building for @sem_web will be directly usable for my thesis project, I haven’t been sitting still with more directly thesis-related things either. I’ve set up a local RDF store (Sesame store) on my laptop with all the needed bio-ontologies. RDFLib’s in-memory stores were clearly not up for the large ontologies I had to load each time. This also means I have to better structure my queries, as all information is not available at any given time. I also – unfortunately – learned that one of my initial plans: finding the shortest path between two nodes in an RDF store to determine ‘proximity’, is actually quite a complicated task. Next I will focus more on improving the concept comparison, taking more properties into account than only rdfs:subClass, and I’ll also work on extracting keywords (which I haven’t, but should have arranged testing data for)… Till next time!
But mostly, the last weeks I’ve been learning SPARQL, improving my Python skills, and getting a better and more concrete idea of the possible approaches for my thesis project by working on sem_web.
So, I am well underway finalizing the first part of my graduation project, the information extraction part. To re-iterate, I am currently working on matching textual content of a database to that of several ontology-files (big dictionaries containing loads of ‘things’ with relations defined). This is a flow-chart of the system I’m planning to build:
Currently I am working on my final project of the Media Technology MSc. Programme of Leiden University. With the goal of structuring my thoughts and process so far, and because I’ve promised on Twitter, I decided to write a small and simple summary of what my project is about, how I got here and what I’m expecting to do in the next 2-3months. If you want to jump ahead to what my project is about, jump to here.
A short history of my Media Technology graduation project
The idea of a graduation project for this particular master’s programme is to come up with your own inspiration to conduct a small autonomous research project. As Media Technology resides under the Leiden Institute of Advanced Computer Science faculty, using ‘computer science’ as a tool in your research is not very uncommon.
After finalizing the last couple of courses, I started out looking for inspiration for a research project. From a previous course I came into contact with (low-level) text analysis tasks, using the Python programming language and NLTK (Natural Language ToolKit, a very cool, free and open-source text analysis ‘swiss army knife’). I became interested in the possibilities of (statistical) text analysis. I liked the idea of using simple tools to perform research on the web, so I started looking at the features of NLTK and different Natural Language Processing techniques to include semantics in “web-research”. Having found these starting points, it was time to formulate research proposals.
My initial proposal was not very well fleshed out, more of a way to let the Media Technology board know what I was looking at, and basically to receive a go for the actual work (which to me still was to define my actual project). The proposal involved crawling lots of blogs to perform small scale analyses on, using low-level NLP techniques to go beyond simple statistics and wordfrequency-type research – to include meaning and semantics. The board decided my proposals were concrete enough to approve.
Another part of sending in my proposals and going ahead with the project was finding a supervisor. From a course on AI I took last year I remembered a PhD Student at Leiden University, who was involved/interested in semantics and the semantic web, so I figured he would be the right person to talk to. Soon after contacting him I understood he was only allowed to supervise me if my research contributed to what the Bio-Imaging Group was working on. This worried me at first, but after talking with Joris, I figured my project could actually be close enough to what I wanted to do, with the added advantages that:
- My research would actually contribute to something
- My domain would be comfortably restricted
The short explanation: Automatically analyzing and categorizing a large number of texts to be able to define their subjects. In my specific case the texts will be ‘free-form’, natural language descriptions of microscopic imagery, from the Cyttron database. This database contains a large number of images, accompanied by a small description (written by scientists) and a small list of tagwords. That is, if either of these fields are filled in at all. Because of the inconsistent style and method of writing these descriptions, an automated system to properly categorize the images would be very useful.
To perform this analysis, the idea is to use biological ontologies. Ontologies are basically large ‘dictionaries’ containing very specific (biological) terms with their definitions. The ontologies do not only contain their definitions, they also contain how these terms relate to each other. It basically provides me with a hierarchy of terms that says what is part of what, equal to what, etc.
Using these ontologies to analyze the texts allows not only to be able to define the subject of the text, but also to use the data in the ontology to be able to say more about the subject than what can be found in the text.
When I run into problems, I could at some point determine whether existing (biological) ontologies are either missing data, or whether there are more fundamental issues with the matching of the human-produced data with the ontologies.
How am I going to do this?
This part is very much subject to change, as I am taking my first steps in the entire semantic web/OWL/RDF-world, but also in the Python/NLTK-programming world. My current idea is:
- Python for text-processing
- RDFLib to read ontologies
- NLTK for the ‘language tasks’: stemming words, filtering for keywords, etc.
- Scanning the database for occurring ontology-terms (literal matches)
- Generating a list of keywords from both the free-form text and the ontology-term descriptions, to try to match those if no literal matches are found. I could try this using a bag-of-words-model, to remove all ‘common’ words, and keep the more specific/interesting ones. Another approach is to remove all stopwords from the texts and count the frequency of the remaining words.
- Possibly looking at keyphrase extraction instead of simple keywords [or maybe looking at word collocations/chunk extraction?].
- Apply fuzzy word matching to allow typo’s in the texts.
- Performing a statistical analysis on the likeliness of the subject. My thought is that ‘more specific’ (aka deeper nested) ontology terms should weigh heavier than more general terms. That I might potentially find clusters of terms (a number of terms that are more related to each other than other terms found) to further specify likeliness of subject matter. But I can imagine that when I actually get at this point, new ideas might emerge.
- The idea is to acquire some (humanly-checked) training data so I can optimize the system and see what approaches work best.
And what if it works?
Even though this sounds far away currently, I will have to take this scenario into account :p. My idea is to use the software I have written in other domains. Maybe even the domain I was thinking about earlier (using the web as a source for research, blogs, social media, news sites, wiki/dbpedia, etc.). I already came across the OpenCYC Ontology – “hundreds of thousands of terms, along with millions of assertions relating the terms to each other, forming an ontology whose domain is all of human consensus reality”. Which sounds pretty damn awesome.
Some quick ideas I had were using this ontology to create some sort of ‘semantic recommender system’ (on what domain? News sites? Blogs?), or find some other way to extract meaning from large corpora of texts. Anyway, those are ideas for the future, but I hope that I’ll be able to play around a bit with different applications by the time I’ve finished what I’m supposed to do :).