Results? Thesis #5

📅 September 5, 2011 🕐 11:45 🏷 Thesis (MSc)

As promised, I have spent the last two weeks generating a lot of results (though not quite 120). So let’s take a quick look at what I’ve done and found.

First of all, the Cyttron DB. Here I show 4 different methods of representing the Cyttron database: the 1st as-is (literal), the 2nd by keyword extraction (the 10 most frequently occurring words, after filtering out stopwords), the 3rd by generating WordNet synonyms for each word in the database, and the 4th by generating WordNet synonyms for each word of the keyword representation.

         Cy-literal   Cy-keywords   Cy-WN    Cy-key-WN
Unique   19,80        3,23          97,58    17,59
Total    30,58        3,18          248,19   25,53
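
For reference, this is roughly what the keyword representation (the 2nd method) looks like in code; a minimal sketch using NLTK, where the function name and the cut-off of 10 words just illustrate the description above and are not the actual implementation.

```python
from collections import Counter

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def extract_keywords(text, n=10):
    """Return the n most frequently occurring non-stopword tokens of a text."""
    stop = set(stopwords.words('english'))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    return [word for word, count in Counter(words).most_common(n)]
```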

Next up, the Wikipedia page for Alzheimer’s disease. Here I have used the literal text, the 10 most frequently occurring bigrams (two-word phrases), the 10 most frequently occurring trigrams (three-word phrases), the 10 most frequently occurring keywords (after stopword filtering) and the WordNet-boosted text (generating synonyms with WordNet for each word).

         Alz-Literal   Alz-bigrams   Alz-trigrams   Alz-keywords   Alz-WN
Unique   803           8             1              5              1385
Total    3292          8             1              6              22.195
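
The bigram and trigram representations can be produced with NLTK’s collocation finders; a rough sketch, assuming likelihood-ratio scoring and the same cut-off of 10 (the actual scoring measure may differ):

```python
from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                               TrigramAssocMeasures, TrigramCollocationFinder)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def top_collocations(text, n=10):
    """Return the n highest-scoring bigrams and trigrams of a text."""
    stop = set(stopwords.words('english'))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    bigram_measures = BigramAssocMeasures()
    trigram_measures = TrigramAssocMeasures()
    bigrams = BigramCollocationFinder.from_words(words)
    trigrams = TrigramCollocationFinder.from_words(words)
    return (bigrams.nbest(bigram_measures.likelihood_ratio, n),
            trigrams.nbest(trigram_measures.likelihood_ratio, n))
```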

The other approach, using the descriptions of the ontologies’ terms, didn’t fare as well as I’d hoped. I used Python’s built-in difflib module, which at the time seemed like the right tool, but on closer inspection it did not give the results I was looking for. The next plan is a simpler approach: extracting keywords from the description texts and using those as a counting measure, in much the same way I do the literal matching.
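
What I have in mind there is, roughly, the sketch below: extract the most frequent keywords from each term description and count how many of them occur in the source text. This is a sketch only; the data structure (a dict mapping term labels to descriptions) is an assumption.

```python
from collections import Counter

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def keyword_set(text, n=10):
    """The n most frequent non-stopword tokens of a text, as a set."""
    stop = set(stopwords.words('english'))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    return {word for word, count in Counter(words).most_common(n)}

def score_descriptions(source_text, descriptions):
    """Count, per ontology term, how many of its description keywords
    occur in the source text. `descriptions` maps label -> description."""
    source_words = {w.lower() for w in word_tokenize(source_text)}
    return {label: len(keyword_set(desc) & source_words)
            for label, desc in descriptions.items()}
```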

All the results I generated are hard to evaluate as long as I do not have a method to measure the relations between the found labels. More labels is not necessarily better; more relevant labels is the goal. When I ‘WordNet’-boost a text (i.e. generate a bunch of synonyms for each word I find), I do get a lot more literal matches, but I will only know whether this makes determining the subject easier or harder once I have a method to relate all found labels to each other and maybe find a cluster of terms which occur frequently.

What’s next?

I am now working on a simple breadth-first search algorithm, which takes a ‘start’ node and a ‘goal’ node and queries for direct neighbours of the start node one ‘hop’ at a time, until it reaches the goal node. It will then be possible to determine the relation between two nodes. Note that this will only work within one ontology; if the most frequent terms come from different ontologies, I am forced to use simple linguistic matching (as I am doing now) to determine ‘relatedness’. But as the ontologies all cover distinct fields, I imagine the most frequent terms will most likely come from one ontology.
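
In rough Python, the BFS boils down to something like the sketch below. The neighbours() argument stands in for whatever one-hop query is run against the RDF store; it is not the actual query code.

```python
from collections import deque

def bfs_path(start, goal, neighbours):
    """Breadth-first search from start to goal.
    `neighbours(node)` should return the direct neighbours of a node,
    e.g. the result of a one-hop query against the RDF store.
    Returns the shortest path as a list of nodes, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:       # walk back to the start node
                path.append(node)
                node = parents[node]
            return list(reversed(path))
        for nxt in neighbours(node):
            if nxt not in parents:        # not visited yet
                parents[nxt] = node
                queue.append(nxt)
    return None                           # no path within this ontology
```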

So, after I’ve finished the BFS algorithm, I will have to determine the final keyword-extraction methods and the ways of representing the source data. My current keyword-extraction methods (word frequency and bi/trigrams) rely on a large body of reference material: the longer the DB entry, the more effective these methods are (or at least, the more ‘right’ the extracted keywords are; the most frequent trigrams from a 10-word entry make no sense).

Matching terms from ontologies is much better suited to smaller texts. And because of the specificity of the ontologies’ domain, there is an automatic filter of non-relevant words: bio-ontologies contain biological terms, so matching a text to those automatically keeps only the words I’m looking for. The only problem is that you potentially miss words which are missing from the ontology, which is an important part of my thesis.

Ideally, the final implementation will use both approaches: ontology matching to quickly find the relevant words and calculate relations, and keyword extraction to double-check that no important or relevant words have been skipped.

To generate the next batch of results, I am going to limit the size of both the reference ontologies and the source data. As WordNet-boosted literal term-matching took well over 20 hours on my laptop, I will limit the ontologies to 1 or 2, and will select around 10 representative Cyttron DB entries.

I am now running my Sesame RDF store on my eee-pc (which also hosts @sem_web); it is running 24/7 and accessible from both my computers (desktop and laptop)! Also, I am now on GitHub. There are more results there, so check me out » http://github.com/dvdgrs/thesis.

[Read all thesis-related posts here]

Graduation pt. 4: What’s next

📅 August 19, 2011 🕐 10:41 🏷 Thesis (MSc)

Just a quick update to let myself know what’s going to happen next: it’s time to produce some results! While I was getting quite stuck figuring out the best – or rather, most practical – way to extract keywords from a text (and not just any text, mind you, but notes of biologists), my supervisor told me it’s time to get some results. Hard figures. I decided to scrap POS-tagging the notes to extract valuable phrases, after I noticed the accuracy of the default NLTK POS-tagger was far too low for practical use. Not too surprising, considering the default NLTK tagger is probably not trained on biologists’ notes.

Anyway, we came up with the following tests:

I use two sources:

  1. The first being the biologists’ notes (the Cyttron DB).
  2. The second being specific Wikipedia pages on recurring topics in the Cyttron DB:
    Alzheimer, Apoptosis, Tau protein & Zebrafish.

From these two sources, I will use five different methods of representing the information:

  1. Literal representation (using each word, no edit)
  2. Simple keyword extraction (using word frequency after subtracting English stopwords)
  3. Bigram collocations
  4. Trigram collocations
  5. Keyword combo (word frequency + bigrams + trigrams)

Each of these ways of representing the source information can then be ‘boosted’ by using WordNet to generate synonyms, doubling the ways of representing the data (2×5=10!).
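
The ‘boosting’ itself is plain synonym expansion with NLTK’s WordNet interface; a minimal sketch, assuming every synset of a word is taken along (the actual filtering may be stricter):

```python
from nltk.corpus import wordnet

def wordnet_boost(words):
    """Expand a list of words with the lemma names of all their WordNet synsets."""
    boosted = set(words)
    for word in words:
        for synset in wordnet.synsets(word):
            boosted.update(lemma.replace('_', ' ')
                           for lemma in synset.lemma_names())
    return boosted
```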

With these 10 different representations of each of the two data sources (2×10), I will use 3 ways to try to determine the subject:

  1. Literal label matching using two different sets of ontologies:
    1. Cyttron-set: Gene Ontology, Human Disease Ontology, Mouse Pathology & National Cancer Institute Thesaurus
    2. DBPedia ontology
  2. Matching the sources to descriptions of ontology terms, using the same two sets of ontologies.
  3. If I manage: matching the data sources to the ‘context’ of ontology terms.
    I started working on a method to take a term in an ontology and explore its surrounding nodes. I will collect all ‘literals’ attached to a node and throw them in a big pile of text. I will then use this pile of text as a bag of words, to match to the data sources.
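
A rough idea of what collecting that pile of text could look like with RDFLib, for one hop of neighbours (the depth and the direction of the hop are assumptions, not the final design):

```python
from rdflib import Graph, Literal

def context_text(graph, term):
    """Collect all literals attached to a term and to its direct
    neighbours into one bag-of-words string."""
    literals, neighbours = [], set()
    for _, _, obj in graph.triples((term, None, None)):
        if isinstance(obj, Literal):
            literals.append(str(obj))
        else:
            neighbours.add(obj)
    for node in neighbours:
        for _, _, obj in graph.triples((node, None, None)):
            if isinstance(obj, Literal):
                literals.append(str(obj))
    return ' '.join(literals)
```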

This will bring the total number of tests to be done to 120:

  2 sources (wiki/cyttron)
 10 representations of these sources
  3 methods (literal/desc/context)
  2 ontologies (cyttron-set/dbpedia)
 -------------------------------------
  2 * 10 * 3 * 2 = 120

And in-between I also have Lowlands 2011 and Vollt to attend. Oh gosh…

[read all my thesis-related posts]

Graduation Project pt. 2

📅 July 29, 2011 🕐 23:58 🏷 Thesis (MSc)

So, I am well underway with finalizing the first part of my graduation project: the information extraction part. To reiterate, I am currently working on matching the textual content of a database to that of several ontology files (big dictionaries containing loads of ‘things’ with relations defined between them). This is a flow-chart of the system I’m planning to build:

(more…)

Graduation project

📅 July 8, 2011 🕐 20:22 🏷 Thesis (MSc)

Currently I am working on my final project of the Media Technology MSc programme of Leiden University. With the goal of structuring my thoughts and process so far, and because I’ve promised to on Twitter, I decided to write a small and simple summary of what my project is about, how I got here and what I’m expecting to do in the next 2-3 months. If you want to jump ahead to what my project is about, jump to here.

A short history of my Media Technology graduation project

The idea of a graduation project for this particular master’s programme is to come up with your own inspiration and conduct a small autonomous research project. As Media Technology resides under the Leiden Institute of Advanced Computer Science faculty, using ‘computer science’ as a tool in your research is not uncommon.

After finalizing the last couple of courses, I started out looking for inspiration for a research project. From a previous course I had come into contact with (low-level) text analysis tasks, using the Python programming language and NLTK (the Natural Language Toolkit, a very cool, free and open-source text analysis ‘Swiss army knife’). I became interested in the possibilities of (statistical) text analysis. I liked the idea of using simple tools to perform research on the web, so I started looking at the features of NLTK and different Natural Language Processing techniques to include semantics in “web-research”. Having found these starting points, it was time to formulate research proposals.

My initial proposal was not very well fleshed out; it was more a way to let the Media Technology board know what I was looking at, and basically to receive a go for the actual work (which to me still was to define my actual project). The proposal involved crawling lots of blogs to perform small-scale analyses on, using low-level NLP techniques to go beyond simple statistics and word-frequency-type research and to include meaning and semantics. The board decided my proposals were concrete enough to approve.

Another part of sending in my proposals and going ahead with the project was finding a supervisor. From a course on AI I took last year I remembered a PhD Student at Leiden University, who was involved/interested in semantics and the semantic web, so I figured he would be the right person to talk to. Soon after contacting him I understood he was only allowed to supervise me if my research contributed to what the Bio-Imaging Group was working on. This worried me at first, but after talking with Joris, I figured my project could actually be close enough to what I wanted to do, with the added advantages that:

  • My research would actually contribute to something
  • My domain would be comfortably restricted

So, what am I actually going to do?

The short explanation: automatically analyzing and categorizing a large number of texts to be able to define their subjects. In my specific case the texts are ‘free-form’, natural-language descriptions of microscopic imagery from the Cyttron database. This database contains a large number of images, accompanied by a small description (written by scientists) and a small list of tag words. That is, if either of these fields is filled in at all. Because of the inconsistent style and method of writing these descriptions, an automated system to properly categorize the images would be very useful.

To perform this analysis, the idea is to use biological ontologies. Ontologies are basically large ‘dictionaries’ containing very specific (biological) terms with their definitions. The ontologies do not only contain the definitions, they also describe how these terms relate to each other. This basically provides me with a hierarchy of terms that says what is part of what, what is equal to what, etc.

Using these ontologies to analyze the texts allows me not only to define the subject of the text, but also to use the data in the ontology to say more about the subject than what can be found in the text itself.

When I run into problems, I could at some point determine whether existing (biological) ontologies are missing data, or whether there are more fundamental issues with matching the human-produced data to the ontologies.

How am I going to do this?

This part is very much subject to change, as I am taking my first steps in the entire semantic web/OWL/RDF-world, but also in the Python/NLTK-programming world. My current idea is:

Tools

  • Python for text-processing
  • RDFLib to read ontologies (a small example follows after this list)
  • NLTK for the ‘language tasks’: stemming words, filtering for keywords, etc.

Approach

  1. Scanning the database for occurring ontology-terms (literal matches)
  2. Generating a list of keywords from both the free-form text and the ontology-term descriptions, to try to match those if no literal matches are found. I could try this using a bag-of-words model to remove all ‘common’ words and keep the more specific/interesting ones. Another approach is to remove all stopwords from the texts and count the frequency of the remaining words.
  3. Possibly looking at keyphrase extraction instead of simple keywords [or maybe looking at word collocations/chunk extraction?]. 
  4. Apply fuzzy word matching to allow for typos in the texts (see the rough sketch after this list).
  5. Performing a statistical analysis on the likeliness of the subject. My thought is that ‘more specific’ (i.e. deeper-nested) ontology terms should weigh heavier than more general terms, and that I might find clusters of terms (a number of terms that are more related to each other than to the other terms found) to further specify the likeliness of the subject matter. But I can imagine that when I actually get to this point, new ideas might emerge.
  6. The idea is to acquire some (human-checked) training data so I can optimize the system and see which approaches work best.
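
For step 4, one possible route is difflib’s get_close_matches; a quick sketch (the cutoff value is an arbitrary assumption, not something I have tuned):

```python
import difflib

def fuzzy_match(word, ontology_labels, cutoff=0.85):
    """Return ontology labels that closely resemble `word`,
    tolerating small typos."""
    return difflib.get_close_matches(word, ontology_labels, n=3, cutoff=cutoff)

# Example: fuzzy_match('apoptsis', ['apoptosis', 'necrosis', 'mitosis'])
# -> ['apoptosis']
```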
And that’s about as far as I am right now. The real work – new problems and approaches – will probably surface as soon as I get more into the material.

And what if it works?

Even though this sounds far away currently, I will have to take this scenario into account :p. My idea is to use the software I have written in other domains. Maybe even the domain I was thinking about earlier (using the web as a source for research, blogs, social media, news sites, wiki/dbpedia, etc.). I already came across the OpenCYC Ontology – “hundreds of thousands of terms, along with millions of assertions relating the terms to each other, forming an ontology whose domain is all of human consensus reality”. Which sounds pretty damn awesome.

Some quick ideas I had were using this ontology to create some sort of ‘semantic recommender system’ (on what domain? News sites? Blogs?), or find some other way to extract meaning from large corpora of texts. Anyway, those are ideas for the future, but I hope that I’ll be able to play around a bit with different applications by the time I’ve finished what I’m supposed to do :).