David Graus

Graduation project

Friday, July 8, 2011

Currently I am working on my final project of the Media Technology MSc programme at Leiden University. With the goal of structuring my thoughts and process so far, and because I promised to on Twitter, I decided to write a small and simple summary of what my project is about, how I got here, and what I expect to do in the next 2-3 months. If you want to skip the background, jump ahead to the section "So, what am I actually going to do?" below.

A short history of my Media Technology graduation project

The idea of the graduation project for this particular master's programme is to come up with your own idea for a small, autonomous research project. As Media Technology resides under the Leiden Institute of Advanced Computer Science (LIACS), using 'computer science' as a tool in your research is not uncommon.

After finishing the last couple of courses, I started looking for inspiration for a research project. In a previous course I had come into contact with (low-level) text analysis tasks, using the Python programming language and NLTK (the Natural Language Toolkit, a very cool, free and open-source text analysis 'Swiss army knife'). I became interested in the possibilities of (statistical) text analysis. I liked the idea of using simple tools to perform research on the web, so I started looking at the features of NLTK and at different Natural Language Processing techniques to include semantics in 'web research'. Having found these starting points, it was time to formulate research proposals.

My initial proposal was not very well fleshed out; it was more a way to let the Media Technology board know what I was looking at, and basically to receive a go for the actual work (which, to me, still included defining the actual project). The proposal involved crawling lots of blogs to perform small-scale analyses on, using low-level NLP techniques to go beyond simple statistics and word-frequency-type research and to include meaning and semantics. The board decided my proposal was concrete enough to approve.

Another part of sending in my proposal and going ahead with the project was finding a supervisor. From a course on AI I took last year I remembered a PhD student at Leiden University who was involved in and interested in semantics and the semantic web, so I figured he would be the right person to talk to. Soon after contacting him I learned he was only allowed to supervise me if my research contributed to what the Bio-Imaging Group was working on. This worried me at first, but after talking with Joris I figured my project could actually stay close enough to what I wanted to do, with the added advantages that:

  • My research would actually contribute to something
  • My domain would be comfortably restricted

So, what am I actually going to do?

The short explanation: automatically analyzing and categorizing a large number of texts in order to determine their subjects. In my specific case the texts are free-form, natural-language descriptions of microscopy images from the Cyttron database. This database contains a large number of images, each accompanied by a short description (written by scientists) and a small list of tag words; that is, if either of these fields is filled in at all. Because these descriptions are written in inconsistent styles, an automated system that properly categorizes the images would be very useful.

To perform this analysis, the idea is to use biological ontologies. Ontologies are basically large 'dictionaries' containing very specific (biological) terms and their definitions. They do not only contain the definitions; they also describe how the terms relate to each other. This basically provides me with a hierarchy of terms that says what is part of what, what is equal to what, and so on.

Using these ontologies to analyze the texts makes it possible not only to determine the subject of a text, but also to use the relations in the ontology to say more about that subject than can be found in the text itself.
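
To get a feel for what such a hierarchy looks like in practice, here is a minimal sketch of reading an ontology with RDFLib and walking its rdfs:subClassOf relations. The file name and the Gene Ontology term URI are just placeholders; the actual Cyttron ontologies and the eventual code will no doubt look different.

    # Minimal sketch (placeholder file and URI, not the final pipeline): load an
    # OWL/RDF ontology with RDFLib and walk the rdfs:subClassOf hierarchy.
    from rdflib import Graph, RDFS, URIRef

    g = Graph()
    g.parse("go.owl", format="xml")  # placeholder: any RDF/XML ontology file

    def label(term):
        """Return the first rdfs:label of a term, falling back to its URI."""
        for lbl in g.objects(term, RDFS.label):
            return str(lbl)
        return str(term)

    def children(term):
        """All terms declared rdfs:subClassOf the given term (its direct 'kinds')."""
        return list(g.subjects(RDFS.subClassOf, term))

    # Placeholder root term: GO:0008150, 'biological_process' in the Gene Ontology.
    root = URIRef("http://purl.obolibrary.org/obo/GO_0008150")
    for child in children(root)[:10]:
        print(label(child), "is a kind of", label(root))

Nothing more than a toy, but it shows that the 'hierarchy of terms' is directly queryable.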

If I run into problems, I could at some point determine whether existing (biological) ontologies are missing data, or whether there are more fundamental issues in matching the human-produced descriptions to the ontologies.

How am I going to do this?

This part is very much subject to change, as I am taking my first steps in the whole semantic web/OWL/RDF world, and also in the Python/NLTK programming world. My current idea is:

Tools

  • Python for text processing
  • RDFLib to read ontologies
  • NLTK for the 'language tasks': stemming words, filtering for keywords, etc. (a small sketch of this follows below)
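
To make that last bullet concrete, here is a rough sketch of the kind of keyword extraction NLTK makes easy: tokenize a description, drop English stopwords, stem what remains and count frequencies. The example description is made up, and the eventual pipeline may look quite different.

    # Rough sketch of the NLTK 'language tasks': tokenize, remove stopwords,
    # stem, and count word frequencies. (Requires the 'punkt' and 'stopwords'
    # NLTK data packages, installable via nltk.download().)
    from nltk import FreqDist, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def keywords(text, n=10):
        """Return the n most frequent stemmed, non-stopword tokens in a text."""
        stemmer = PorterStemmer()
        stop = set(stopwords.words('english'))
        tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
        stems = [stemmer.stem(t) for t in tokens if t not in stop]
        return FreqDist(stems).most_common(n)

    # Made-up example description, just to show the shape of the output.
    description = "Confocal image of HeLa cells, nuclei stained with DAPI."
    print(keywords(description))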

Approach

  1. Scanning the database for literal occurrences of ontology terms (exact matches).
  2. Generating a list of keywords from both the free-form texts and the ontology term descriptions, to try to match those when no literal matches are found. I could try this using a bag-of-words model, removing all 'common' words and keeping the more specific/interesting ones. Another approach is to remove all stopwords from the texts and count the frequency of the remaining words.
  3. Possibly looking at keyphrase extraction instead of simple keywords [or maybe at word collocations/chunk extraction?].
  4. Applying fuzzy word matching to tolerate typos in the texts (see the sketch after this list).
  5. Performing a statistical analysis of the likelihood of the subject. My thought is that more specific (i.e. deeper-nested) ontology terms should weigh more heavily than more general terms, and that I might find clusters of terms (a number of terms more related to each other than to the other terms found) to further narrow down the likely subject matter. But I can imagine that new ideas will emerge once I actually get to this point.
  6. Acquiring some (human-checked) training data so I can optimize the system and see which approaches work best.
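
As a first stab at steps 4 and 5, the sketch below fuzzy-matches words from a description against ontology term labels using only difflib from the standard library, and weights each hit by how deep the matched term sits in the rdfs:subClassOf hierarchy (deeper = more specific). File names and cutoff values are placeholders, and the real weighting scheme will need much more thought.

    # Rough sketch of steps 4 and 5 (placeholder file name and cutoff values):
    # fuzzy-match words against ontology term labels to tolerate typos, and
    # weight each hit by the depth of the matched term in the hierarchy.
    from difflib import get_close_matches
    from rdflib import Graph, RDFS

    g = Graph()
    g.parse("go.owl", format="xml")  # placeholder ontology, as in the earlier sketch

    # Map every rdfs:label (lowercased) to its term.
    label_to_term = {str(lbl).lower(): term
                     for term, lbl in g.subject_objects(RDFS.label)}

    def depth(term, limit=50):
        """Count rdfs:subClassOf steps up to a root; deeper terms are more specific."""
        d = 0
        parents = list(g.objects(term, RDFS.subClassOf))
        while parents and d < limit:  # the limit guards against odd cycles
            d += 1
            parents = list(g.objects(parents[0], RDFS.subClassOf))
        return d

    def match(word, cutoff=0.8):
        """Closest ontology label to a word plus its depth, or None if no match."""
        hits = get_close_matches(word.lower(), list(label_to_term), n=1, cutoff=cutoff)
        if not hits:
            return None
        return hits[0], depth(label_to_term[hits[0]])

    for word in ["nucleus", "mitochondrien"]:  # note the typo in the second word
        print(word, "->", match(word))

Whether depth alone is a sensible weight, or whether it should be combined with term frequency or the cluster idea from step 5, is exactly the kind of question I expect to run into.
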
And that's about as far as I am right now. The real work (new problems and approaches) will probably surface as soon as I get deeper into the material.

And what if it works?

Even though this sounds far away at the moment, I will have to take this scenario into account :p. My idea is to apply the software I will have written to other domains, maybe even the domain I was thinking about earlier (using the web as a source for research: blogs, social media, news sites, wiki/DBpedia, etc.). I already came across the OpenCYC Ontology: "hundreds of thousands of terms, along with millions of assertions relating the terms to each other, forming an ontology whose domain is all of human consensus reality", which sounds pretty damn awesome.

Some quick ideas I had were using this ontology to create some sort of 'semantic recommender system' (for what domain? News sites? Blogs?), or finding some other way to extract meaning from large corpora of text. Anyway, those are ideas for the future, but I hope that I'll be able to play around a bit with different applications by the time I've finished what I'm supposed to do :).