Graduation pt. 4: What’s next

Just a quick update to let myself know what’s going to happen next: it’s time to produce some results! While I was stuck figuring out the best, or rather most practical, way to extract keywords from a text (and not just any text, mind you, but biologists’ notes), my supervisor told me it’s time for hard figures. I decided to scrap POS-tagging the notes to extract valuable phrases after I noticed that the accuracy of the default NLTK POS-tagger was far too low for practical use. Not too surprising, considering the default NLTK tagger is probably not trained on biologists’ notes.

Anyway, we came up with the following tests:

I use two sources:

  1. The first being the biologists’ notes (the Cyttron DB).
  2. The second being specific Wikipedia pages on recurring topics of the Cyttron DB:
    Alzheimer, Apoptosis, Tau protein & Zebrafish.

From these two sources, I will use five different methods of representing the information:

  1. Literal representation (using each word, no edit)
  2. Simple keyword extraction (using word frequency after removing English stopwords)
  3. Bigram collocations
  4. Trigram collocations
  5. Keyword combo (word frequency + bigrams + trigrams)
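
A minimal sketch of how these five representations could be built, using only the Python standard library. The tiny stopword set and the function names are assumptions made for illustration, and plain frequency counting stands in for proper collocation scoring (NLTK's collocation finders with an association measure would be the more serious option).

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; a real run would use a full English list).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "are", "with"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keywords(tokens, n=5):
    """Most frequent tokens that are not stopwords."""
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

def ngrams(tokens, size, n=5):
    """Most frequent n-grams without stopwords (a crude stand-in for collocations)."""
    grams = zip(*(tokens[i:] for i in range(size)))
    counts = Counter(g for g in grams if not any(t in STOPWORDS for t in g))
    return [" ".join(g) for g, _ in counts.most_common(n)]

def representations(text):
    """Build the five representations of one source text."""
    tokens = tokenize(text)
    kw, bi, tri = keywords(tokens), ngrams(tokens, 2), ngrams(tokens, 3)
    return {
        "literal": tokens,      # every word, unedited
        "keywords": kw,         # word frequency minus stopwords
        "bigrams": bi,
        "trigrams": tri,
        "combo": kw + bi + tri, # keywords + bigrams + trigrams
    }

reps = representations("Tau protein aggregates in neurons. Tau protein is linked to Alzheimer.")
print(reps["bigrams"])
```

Note that this sketch ignores sentence boundaries, so an n-gram can span the end of one sentence and the start of the next.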

Each of these ways of representing the source information can then be ‘boosted’ by using WordNet to generate synonyms, doubling the number of representations (2×5=10!).
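
A rough sketch of what such a synonym ‘boost’ could look like. To keep it runnable without any corpus download, the lookup is passed in as a function; `toy_lookup` and its synonym table are made up for illustration. With NLTK and its WordNet data installed, the lookup could instead collect `lemma_names()` from every synset returned by `wordnet.synsets(term)`.

```python
def boost(terms, synonyms_for):
    """Return the terms plus their synonyms, in order and without duplicates."""
    boosted = []
    seen = set()
    for term in terms:
        for candidate in [term, *synonyms_for(term)]:
            if candidate not in seen:
                seen.add(candidate)
                boosted.append(candidate)
    return boosted

# Hypothetical synonym table standing in for a real WordNet lookup.
TOY_SYNONYMS = {"cell": ["cubicle"], "death": ["demise", "dying"]}

def toy_lookup(term):
    """Return the made-up synonyms of a term, or an empty list."""
    return TOY_SYNONYMS.get(term, [])

boosted = boost(["cell", "death", "cell"], toy_lookup)
print(boosted)
```

Keeping the original term first and dropping duplicates means a boosted representation always contains its unboosted counterpart.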

With these 10 different representations of each of the two data sources (2×10), I will use 3 ways to try to determine the subject:

  1. Literal label matching using two different sets of ontologies:
    1. Cyttron-set: Gene Ontology, Human Disease Ontology, Mouse Pathology & National Cancer Institute Thesaurus
    2. DBPedia ontology
  2. Matching the sources to descriptions of ontology terms, using the same two sets of ontologies.
  3. If I manage: Matching the data sources to the ‘context’ of ontology terms.
    I started working on a method to take a term in an ontology and explore its surrounding nodes. I will collect all ‘literals’ attached to a node, and throw them in a big pile of text. I will then use this pile of text as a bag of words, to match to the datasources.
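
To make the first and third method a bit more concrete, here is a small sketch under heavy assumptions: the mini-ontology, its node ids and the function names are all invented for illustration, and a plain dict stands in for a real ontology store. It shows literal label matching and the ‘pile of text’ idea, where the literals of a node and its direct neighbours are thrown into one bag of words.

```python
from collections import Counter

# Hypothetical mini-ontology: node id -> label, description and neighbouring node ids.
ONTOLOGY = {
    "X:1": {"label": "apoptosis", "description": "programmed cell death",
            "neighbours": ["X:2"]},
    "X:2": {"label": "cell death", "description": "the process by which a cell dies",
            "neighbours": ["X:1", "X:3"]},
    "X:3": {"label": "tau protein", "description": "a protein that stabilises microtubules",
            "neighbours": []},
}

def literal_matches(text, ontology):
    """Node ids whose label occurs verbatim (as a substring) in the text."""
    lowered = text.lower()
    return [node for node, data in ontology.items() if data["label"] in lowered]

def context_bag(node, ontology):
    """Bag of words built from the literals of a node and its direct neighbours."""
    words = []
    for n in [node, *ontology[node]["neighbours"]]:
        words += ontology[n]["label"].split() + ontology[n]["description"].split()
    return Counter(words)

def overlap(terms, bag):
    """Sum the bag counts of each distinct term."""
    return sum(bag[t] for t in set(terms))

matches = literal_matches("Tau protein and apoptosis in zebrafish", ONTOLOGY)
bag = context_bag("X:1", ONTOLOGY)
score = overlap(["cell", "death", "zebrafish"], bag)
print(matches, score)
```

Substring matching is the simplest possible choice and will also hit labels that appear inside longer words; matching on token boundaries would be stricter.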

This will bring the total number of tests to be done to 120:

  2 sources (wiki/cyttron)
 10 representations of these sources
  3 methods (literal/desc/context)
  2 ontologies (cyttron-set/dbpedia)
 -------------------------------------
  2 * 10 * 3 * 2 = 120
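
The same count can be checked by enumerating every combination. This is only a sketch: the short names below are labels picked for illustration, and the `+wordnet` suffix is an assumed way of marking the boosted variants.

```python
from itertools import product

sources = ["wiki", "cyttron"]
base = ["literal", "keywords", "bigrams", "trigrams", "combo"]
# Each base representation exists in a plain and a WordNet-boosted variant.
variants = [name + suffix for name in base for suffix in ("", "+wordnet")]
methods = ["literal", "desc", "context"]
ontologies = ["cyttron-set", "dbpedia"]

# One tuple per test: (source, representation, method, ontology set).
tests = list(product(sources, variants, methods, ontologies))
print(len(tests))
```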

And in-between I also have Lowlands 2011 and Vollt to attend. Oh gosh…

