Graduation pt. 4: What's next • David Graus

Just a quick update to let myself know what’s going to happen next: it’s time to produce some results! While I was getting quite stuck figuring out the best – or rather, most practical – way to extract keywords from a text (and not just any text, mind you, but the notes of biologists), my supervisor told me it’s time to get some results. Hard figures. I decided to scrap POS-tagging the notes to extract valuable phrases after I noticed that the accuracy of the default NLTK POS-tagger was far too low for practical use. Not too surprising, considering the default NLTK tagger is probably not trained on biologists’ notes.

Anyway, we came up with the following tests:

I use two sources:

The first being the biologists’ notes (the Cyttron DB).
The second being specific Wikipedia pages on recurring topics of the Cyttron DB:
Alzheimer, Apoptosis, Tau protein & Zebrafish.

From these two sources, I will use five different methods of representing the information:

Literal representation (using each word, no edit)
Simple keyword extraction (using word frequency after removing English stopwords)
Bigram collocations
Trigram collocations
Keyword combo (word frequency + bigrams + trigrams)
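A minimal sketch of how these five representations could be produced. It uses only the standard library, a tiny made-up stopword list, and raw n-gram frequency as a stand-in for proper collocation scoring (NLTK's collocation finders rank by association measures instead). All function names here are hypothetical, not taken from the actual thesis code.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller English list.
STOPWORDS = {"the", "of", "in", "a", "an", "and", "is", "are", "to", "with"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keywords(tokens, n=5):
    """Most frequent non-stopword tokens."""
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

def ngrams(tokens, size, n=5):
    """Most frequent n-grams of the given size that contain no stopwords."""
    grams = zip(*(tokens[i:] for i in range(size)))
    counts = Counter(g for g in grams if not any(t in STOPWORDS for t in g))
    return [" ".join(g) for g, _ in counts.most_common(n)]

def representations(text):
    """Build the five representations of one source text."""
    tokens = tokenize(text)
    kw = keywords(tokens)
    bi = ngrams(tokens, 2)
    tri = ngrams(tokens, 3)
    return {
        "literal": tokens,         # every word, unedited
        "keywords": kw,            # word frequency minus stopwords
        "bigrams": bi,             # frequent word pairs
        "trigrams": tri,           # frequent word triples
        "combo": kw + bi + tri,    # all of the above together
    }
```

Calling `representations(...)` on a note or a Wikipedia page returns a dictionary with one entry per representation, which keeps the later matching steps independent of how each representation was built.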

Each of these ways of representing the source information can then be ‘boosted’ by using WordNet to generate synonyms, doubling the ways of representing the data (2×5=10!).
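The ‘boosting’ step could look roughly like the sketch below. To keep it self-contained, a small hand-made synonym table stands in for WordNet; the table entries and the `boost` function are assumptions for illustration only. With NLTK installed, the lookup could instead collect `lemma_names()` from `wordnet.synsets(term)`.

```python
# Hand-made stand-in for WordNet; these entries are illustrative only.
TOY_SYNONYMS = {
    "apoptosis": ["programmed cell death"],
    "zebrafish": ["zebra fish"],
}

def boost(terms, lookup=TOY_SYNONYMS.get):
    """Return the terms followed by any synonyms found, without duplicates."""
    boosted = list(terms)
    seen = set(terms)
    for term in terms:
        # lookup returns None for unknown terms, so fall back to an empty list.
        for synonym in lookup(term) or []:
            if synonym not in seen:
                seen.add(synonym)
                boosted.append(synonym)
    return boosted
```

Keeping the original terms first and appending synonyms afterwards means the boosted and unboosted variants of a representation stay directly comparable.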

With these 10 different representations of each of the two data sources (2×10), I will use 3 ways to try to determine the subject:

Literal label matching using two different sets of ontologies:

Cyttron-set: Gene Ontology, Human Disease Ontology, Mouse Pathology & National Cancer Institute Thesaurus
DBPedia ontology

Matching the sources to descriptions of ontology terms, using the same two sets of ontologies.
If I manage: Matching the data sources to the ‘context’ of ontology terms.
I started working on a method to take a term in an ontology and explore its surrounding nodes. I will collect all ‘literals’ attached to a node, and throw them in a big pile of text. I will then use this pile of text as a bag of words, to match to the data sources.
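The ‘context’ idea can be sketched as follows, assuming the ontology is available as a plain dictionary of nodes with their literals and neighbours. The toy ontology, the `depth` parameter and the simple overlap score are all assumptions made for this example, not the method that was eventually implemented.

```python
import re
from collections import Counter

# Toy ontology fragment; node names, literals and links are made up.
ONTOLOGY = {
    "apoptosis": {"literals": ["apoptosis", "programmed cell death"],
                  "neighbours": ["cell death", "caspase"]},
    "cell death": {"literals": ["cell death", "death of a cell"],
                   "neighbours": ["apoptosis"]},
    "caspase": {"literals": ["caspase", "protease involved in apoptosis"],
                "neighbours": ["apoptosis"]},
    "zebrafish": {"literals": ["zebrafish", "small freshwater fish"],
                  "neighbours": []},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def context_bag(ontology, term, depth=1):
    """Bag of words from the literals of a term and all nodes within `depth` hops."""
    seen = {term}
    frontier = [term]
    bag = Counter()
    for _ in range(depth + 1):
        next_frontier = []
        for node in frontier:
            for literal in ontology[node]["literals"]:
                bag.update(tokenize(literal))
            for neighbour in ontology[node]["neighbours"]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return bag

def overlap(source_tokens, bag):
    """Number of source tokens that also occur in the context bag."""
    return sum(1 for token in source_tokens if token in bag)

def best_match(ontology, text, depth=1):
    """Ontology term whose context bag shares the most tokens with the text."""
    tokens = tokenize(text)
    return max(ontology, key=lambda t: overlap(tokens, context_bag(ontology, t, depth)))
```

With `depth=0` the bag contains only the term's own literals, which is close to matching on labels and descriptions; raising `depth` pulls in more of the surrounding nodes at the cost of a noisier pile of text.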

This will bring the total number of tests to be done to 120:

  2 sources (wiki/cyttron)
 10 representations of these sources
  3 methods (literal/desc/context)
  2 ontologies (cyttron-set/dbpedia)
 -------------------------------------
  2 * 10 * 3 * 2 = 120
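The same count can be checked by enumerating the test matrix. The labels below are shorthand invented for this sketch; only the counts come from the plan above.

```python
from itertools import product

sources = ["wiki", "cyttron"]
methods = ["literal", "keywords", "bigrams", "trigrams", "combo"]
# Each of the five methods exists with and without WordNet boosting: 5 * 2 = 10.
representations = [(m, boosted) for m in methods for boosted in (False, True)]
matchers = ["literal", "desc", "context"]
ontologies = ["cyttron-set", "dbpedia"]

# One tuple per test: (source, (method, boosted), matcher, ontology set).
tests = list(product(sources, representations, matchers, ontologies))
print(len(tests))  # 120
```

A list like this can also double as a checklist for recording the result of each run.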

And in-between I also have Lowlands 2011 and Vollt to attend. Oh gosh…

