Just a quick update to let myself know what’s going to happen next: It’s time to produce some results! While I was getting quite stuck figuring out the best (or rather, most practical) way to extract keywords from a text (and not just any text, mind you, but biologists’ notes), my supervisor told me it’s time to get some results. Hard figures. I decided to scrap POS-tagging the notes to extract valuable phrases, after I noticed that the accuracy of the default NLTK POS-tagger was far too low for practical use. Not too surprising, considering the default NLTK tagger is probably not trained on biologists’ notes.
Anyway, we came up with the following tests:
I use two sources:
- The first being the biologists’ notes (the Cyttron DB).
- The second being specific Wikipedia pages on recurring topics of the Cyttron DB:
Alzheimer, Apoptosis, Tau protein & Zebrafish.
From these two sources, I will use five different methods of representing the information:
- Literal representation (using each word, no edit)
- Simple keyword extraction (using word frequency after removing English stopwords)
- Bigram collocations
- Trigram collocations
- Keyword combo (word frequency + bigrams + trigrams)
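To make the five representations a bit more concrete, here is a minimal sketch in plain Python. It is not the code I will actually run: the stopword list is a tiny illustrative one, the function names are made up for this example, and the "collocations" are ranked by raw frequency rather than by a proper association measure.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller English list.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "are", "by", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keywords(tokens, n=5):
    """Return the n most frequent tokens that are not stopwords."""
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]


def ngrams(tokens, size):
    """Return all consecutive token tuples of the given size."""
    return list(zip(*(tokens[i:] for i in range(size))))


def top_ngrams(tokens, size, n=5):
    """Return the n most frequent n-grams that neither start nor end with a stopword.

    Ranking is by raw count, a simplification of real collocation scoring.
    """
    counts = Counter(
        g for g in ngrams(tokens, size)
        if g[0] not in STOPWORDS and g[-1] not in STOPWORDS
    )
    return [" ".join(g) for g, _ in counts.most_common(n)]


def represent(text, n=5):
    """Build the five representations of a source text."""
    tokens = tokenize(text)
    kw = keywords(tokens, n)
    bi = top_ngrams(tokens, 2, n)
    tri = top_ngrams(tokens, 3, n)
    return {
        "literal": tokens,        # every word, no edit
        "keywords": kw,           # word frequency minus stopwords
        "bigrams": bi,
        "trigrams": tri,
        "combo": kw + bi + tri,   # keyword combo
    }
```

Calling `represent(text)` on a note or a Wikipedia page gives back one dictionary with all five representations, so the later matching steps can loop over them.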
Each of these ways of representing the source information can then be ‘boosted’ by using WordNet to generate synonyms, doubling the ways of representing the data (2×5=10!).
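A rough sketch of what such a "boost" could look like. The synonym lookup here is a hypothetical toy dictionary standing in for WordNet, and `boost` and `toy_lookup` are names invented for this example; with NLTK, the lookup could presumably be built from `wordnet.synsets(term)` and the `lemma_names()` of each synset.

```python
def boost(terms, synonyms_of):
    """Return the terms plus any new synonyms, keeping order and dropping duplicates.

    `synonyms_of` is any callable mapping a term to a list of synonyms.
    """
    seen = set()
    boosted = []
    for term in terms:
        for candidate in [term, *synonyms_of(term)]:
            if candidate not in seen:
                seen.add(candidate)
                boosted.append(candidate)
    return boosted


# Hypothetical stand-in for a WordNet lookup (toy data, not real WordNet output).
TOY_SYNONYMS = {
    "cell death": ["apoptosis"],
    "zebrafish": ["danio rerio"],
}


def toy_lookup(term):
    """Look a term up in the toy synonym table; unknown terms have no synonyms."""
    return TOY_SYNONYMS.get(term, [])
```

Because the lookup is passed in as a function, the same `boost` can be applied to any of the five representations without changing them.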
With these 10 different representations of each of the two data sources (2×10), I will use 3 ways to try to determine the subject:
- Literal label matching, using two different sets of ontologies:
  - Cyttron-set: Gene Ontology, Human Disease Ontology, Mouse Pathology & National Cancer Institute Thesaurus
  - DBPedia ontology
- Matching the sources to descriptions of ontology terms, using the same two sets of ontologies.
- If I manage: Matching the data sources to the ‘context’ of ontology terms.
I started working on a method to take a term in an ontology and explore its surrounding nodes. I will collect all ‘literals’ attached to those nodes and throw them in a big pile of text. I will then use this pile of text as a bag of words to match against the data sources.
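The neighbourhood idea could be sketched roughly like this. Everything here is an assumption for illustration: the ontology is a hand-made dictionary rather than a real RDF graph, `context_words` and `overlap_score` are invented names, and the overlap measure (share of distinct source words found in the bag) is just one simple option.

```python
import re

# Hypothetical miniature ontology: each node has literals and neighbouring nodes.
ONTOLOGY = {
    "apoptosis": {
        "literals": ["apoptosis", "programmed cell death"],
        "neighbours": ["cell death"],
    },
    "cell death": {
        "literals": ["cell death", "death of a cell"],
        "neighbours": ["apoptosis", "necrosis"],
    },
    "necrosis": {
        "literals": ["necrosis", "unprogrammed tissue death"],
        "neighbours": ["cell death"],
    },
}


def context_words(graph, start, depth=1):
    """Collect the words of all literals on nodes within `depth` hops of `start`."""
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbour in graph[node]["neighbours"]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    bag = set()
    for node in seen:
        for literal in graph[node]["literals"]:
            bag.update(re.findall(r"[a-z]+", literal.lower()))
    return bag


def overlap_score(source_text, bag):
    """Return the fraction of distinct source words that occur in the bag."""
    source = set(re.findall(r"[a-z]+", source_text.lower()))
    return len(source & bag) / len(source) if source else 0.0
```

The `depth` parameter controls how big the pile of text gets: one hop keeps it close to the term itself, more hops pull in more distant (and probably noisier) context.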
This will bring the total number of tests to be done to 120:
2 sources (wiki/cyttron)
10 representations of these sources
3 methods (literal/desc/context)
2 ontologies (cyttron-set/dbpedia)
-------------------------------------
2 * 10 * 3 * 2 = 120
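The same count can be checked by simply enumerating every combination. A small sketch, where the labels are shorthand I made up for the options listed above:

```python
from itertools import product

sources = ["wiki", "cyttron"]
base = ["literal", "keywords", "bigrams", "trigrams", "combo"]
# Each base representation also gets a WordNet-boosted variant: 5 * 2 = 10.
representations = base + [name + "+wordnet" for name in base]
methods = ["literal", "desc", "context"]
ontologies = ["cyttron-set", "dbpedia"]

# Every (source, representation, method, ontology) combination is one test.
tests = list(product(sources, representations, methods, ontologies))
print(len(tests))
```

Having the tests as a list of tuples also makes it easy to loop over them and store one result per combination.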