📅 July 29, 2011 • 🕐 23:58 • 🏷 Thesis (MSc)

So, I am well on my way to finalizing the first part of my graduation project, the information extraction part. To reiterate, I am currently working on matching the textual content of a database to that of several ontology files (big dictionaries containing loads of ‘things’ with the relations between them defined). This is a flow-chart of the system I’m planning to build:

In the area of ‘preparing’ the data of both the database and the ontologies, I am currently here:

Prepping ontology data

  • Extract all labels from ontology files [names of ‘things’]
  • Extract all descriptions from an ontology file [descriptions of ‘things’]
  • POS-tag each word in each ontology description [identify words: which are nouns, which are adjectives etc.]
    • Extract specific word types [‘generate keywords’: for example extract all nouns from description].
  • Extract word collocations from ontology descriptions [pairs of words which occur frequently together]
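To make the collocation step a bit more concrete, here is a minimal sketch that simply counts adjacent word pairs across descriptions. It is an illustration only, not the code I actually run: the function names, the `min_count` cut-off and the example descriptions are all made up, and raw frequency is a stand-in for a proper association measure.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def frequent_bigrams(descriptions, min_count=2):
    """Count adjacent word pairs across all descriptions and keep the frequent ones."""
    counts = Counter()
    for description in descriptions:
        tokens = tokenize(description)
        # zip(tokens, tokens[1:]) yields each pair of neighbouring words
        counts.update(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]

# Hypothetical descriptions, invented for this example
descriptions = [
    "A cell wall surrounds the plant cell.",
    "The cell wall gives the cell its shape.",
]
collocations = frequent_bigrams(descriptions)
```

Counting alone also surfaces pairs that start with a stop word (like "the cell"), so in practice some filtering or a statistical score would be needed on top of this.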

Prepping database data

  • Store all entries of the database
  • Remove all duplicate entries in the database
  • Extract word collocations from database entries
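Removing duplicates can be sketched as follows. This assumes that entries differing only in case or whitespace count as duplicates, which is my assumption for the example and may be too loose or too strict for the real data; the entries themselves are invented.

```python
def normalize(entry):
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(entry.lower().split())

def remove_duplicates(entries):
    """Keep the first occurrence of each entry, preserving the original order."""
    seen = set()
    unique = []
    for entry in entries:
        key = normalize(entry)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

# Hypothetical database entries, invented for this example
entries = ["Quercus robur leaf", "quercus  robur leaf", "Fagus sylvatica bark"]
unique_entries = remove_duplicates(entries)
```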

Matching content

  • Ontology labels occurring in a database entry.
    • Using fuzzy word matching! (allows for minor typos)
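For the curious, fuzzy label matching could look roughly like this. It is a sketch using `difflib` from Python's standard library, not necessarily the matcher I use; the function name `fuzzy_contains`, the 0.85 threshold and the example texts are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def fuzzy_contains(label, entry, threshold=0.85):
    """Return True if some run of words in `entry` is close enough to `label`."""
    label_tokens = tokenize(label)
    entry_tokens = tokenize(entry)
    size = len(label_tokens)
    if size == 0 or len(entry_tokens) < size:
        return False
    target = " ".join(label_tokens)
    # Slide a window with as many words as the label over the entry
    for start in range(len(entry_tokens) - size + 1):
        window = " ".join(entry_tokens[start:start + size])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False

# Hypothetical entry containing a typo ("chloroplats")
entry = "Sample with damaged chloroplats in leaf tissue"
typo_match = fuzzy_contains("chloroplast", entry)
no_match = fuzzy_contains("mitochondrion", entry)
```

The threshold is the knob to play with: set it too low and unrelated words start matching, set it too high and real typos are missed.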

Left TODO [for the information extraction bit]:

  • Figure out how to handle matching ontology descriptions to db-entries. The idea is to perform such a comparison when I find no literal match on ontology names. The ontology descriptions are obviously ‘broader’ than a single term, so if I cannot find a literal match, I can try to find several words which occur both in the db-entry and in the description of an ontology ‘thing’. If enough words overlap, I might be able to conclude that the ‘thing’ described in the ontology is likely to be the subject of the db-entry.
  • Find a solution to the problem that I cannot parse large OWL files with RDFLib. Either I try rdfextras’ local Store solutions, or I give Sesame a shot to store graphs locally, and use RDFLib to query the local graphs.
  • Figure out the best method of keyword extraction. At the moment I have two approaches: either extracting word collocations from a text or extracting all nouns from a text. I could also extract specific phrases instead of simply getting all nouns. To see what works best I will have to get some training data, by asking biologists to manually select the most important keywords from entries of the database. Using this data as a reference (text and important keywords) I will measure the effectiveness of the different approaches, and ultimately choose the most successful one.
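I have not settled on how to measure ‘effectiveness’ yet. One obvious option would be precision, recall and F1 against the keywords picked by the biologists. A minimal sketch of that idea, with invented keyword sets and a hypothetical function name:

```python
def precision_recall_f1(extracted, reference):
    """Compare extracted keywords with a manually selected reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical data: keywords a biologist might pick for one entry,
# and the output of the two extraction approaches for that same entry
reference = {"leaf", "chloroplast", "tissue"}
noun_keywords = {"sample", "leaf", "tissue", "chloroplast"}
collocation_keywords = {"leaf tissue"}

noun_scores = precision_recall_f1(noun_keywords, reference)
collocation_scores = precision_recall_f1(collocation_keywords, reference)
```

Exact string comparison is strict: a collocation like "leaf tissue" scores nothing against the single words "leaf" and "tissue", so a fair comparison between approaches would probably need some form of partial matching.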

And next!

  • Figure out how to weigh and relate multiple literal matches within one ontology.
  • Figure out how to handle literal matches that span multiple ontologies (how do I cross-relate the terms to one another?).

Till next time!