Graduation Project pt. 2

I am well on my way to finalizing the first part of my graduation project: the information extraction part. To reiterate, I am currently working on matching the textual content of a database to that of several ontology files (big dictionaries containing loads of ‘things’ with the relations between them defined). This is a flow chart of the system I’m planning to build:

In the area of ‘preparing’ the data of both the database and the ontologies, I am currently here:

Prepping ontology data

Extract all labels from ontology files [names of ‘things’]
Extract all descriptions from an ontology file [descriptions of ‘things’]
POS-tag each word in each ontology description [identify words: which are nouns, which are adjectives etc.]

Extract specific word types [‘generate keywords’: for example extract all nouns from description].

Extract word collocations from ontology descriptions [pairs of words which occur frequently together]
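To make the collocation step a bit more concrete, here is a minimal sketch that only counts adjacent word pairs and keeps the ones that occur at least twice. It uses nothing but the Python standard library; the function names, the `min_count` threshold and the example descriptions are made up for illustration, and a proper collocation finder would normally score pairs with an association measure rather than raw frequency.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequent_bigrams(texts, min_count=2):
    """Count adjacent word pairs over all texts and keep the frequent ones.

    Hypothetical helper: raw frequency is a crude stand-in for a real
    collocation score.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip pairs every token with its right-hand neighbour
        counts.update(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Made-up ontology descriptions, purely for illustration
descriptions = [
    "A red blood cell carries oxygen.",
    "The red blood cell has no nucleus.",
    "A white blood cell fights infection.",
]
collocations = frequent_bigrams(descriptions)
print(collocations)
```

With these three descriptions only ‘red blood’ and ‘blood cell’ pass the threshold; every other pair occurs just once.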

Prepping database data

Store all entries of the database
Remove all duplicate entries in the database
Extract word collocations from database entries
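Removing duplicates can be sketched in a few lines. This version assumes (my assumption, not necessarily how the real database should be treated) that two entries count as duplicates when they are equal after lowercasing and collapsing whitespace; the function name and the sample entries are made up.

```python
def remove_duplicates(entries):
    """Return the entries without duplicates, keeping the first occurrence.

    Assumption: entries that differ only in case or whitespace are
    treated as the same entry.
    """
    seen = set()
    unique = []
    for entry in entries:
        # Normalised form used only for comparison, not for storage
        key = " ".join(entry.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Made-up database entries, purely for illustration
entries = ["Oak bark beetle", "oak  bark beetle ", "Pine weevil"]
print(remove_duplicates(entries))
```

The original spelling of the first occurrence is kept, so nothing is lost if the normalisation turns out to be too aggressive.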

Matching content

Ontology labels occurring in a database entry.

Using fuzzy word matching! (allows for minor typos)
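As an illustration of the fuzzy matching idea, here is a small sketch built on `difflib.SequenceMatcher` from the standard library. It slides a window of as many words as the label has over the database entry and accepts the first window that is similar enough. The function name, the `threshold` of 0.85 and the example strings are assumptions for the sake of the example, not the settings I actually use.

```python
import re
from difflib import SequenceMatcher


def fuzzy_label_in_entry(label, entry, threshold=0.85):
    """Return True if some run of words in `entry` is close enough to `label`.

    Hypothetical helper: `threshold` is an arbitrary similarity cut-off
    between 0 and 1 and would need tuning on real data.
    """
    label_words = label.lower().split()
    if not label_words:
        return False
    entry_words = re.findall(r"\w+", entry.lower())
    n = len(label_words)
    target = " ".join(label_words)
    # Compare the label against every window of n consecutive words
    for i in range(len(entry_words) - n + 1):
        window = " ".join(entry_words[i:i + n])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False


# Made-up label and entries; note the typo 'chlorophyl'
print(fuzzy_label_in_entry("chlorophyll", "Leaves lose chlorophyl in autumn"))
print(fuzzy_label_in_entry("chlorophyll", "Leaves turn yellow in autumn"))
```

A lower threshold tolerates more typos but also produces more false matches, so the value is a trade-off rather than something that can be fixed up front.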

Left TODO [for the information extraction bit]:

Figure out how to handle matching ontology-descriptions to db-entries. The idea is to perform such a comparison if I find no literal (ontology-names) matches. The ontology descriptions are obviously ‘broader’ than a single term, so if I cannot find a literal match, I can try to find several words which occur both in the db-entry and in a description of an ontology-‘thing’. I might be able to conclude the ‘thing’ described in the ontology is likely to be the subject of the db-entry.
Find a solution to the problem that I cannot parse large OWL files with RDFLib. Either I try rdfextras’ local Store solutions, or I give Sesame a shot to store graphs locally, and use RDFLib to query the local graphs.
Figure out the best method of keyword extraction. At the moment I have two approaches: either extracting word collocations from a text or extracting all nouns from a text. I could also extract specific phrases instead of simply getting all nouns. To see what works best I will have to get some reference data, by asking biologists to manually select the most important keywords from a sample of database entries. Using this data as a reference (text and important keywords) I will measure the effectiveness of the different approaches, and ultimately choose the most successful one.
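The comparison of keyword extraction approaches could be scored with precision and recall against the keywords the biologists pick. The sketch below shows the idea; the function name and all keyword sets are invented for illustration, and whether precision, recall or some combination of the two is the right yardstick is still an open choice.

```python
def precision_recall(extracted, reference):
    """Score extracted keywords against manually selected reference keywords.

    precision: share of extracted keywords that the reference also contains
    recall:    share of reference keywords that were extracted
    """
    extracted, reference = set(extracted), set(reference)
    hits = extracted & reference
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


# Made-up data: keywords a biologist might pick for one entry,
# and the output of two hypothetical extraction approaches
reference = {"beetle", "oak", "bark"}
all_nouns = {"beetle", "oak", "bark", "specimen"}
collocation_words = {"oak", "bark"}

print("nouns:", precision_recall(all_nouns, reference))
print("collocations:", precision_recall(collocation_words, reference))
```

In this toy case extracting all nouns finds every reference keyword but also adds noise, while the collocation approach is cleaner but misses one keyword, which is exactly the kind of trade-off the real evaluation should expose.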

And next!

Figure out how to weigh and relate multiple literal matches within one ontology.
Figure out how to handle literal matches that span multiple ontologies (how do I cross-relate the terms to one another?).

Till next time!
