So, I am well on my way to finishing the first part of my graduation project: the information extraction part. To reiterate, I am currently working on matching the textual content of a database to that of several ontology files (big dictionaries containing loads of ‘things’ with relations defined). This is a flow chart of the system I’m planning to build:
In the area of ‘preparing’ the data of both the database and the ontologies, I am currently here:
Prepping ontology data
- Extract all labels from ontology files [names of ‘things’]
- Extract all descriptions from ontology files [descriptions of ‘things’]
- POS-tag each word in each ontology description [identify words: which are nouns, which are adjectives, etc.]
- Extract specific word types [‘generate keywords’: for example, extract all nouns from a description]
- Extract word collocations from ontology descriptions [pairs of words which occur frequently together]
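To make the collocation step concrete, here is a minimal sketch that simply counts adjacent word pairs across a set of descriptions and keeps the pairs that occur often. It assumes plain frequency counting with a small hand-written stopword list; the function names, the stopword list and the sample descriptions are made up for illustration, and a toolkit such as NLTK offers more refined association measures for the same job.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequent_bigrams(descriptions, min_count=2):
    """Count adjacent word pairs over all descriptions and keep the frequent ones."""
    counts = Counter()
    for description in descriptions:
        tokens = tokenize(description)
        for first, second in zip(tokens, tokens[1:]):
            # Skip pairs that contain a stopword; they rarely make useful keywords.
            if first in STOPWORDS or second in STOPWORDS:
                continue
            counts[(first, second)] += 1
    # most_common() sorts by count, highest first.
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Made-up descriptions, only to show the shape of the result.
descriptions = [
    "A red blood cell transports oxygen.",
    "The red blood cell has no nucleus.",
    "A white blood cell fights infection.",
]
print(frequent_bigrams(descriptions))
```

Raw counts favour pairs of common words, so on real ontology descriptions the threshold `min_count` and the stopword list would need tuning.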
Prepping database data
- Store all entries of the database
- Remove all duplicate entries in the database
- Extract word collocations from database entries
- Find ontology labels occurring in a database entry.
- Using fuzzy word matching! (allows for minor typos)
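As an illustration of the de-duplication and fuzzy label matching steps, here is a small sketch using only Python's standard `difflib` module. It assumes a similarity threshold of 0.85 and compares each label against every window of the entry with the same number of words; the function names, the threshold and the sample data are assumptions made for the example, not the actual implementation.

```python
import re
from difflib import SequenceMatcher


def dedupe(entries):
    """Remove duplicate entries while keeping the original order."""
    return list(dict.fromkeys(entries))


def fuzzy_label_matches(entry, labels, threshold=0.85):
    """Return the labels that occur (approximately) in the entry.

    Each label is compared with every run of words in the entry that has
    the same number of words as the label, so small typos still match.
    """
    words = re.findall(r"\w+", entry.lower())
    matches = []
    for label in labels:
        label_words = label.lower().split()
        size = len(label_words)
        target = " ".join(label_words)
        for start in range(len(words) - size + 1):
            window = " ".join(words[start:start + size])
            # ratio() is 1.0 for identical strings and drops as they diverge.
            if SequenceMatcher(None, window, target).ratio() >= threshold:
                matches.append(label)
                break  # one hit per label is enough
    return matches


# Made-up database entries and ontology labels, for illustration only.
entries = [
    "Observed a hummingbrid feeding on nectar",
    "Observed a hummingbrid feeding on nectar",
    "Oak tree near the river",
]
labels = ["hummingbird", "oak tree", "salmon"]

for entry in dedupe(entries):
    print(entry, "->", fuzzy_label_matches(entry, labels))
```

A lower threshold tolerates more typos but also produces more false matches, especially for short labels, so the value is worth checking against real entries.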
Left TODO [for the information extraction bit]:
- Figure out how to handle matching ontology descriptions to db-entries. The idea is to perform such a comparison only if I find no literal (ontology-name) matches. The ontology descriptions are obviously ‘broader’ than a single term, so if I cannot find a literal match, I can try to find several words which occur both in the db-entry and in the description of an ontology-‘thing’. If enough words overlap, I might be able to conclude that the ‘thing’ described in the ontology is likely to be the subject of the db-entry.
- Find a solution to the problem that I cannot parse large OWL files with rdflib. Either I try rdfextras’ local Store solutions, or I give Sesame a shot to store graphs locally, and use rdflib to query the local graphs.
- Figure out the best method of keyword extraction. At the moment I have two approaches: extracting word collocations from a text, or extracting all nouns from a text. I could also extract specific phrases instead of simply getting all nouns. To see what works best I will have to get some reference data, by asking biologists to manually select the most important keywords from a sample of database entries. Using this data as a reference (text and important keywords), I will measure the effectiveness of the different approaches and ultimately choose the most successful one.
- Figure out how to weigh and relate multiple literal matches within one ontology.
- Figure out how to handle literal matches spanning multiple ontologies (how do I cross-relate the terms to one another?).
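For the first TODO item (falling back to ontology descriptions when no literal match is found), one possible starting point is to count the content words a db-entry shares with each description and rank the ontology-‘things’ by that count. The sketch below is only that: it assumes exact word matches without stemming, a small hand-written stopword list, and a minimum of two shared words; all names and sample data are hypothetical.

```python
import re

# Hypothetical stopword list, kept short for the example.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "that", "from"}


def content_words(text):
    """Return the set of lowercase words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def rank_by_description_overlap(entry, descriptions, min_shared=2):
    """Rank ontology 'things' by the number of words shared with the entry.

    `descriptions` maps a thing's name to its description text. Things that
    share fewer than `min_shared` words with the entry are left out.
    """
    entry_words = content_words(entry)
    ranked = []
    for name, description in descriptions.items():
        shared = entry_words & content_words(description)
        if len(shared) >= min_shared:
            ranked.append((name, sorted(shared)))
    # Most shared words first.
    ranked.sort(key=lambda item: len(item[1]), reverse=True)
    return ranked


# Made-up ontology descriptions and db-entry, for illustration only.
descriptions = {
    "hummingbird": "A small bird that hovers and feeds on nectar from flowers",
    "salmon": "A fish that migrates upstream in rivers to spawn",
}
entry = "Small bird seen hovering near flowers, feeding on nectar"
print(rank_by_description_overlap(entry, descriptions))
```

Because there is no stemming, ‘hovering’ and ‘hovers’ do not count as shared here; restricting the comparison to extracted keywords (nouns or collocations) or adding a stemmer would be natural refinements.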
Till next time!