The last couple of weeks I’ve been diving into the task of entity linking, in the context of a submission to the Text Analysis Conference Knowledge Base Population track (that’s quite a mouthful – TAC KBP from now on): a ‘contest’ in Knowledge Base Population with a standardized task, dataset and evaluation. Before I devote my next post to our submission, let me first explain the task of entity linking in this post :-).
Knowledge Base Population
Knowledge Base Population is an information extraction task of generating knowledge bases from raw, unstructured text. A Knowledge Base is essentially a database which describes unique concepts (things) and contains information about these concepts. Wikipedia is a Knowledge Base: each article represents a unique concept, the article’s body contains information about the concept, the infobox provides ‘structured’ information, and links to other pages provide ‘semantic’ (or rather: relational) information.
Filling a Knowledge Base from raw textual data can be broadly split up into two subtasks: entity linking (identifying unique entities in a corpus of documents to add to the Knowledge Base) and slot filling (finding relations of and between these unique entities, and adding these to the concepts in the KB).
Entity linking is the task of linking a mention (a word or several words) in a document to the concept (thing) it refers to. Not unlike semantic annotation, this task is essentially about defining the meaning of a word by assigning the correct concept to it. Consider these examples:
Bush went to war in Iraq
Bush won over Dukakis
A bush under a tree in the forest
A tree in the forest
A hierarchical tree
It is clear that the bushes and trees in these sentences refer to different concepts. The idea is to link each mention (‘Bush’, ‘bush’, ‘tree’) to the person or thing it refers to. To complete this task, we are given:
- KB: a Wikipedia-derived knowledge base containing concepts, people, places, etc. Each concept has a title, text (the Wikipedia page’s content), some metadata (infobox properties), etc.
- DOC: the (context) document in which the entity we are trying to link occurs (in the case of the examples: the sentence)
- Q: The entity-mention query in the document (the word)
The goal is to identify whether the entity referred to by Q is in the KB. If it is, Q should be linked to that KB entry; if it isn’t, it should be considered a ‘new’ entity. All the new entities should be clustered: when two documents refer to the same ‘new’ entity, this must be represented by assigning the same new ID to both mentions.
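In code, the inputs and the expected output could look roughly like this. This is only a sketch to fix ideas: the class names, the ID formats (‘E0001’, ‘NIL0001’) and the tab-separated answer line are my own assumptions for illustration, not the official TAC KBP format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KBEntry:
    kb_id: str   # hypothetical ID format, e.g. "E0001"
    title: str   # Wikipedia page title
    text: str    # Wikipedia page content


@dataclass(frozen=True)
class Query:
    query_id: str
    mention: str  # Q: the surface string we want to link
    doc: str      # DOC: the context in which the mention occurs


def format_answer(query: Query, kb_id: Optional[str], nil_id: Optional[str]) -> str:
    """Return one answer line: a KB ID if linked, otherwise a NIL cluster ID."""
    if kb_id is not None:
        return f"{query.query_id}\t{kb_id}"
    if nil_id is None:
        raise ValueError("a query that is not in the KB needs a NIL cluster ID")
    return f"{query.query_id}\t{nil_id}"


q1 = Query("Q1", "Bush", "Bush won over Dukakis")
q2 = Query("Q2", "Zorblax", "Zorblax opened a bakery")  # made-up 'new' entity
print(format_answer(q1, "E0001", None))
print(format_answer(q2, None, "NIL0001"))
```

Every query gets exactly one answer: either an existing KB ID, or a NIL ID that is shared by all queries referring to the same ‘new’ entity.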
A common approach
1. Query Expansion: This means finding more surface forms that refer to entity Q. Two approaches:
I: This can be derived from the document itself, using for example ‘coreference resolution’. In this case you try to identify all strings referring to the same entity. In the Bush example, you might find “President Bush” or “George W. Bush” somewhere in the document.
II: This can also be done by using external information sources, such as looking up the word in a database of Wikipedia anchor texts, page titles, redirect strings or disambiguation pages. Using Wikipedia to expand ‘Q=cheese’ could lead to:
Title: Cheese. Redirects: Home cheesemaking, Cheeses, CHEESE, Cheeze, Chees, Chese, Coagulated milk curd. Anchors: cheese, Cheese, cheeses, Cheeses, maturation, CHEESE, 450 varieties of cheese, Queso, Soft cheese, aging process of cheese, chees, cheese factor, cheese wheel, cheis, double and triple cream cheese, formaggi, fromage, hard cheeses, kebbuck, masification, semi-hard to hard, soft cheese, soft-ripened, wheel of cheese, Fromage, cheese making, cheese-making, cheesy, coagulated, curds, grated cheese, hard, lyres, washed rind, wheels.
2. Candidate Generation: For each surface form of Q, try to find KB entries that could be referred to. Simple approaches are searching for Wikipedia titles that contain the surface form, looking through anchor link texts (the texts used to refer to a specific Wikipedia page from another Wikipedia page), and expanding acronyms (if you find a string containing only uppercase letters, try to find a matching word sequence).
3. Candidate Ranking: The final linking step is selecting the most probable candidate from the previous step. A simple approach is comparing the similarity of the context document to each candidate document (Wikipedia page); more advanced approaches involve measuring semantic similarity on higher levels, e.g. by finding ‘related’ entities in the context document.
4. NIL Clustering: Whenever no candidate can be found (or only candidates with a low probability of being the right one – measured in any way), it could be decided that the entity referred to is not in the KB. In this case, the job is to assign a new ID to the entity, and if it is ever referred to in a later document, attach this same ID. This is a matter of (unsupervised) clustering. Successful approaches include simple string similarity (same ‘new’ entities being referred to by the same word), document similarity (using simple comparisons) or more advanced clustering approaches based on topic models such as LDA/HDP.
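To make the four steps a bit more concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not the system of any TAC participant: the three-entry KB, the alias table and the names (`expand`, `candidates`, `cosine`, `link`) are invented for illustration, the capitalised-phrase regex is a crude stand-in for coreference resolution, ranking uses plain bag-of-words cosine similarity, NIL clustering is exact string matching, and the threshold of 0.1 is arbitrary.

```python
import math
import re
from collections import Counter

# Toy KB, invented for illustration: KB ID -> (title, text).
KB = {
    "E1": ("George H. W. Bush", "american president who won the 1988 election over dukakis"),
    "E2": ("George W. Bush", "american president who went to war in iraq in 2003"),
    "E3": ("Shrub", "a shrub or bush is a small woody plant smaller than a tree"),
}
# Toy alias table (think anchor texts and redirects): surface form -> KB IDs.
ALIASES = {"bush": {"E1", "E2", "E3"}, "president bush": {"E1", "E2"}}


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def expand(mention, doc):
    """Step 1: add capitalised phrases in DOC that end with the mention."""
    pattern = r"((?:[A-Z][\w.]*\s+)*" + re.escape(mention) + r"\b)"
    return {mention.lower()} | {m.lower() for m in re.findall(pattern, doc)}


def candidates(forms):
    """Step 2: alias lookup plus titles that contain a surface form."""
    found = set()
    for form in forms:
        found |= ALIASES.get(form, set())
        found |= {kb_id for kb_id, (title, _) in KB.items() if form in title.lower()}
    return found


def cosine(a, b):
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(count * vb[term] for term, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def link(mention, doc, nil_ids, threshold=0.1):
    """Steps 3 and 4: pick the best-scoring candidate, or fall back to a NIL ID."""
    scored = [(cosine(doc, " ".join(KB[c])), c) for c in candidates(expand(mention, doc))]
    best_score, best = max(scored, default=(0.0, None))
    if best is not None and best_score >= threshold:
        return best
    # Same (lowercased) mention string -> same 'new' entity ID.
    return nil_ids.setdefault(mention.lower(), f"NIL{len(nil_ids) + 1:04d}")


nil_ids = {}
print(link("Bush", "Bush went to war in Iraq", nil_ids))           # E2
print(link("Bush", "Bush won over Dukakis", nil_ids))              # E1
print(link("bush", "A bush under a tree in the forest", nil_ids))  # E3
print(link("Zorblax", "Zorblax opened a bakery", nil_ids))         # NIL0001
print(link("Zorblax", "The bakery of Zorblax closed", nil_ids))    # NIL0001
```

Note how little is needed to get the toy examples right, and how fragile it is: the ‘bush under a tree’ case is decided largely by shared function words, which is exactly why real systems use better context features and smarter clustering than this.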
Now this is just a general introduction. If you are interested in the technical side of some common approaches, take a look at the TAC Proceedings, particularly the Overview of the TAC2011 Knowledge Base Population Track [PDF] and the Proceedings Papers. If you are unsure why Wikipedia is an awesome resource for entity linking (or any form of extracting structured information from unstructured text), I’d recommend you read ‘Mining Meaning from Wikipedia’ (Medelyan et al., 2009).
Next post will hopefully be about ILPS’ award-winning entity linking system, so stay tuned ;-).