The goal of this post is to make the research I’m doing understandable to the general public. You know, to explain what I’m doing in a way not my peers, but my parents would understand. In part because the majority of the returning visitors of my blog are composed of my parents, in part because lots of people think it’s a good idea for scientists to blog about their work, and in part because I like blogging. And finally, I suppose, because this research is made possible by people who pay their taxes ;-).
In this post I’ll try to explain the paper ‘Context-Based Entity Linking – University Of Amsterdam at TAC 2012′ I wrote with Edgar Meij, Tom Kenter, Marc Bron and Maarten de Rijke. It will also hopefully provide some basic understanding of machine learning.
Paper: ‘Context-Based Entity Linking – University Of Amsterdam at TAC 2012′ (131.24 KB)
Entity linking is the task of linking a word in a piece of text, to an ‘entity’ or ‘concept’ from a knowledge base (think: Wikipedia). Why would we want to? Because it allows us to automatically detect what is being talked about in a document, as opposed to seeing what words it is composed of. It allows us to generate extra context, it allows us to generate metadata which can improve searching and archiving. It moves one step beyond simple word-based analysis. We want that.
The Text Analysis Conference is a yearly ‘benchmark event’, where a dataset is provided (lots of documents, a knowledge base, and a list of queries, words or ‘entity mentions’ that occur in the documents). I describe the task in more detail here. We participated in this track, by building and modifying a system that was created for entity-linking tweets.
We start by taking our query, and search the knowledge base for entities that match it. Let’s take an example:
Chicago Bears defensive tackle Tank Johnson was sentenced to jail on Thursday for violating probation on a 2005 weapons conviction. A cook county Judge gave Johnson a prison sentence that the Chicago Tribune reported on its website to be 120 days. According to the report, Johnson also was given 84 days of home confinement and fined 2,500 dollars. Johnson, who has a history of gun violations, faced up to one year in prison. “We continue our support of Tank and he will remain a member of our football team,” the Bears said in a statement. “Tank has made many positive changes to better his life. We believe he will continue on this path at the conclusion of his sentence.” A 2004 second-round pick, Johnson pleaded guilty to a misdemeanor gun possession charge in November 2005 and was placed on 18 months probation. Johnson was arrested December 14 when police raided his home and found three handguns, three rifles and more than 500 rounds of ammunition. He pleaded guilty on January 10 to 10 misdemeanor charges stemming from the raid. Two days after his arrest, Johnson was at a Chicago nightclub when his bodyguard and housemate, Willie B. Posey, was killed by gunfire. Johnson required special permission from the court to travel out of state to play in the Super Bowl in Miami on February 4.
For the sake of simplicity, let’s assume we find two candidate entities in our knowledge base: Tank (Military vehicle) and Tank Johnson (football player). For each query-candidate pair, we calculate different statistics, or features. For example, some features of our initial ‘microblog post entity linker’ could be:
- How similar is the query to the title of the candidate?
- Does the query occur in the candidate title?
- Does the candidate title occur in the query?
- How frequently does the query occur in the text of the candidate?
For our example, this would be:
- Tank and Tank Johnson share 4 letters, and differ 7 (7)
- Yes (1)
- No (0)
- 3 times (6)
- Tank and Tank share 4 letters, and differ none (0)
- Yes (0)
- Yes (1)
- 20 times (20)
Given these features, we generate vectors of their values like so:
|Query||Candidate||feature 1||feature 2||feature 3||feature 4|
Given these features (of a multitude of examples, as opposed to just 2), we ‘train’ a machine learning algorithm. Such an algorithm aims to learn patterns from the data it receives, allowing it to predict new data. To train this algorithm, we label our examples, by assigning classes to them.
In our case we have two classes: correct examples (class: 1) and incorrect examples (class: 0). This means that for a machine learning approach, we need ground truth to train our algorithm with. Examples of query-candidate pairs where we know which is correct. Typically, we use data from previous years of the same task, to train our system on. In our example case, we know the correct entity is the 1st, so we label this as ‘correct’. The other entity(ies) is labelled ‘incorrect’.
So, that’s the general approach, but what’s new, in our approach?
We extended our initial approach with two methods that use information from the reference document (as you might have noticed, the previous features were mostly about the query, and the candidate, ignoring the reference document). In this post, I’ll talk about one of those two.
This approach takes advantage of the ‘structure’ of our knowledge base, in our case: hyperlinks between entities. For example: the text of Tank Johnson contains links to other entities like Chicago Bears (the football team Tank played for), Gary, Indiana (Tank’s place of birth), Excalibur nightclub (the place where Tank was arrested), etc. The page for the military vehicle Tank contains links to main gun, battlefield, Tanks in World War I, etc.
We assume there exists some semantic relationship between these entities and Tank Johnson (we don’t care about how exactly they are related), and we try to take advantage of this by seeing whether these ‘related entities’ occur in the reference document. The assumption is that if we find a lot of related entities for a candidate entity, it is likely to be the correct entity.
We generate features that are about these related entities in the reference document. For example, how many related entities do we find? What’s the proportion of these entities over the total amount of related entities of the candidate? We do so by searching the reference documents for surface forms of the related entities: titles of related entities, but also anchor texts (the text in blue, above) which allows us to calculate statistics to approximate the likeliness of a surface form actually linking to the entity we assume it links to. For our example, this would result in the following discovered related entities
The document is clearly about Tank Johnson. However, in this example, we see plenty of surface forms that support Tank (in green). Since Tank was convicted for gun possession, we find lots of references to weapons and arms. In this case Tank might look like a correct link too.
However, this is where the machine learning comes in. It’s not about which entity has the most related entities (even though this is an intuition behind the approach), but its about patterns that emerge after having had enough examples. Patterns that might not be directly obvious for us, mere humans. Remember that there in a typical approach, there are lots of examples (for our submission we had on average around 300 entity candidates per query). And there are similarly lots of features (again, for our submission we calculated around 75 features per query-entity pair). Either way, to cut a long story short, our context-based approach managed to correctly decide that Tank Johnson is the correct entity in this example case!