|Title||Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams|
|Author||David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke|
|Publication type||Full paper|
|Conference name||36th European Conference on Information Retrieval (ECIR ’14)|
|Conference location||Amsterdam, The Netherlands|
|Abstract||The manual curation of knowledge bases is a bottleneck in fast paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show how our method significantly outperforms a lexical-matching baseline, by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and textual quality of input documents.|
|Full paper||PDF [256 KB]|
Several Dutch media have picked up our work:
Original press release:
- English: New method supplements Wikipedia with Twitter topics
- Dutch: Nieuwe methode vult Wikipedia aan met onderwerpen Twitter
- Tweakers.net: Nieuw algoritme vult Wikipedia aan op basis van Twitterberichten
- DeMorgen.be: Nieuw algoritme vult Wikipedia aan op basis van tweets
- Emerce: Nederlands algoritme reikt nieuwe Wikipedia onderwerpen aan
- Twittermania: Wordt Wikipedia straks gevuld via tweets?
- Z24: Algoritme gebruikt Twitter om nieuwe Wikipedia-artikelen te voorspellen
Slides of my talk at #ECIR2014 are now up on Slideshare;
Download a pre-print: Graus, D., Tsagkias, E., Buitinck, L., & de Rijke, M., “Generating pseudo-ground truth for predicting new concepts in social streams,” in 36th European Conference on Information Retrieval (ECIR’14), 2014. [PDF, 258KB]
This blog post is intended as a high-level overview of what we did. Remember my last post on entity linking? In this paper we want to do entity linking for entities that are not (yet) on Wikipedia, or:
Recognizing (finding) and classifying (determining their type: persons, locations or organizations) unknown (not in the knowledge base) entities on Twitter (this is where we want to find them)
These entities might be unknown because they are newly surfacing (e.g. a new popstar that breaks through), or because they are so-called ‘long tail’ entities (i.e. very infrequently occurring entities).
To detect these entities, we generate training data to train a supervised named-entity recognizer and classifier (NERC). Training data is hard to come by: it is expensive to have people manually label tweets, and you need enough labels to make it work. We automate this process by using the output of an entity linker to label tweets. The advantage is that this is a very cheap and easy way to create a large set of training data. The disadvantage is that it may be noisier: wrong labels, or bad tweets that do not contain enough information to learn the patterns needed to recognize the types of entities we are looking for.
To address the latter obstacle, we apply several methods to filter tweets and keep only those we deem ‘nice’. One of these methods scores tweets by their noisiness. We used very simple features to determine a tweet’s ‘noise level’: among others, how many mentions (@’s), hashtags (#’s) and URLs it contains, but also the ratio of upper-case to lower-case letters, the average word length, the tweet’s length, etc. An example of this Twitter noise score is below (these are tweets from the TREC 2011 Microblog corpus we used):
Top 5 quality Tweets
- Watching the History channel, Hitler’s Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity.
- When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction.
- So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c)
- These late spectacles were about as representative of the real West as porn movies are of the pizza delivery business Que LOL
- Sudan’s split and emergence of an independent nation has politico-strategic significance. No African watcher should ignore this.
Top 5 noisy Tweets
- Toni Braxton ~ He Wasnt Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_
- tell me what u think The GetMore Girls, Part One _URL_
- this girl better not go off on me rt
- you done know its funky! — Bill Withers “Kissing My Love” _URL_ via _Mention_
- This is great: _URL_ via _URL_
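As an illustration, a toy version of such a noise score could look like the sketch below. The features match those listed above, but the weights and the exact combination are my own assumptions, not the paper’s:

```python
import re

def noise_score(tweet):
    """Toy 'noise level' for a tweet: higher means noisier."""
    words = tweet.split()
    n_mentions = sum(w.startswith("@") for w in words)
    n_hashtags = sum(w.startswith("#") for w in words)
    n_urls = len(re.findall(r"https?://\S+", tweet))
    letters = [c for c in tweet if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return (2 * (n_mentions + n_hashtags + n_urls)  # Twitter-specific tokens
            + 5 * abs(upper_ratio - 0.05)           # deviation from typical casing
            + max(0.0, 5 - avg_word_len))           # very short words look noisy
```

A clean, well-formed sentence scores near zero, while a tweet full of mentions, hashtags and URLs scores high; sorting tweets by such a score yields rankings like the ones above.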
In addition, we filter Tweets based on the confidence score of the entity linker, so as not to include Tweets that contain unlikely labels.
It is difficult to measure how well we do at finding entities that do not exist on Wikipedia, since we need some sort of ground truth to determine whether we did well. As we cannot manually check 80,000 tweets to see whether the identified entities are in or out of Wikipedia, we take a slightly theoretical approach.
If I were to put it in a picture (and I did, conveniently), it’d look like this:
In brief, we take small ‘samples’ of Wikipedia: one such sample represents the “present KB”, the initial state of the KB. The samples are created by removing X% of the Wikipedia pages (from 10% to 90%, in steps of 10). We then label tweets using the full KB (100%) to create the ground truth: this full KB represents the “future KB”. Our “present KB” then labels the tweets it knows, and uses the tweets it cannot link as sources for new entities. If the NERC (trained on the tweets labeled by the present KB) manages to identify entities in the set of “unlinkable” tweets, we can compare its predictions to the ground truth and measure performance.
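A sketch of this setup, where the entity names and the tweet-labelling step are simplified assumptions:

```python
import random

random.seed(42)
full_kb = {f"entity_{i}" for i in range(1000)}  # the "future KB" (100%)
removed_fraction = 0.3                          # X% of pages removed
keep = round(len(full_kb) * (1 - removed_fraction))
present_kb = set(random.sample(sorted(full_kb), keep))

def split_by_kb(tweet_entities, kb):
    """Split a tweet's entities into linkable and unlinkable w.r.t. a KB."""
    linked = [e for e in tweet_entities if e in kb]
    unlinkable = [e for e in tweet_entities if e not in kb]
    return linked, unlinkable

# Unlinkable entities that ARE in the future KB are exactly the "new"
# entities the NERC should learn to predict.
linked, unlinkable = split_by_kb(["entity_1", "entity_999"], present_kb)
new_entities = [e for e in unlinkable if e in full_kb]
```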
Results & Findings
We report on standard metrics, precision and recall, on two levels: the entity level and the mention level. I won’t go into detail here; I encourage you to read the results and findings in the paper.
Download a pre-print of Graus, D., Peetz, M-H., Odijk, D., de Rooij, Ork., de Rijke, M. “yourHistory — Semantic linking for a personalized timeline of historic events,” in CEUR Workshop Proceedings, 2014.
I presented yourHistory at ICT.OPEN 2013:
The slides of my talk are up on SlideShare:
And we got nominated for the “Innovation & Entrepreneurship Award” there! (sadly, didn’t win though ;) ).
For the LinkedUp Challenge Veni competition at the Open Knowledge Conference (OKCon), we (Maria-Hendrike Peetz, me, Daan Odijk, Ork de Rooij and Maarten de Rijke) created yourHistory: a Facebook app that uses entity linking for personalized historic timeline generation (using d3.js). Our app was shortlisted (top 8 out of 22 submissions) and is in the running for the first prize of 2000 euro!
Read a small abstract here:
In history we often study dates and events that have little to do with our own life. We make history tangible by showing historic events that are personal and based on your own interests (your Facebook profile). Often, those events are small-scale and escape history books. By linking personal historic events to global events, we link your life with global history: writing your own personal history book.
Read the full story here;
And try out the app here!
It’s currently still a little rough around the edges. There’s an extensive to-do list, but if you have any feedback or remarks, don’t hesitate to leave me a message below!
The goal of this post is to make the research I’m doing understandable to the general public. You know, to explain what I’m doing in a way that not my peers, but my parents would understand. In part because the majority of my blog’s returning visitors are my parents, in part because lots of people think it’s a good idea for scientists to blog about their work, and in part because I like blogging. And finally, I suppose, because this research is made possible by people who pay their taxes ;-).
In this post I’ll try to explain the paper ‘Context-Based Entity Linking – University Of Amsterdam at TAC 2012’ I wrote with Edgar Meij, Tom Kenter, Marc Bron and Maarten de Rijke. It will also hopefully provide some basic understanding of machine learning.
Paper: ‘Context-Based Entity Linking – University Of Amsterdam at TAC 2012’ (131.24 KB)
Entity linking is the task of linking a word in a piece of text, to an ‘entity’ or ‘concept’ from a knowledge base (think: Wikipedia). Why would we want to? Because it allows us to automatically detect what is being talked about in a document, as opposed to seeing what words it is composed of. It allows us to generate extra context, it allows us to generate metadata which can improve searching and archiving. It moves one step beyond simple word-based analysis. We want that.
The Text Analysis Conference is a yearly ‘benchmark event’, where a dataset is provided (lots of documents, a knowledge base, and a list of queries, words or ‘entity mentions’ that occur in the documents). I describe the task in more detail here. We participated in this track, by building and modifying a system that was created for entity-linking tweets.
We start by taking our query, and search the knowledge base for entities that match it. Let’s take an example:
Chicago Bears defensive tackle Tank Johnson was sentenced to jail on Thursday for violating probation on a 2005 weapons conviction. A cook county Judge gave Johnson a prison sentence that the Chicago Tribune reported on its website to be 120 days. According to the report, Johnson also was given 84 days of home confinement and fined 2,500 dollars. Johnson, who has a history of gun violations, faced up to one year in prison. “We continue our support of Tank and he will remain a member of our football team,” the Bears said in a statement. “Tank has made many positive changes to better his life. We believe he will continue on this path at the conclusion of his sentence.” A 2004 second-round pick, Johnson pleaded guilty to a misdemeanor gun possession charge in November 2005 and was placed on 18 months probation. Johnson was arrested December 14 when police raided his home and found three handguns, three rifles and more than 500 rounds of ammunition. He pleaded guilty on January 10 to 10 misdemeanor charges stemming from the raid. Two days after his arrest, Johnson was at a Chicago nightclub when his bodyguard and housemate, Willie B. Posey, was killed by gunfire. Johnson required special permission from the court to travel out of state to play in the Super Bowl in Miami on February 4.
For the sake of simplicity, let’s assume we find two candidate entities in our knowledge base: Tank (Military vehicle) and Tank Johnson (football player). For each query-candidate pair, we calculate different statistics, or features. For example, some features of our initial ‘microblog post entity linker’ could be:
- How similar is the query to the title of the candidate?
- Does the query occur in the candidate title?
- Does the candidate title occur in the query?
- How frequently does the query occur in the text of the candidate?
For our example, this would be:

For the candidate Tank Johnson:
- Tank and Tank Johnson share 4 letters, and differ in 7 (7)
- Yes (1)
- No (0)
- 3 times (6)

For the candidate Tank (military vehicle):
- Tank and Tank share 4 letters, and differ in none (0)
- Yes (1)
- Yes (1)
- 20 times (20)
Given these features, we generate vectors of their values like so:
|Query||Candidate||feature 1||feature 2||feature 3||feature 4|
|Tank||Tank Johnson||7||1||0||6|
|Tank||Tank (military vehicle)||0||1||1||20|
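These features could be sketched as follows. Interpreting the first feature as an edit distance over letters (ignoring spaces) is my assumption; the paper’s exact definitions may differ:

```python
def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def extract_features(query, cand_title, cand_text):
    """One feature vector for a query-candidate pair."""
    q, t = query.lower(), cand_title.lower()
    return [
        edit_distance(q.replace(" ", ""), t.replace(" ", "")),  # feature 1
        int(q in t),                                            # feature 2
        int(t in q),                                            # feature 3
        cand_text.lower().count(q),                             # feature 4
    ]
```

For "Tank" vs. "Tank Johnson" this reproduces 7, 1 and 0 for the first three features.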
Given these features (of a multitude of examples, as opposed to just two), we ‘train’ a machine learning algorithm. Such an algorithm aims to learn patterns from the data it receives, allowing it to make predictions on new, unseen data. To train the algorithm, we label our examples by assigning classes to them.
In our case we have two classes: correct examples (class 1) and incorrect examples (class 0). This means that for a machine learning approach, we need ground truth to train our algorithm with: examples of query-candidate pairs where we know which candidate is correct. Typically, we use data from previous years of the same task to train our system. In our example, we know the correct entity is the first (Tank Johnson), so we label it ‘correct’; the other entities are labelled ‘incorrect’.
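To make the training step concrete, here is a tiny perceptron trained on our two labelled feature vectors. A real system would use a proper learner (e.g. logistic regression or decision trees) and far more data; the learner, weights and data here are purely illustrative:

```python
def train_perceptron(X, y, epochs=20, lr=0.1):
    """Learn weights (plus a bias term) from labelled feature vectors."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            xb = xi + [1.0]  # append bias input
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xb)) > 0 else 0
            w = [wj + lr * (yi - pred) * xj for wj, xj in zip(w, xb)]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x + [1.0])) > 0 else 0

# Our two labelled examples: (Tank Johnson, correct) and (Tank, incorrect).
X = [[7, 1, 0, 6], [0, 1, 1, 20]]
y = [1, 0]
w = train_perceptron(X, y)
```

After training, `predict` separates the two examples correctly, which is all this toy setup is meant to show.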
So, that’s the general approach, but what’s new, in our approach?
We extended our initial approach with two methods that use information from the reference document (as you might have noticed, the previous features were mostly about the query, and the candidate, ignoring the reference document). In this post, I’ll talk about one of those two.
This approach takes advantage of the ‘structure’ of our knowledge base, in our case: hyperlinks between entities. For example: the text of Tank Johnson contains links to other entities like Chicago Bears (the football team Tank played for), Gary, Indiana (Tank’s place of birth), Excalibur nightclub (the place where Tank was arrested), etc. The page for the military vehicle Tank contains links to main gun, battlefield, Tanks in World War I, etc.
We assume there exists some semantic relationship between these entities and Tank Johnson (we don’t care about how exactly they are related), and we try to take advantage of this by seeing whether these ‘related entities’ occur in the reference document. The assumption is that if we find a lot of related entities for a candidate entity, it is likely to be the correct entity.
We generate features that are about these related entities in the reference document. For example: how many related entities do we find? What proportion do they make up of the candidate’s total number of related entities? We do so by searching the reference document for surface forms of the related entities: titles of related entities, but also anchor texts (the text in blue, above), which allow us to calculate statistics approximating the likelihood that a surface form actually links to the entity we assume it links to. For our example, this results in the following discovered related entities:
The document is clearly about Tank Johnson. However, in this example, we see plenty of surface forms that support Tank (in green). Since Tank was convicted for gun possession, we find lots of references to weapons and arms. In this case Tank might look like a correct link too.
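A simplified version of this related-entity matching could look like the sketch below; the surface forms and the naive substring matching are illustrative assumptions (real surface forms come from titles and anchor texts, as described above):

```python
def related_entity_features(doc_text, related_surface_forms):
    """Count how many of a candidate's related entities appear in the document,
    and what proportion of the candidate's related entities that is."""
    doc = doc_text.lower()
    found = [sf for sf in related_surface_forms if sf.lower() in doc]
    total = len(related_surface_forms)
    return len(found), len(found) / total if total else 0.0

# Some related entities of the candidate Tank Johnson (from its KB page).
related = ["Chicago Bears", "Gary, Indiana", "Excalibur nightclub"]
doc = "Chicago Bears defensive tackle Tank Johnson was sentenced to jail ..."
```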
However, this is where the machine learning comes in. It’s not about which entity has the most related entities (even though this is an intuition behind the approach), but about the patterns that emerge after having seen enough examples; patterns that might not be directly obvious to us mere humans. Remember that in a typical approach there are lots of examples (for our submission we had on average around 300 entity candidates per query), and similarly lots of features (again, for our submission we calculated around 75 features per query-entity pair). Either way, to cut a long story short, our context-based approach managed to correctly decide that Tank Johnson is the correct entity in this example!
|Title||Context-Based Entity Linking – University Of Amsterdam At TAC 2012 [link]|
|Author||Graus, D.P., Kenter, T.M., Bron, M.M., Meij, E.J., de Rijke, M.|
|Publication type||Conference Proceedings|
|Conference name||Text Analysis Conference 2012|
|Conference location||Gaithersburg, MD|
|Abstract||This paper describes our approach to the 2012 Text Analysis Conference (TAC) Knowledge Base Population (KBP) entity linking track. For this task, we turn to a state-of-the-art system for entity linking in microblog posts. Compared to the little context microblog posts provide, the documents in the TAC KBP track provide context of greater length and of a less noisy nature. In this paper, we adapt the entity linking system for microblog posts to the KBP task by extending it with approaches that explicitly rely on the query’s context. We show that incorporating novel features that leverage the context on the entity-level can lead to improved performance in the TAC KBP task.|
|Full paper||PDF (131.42 KB)|
The last couple of weeks I’ve been diving into the task of entity linking, in the context of a submission to the Text Analysis Conference Knowledge Base Population track (that’s quite a mouthful; TAC KBP from now on): a ‘contest’ in knowledge base population with a standardized task, dataset and evaluation. Before devoting my next post to our submission, let me first explain the task of entity linking :-).
Knowledge Base Population
Knowledge Base Population is an information extraction task of generating knowledge bases from raw, unstructured text. A Knowledge Base is essentially a database which describes unique concepts (things) and contains information about these concepts. Wikipedia is a Knowledge Base: each article represents a unique concept, the article’s body contains information about the concept, the infobox provides ‘structured’ information, and links to other pages provide ‘semantic’ (or rather: relational) information.
Filling a Knowledge Base from raw textual data can be broadly split up into two subtasks: entity linking (identifying unique entities in a corpus of documents to add to the Knowledge Base) and slot filling (finding relations of and between these unique entities, and adding these to the concepts in the KB).
Entity linking is the task of linking a concept (thing) to a mention (word/words) in a document. Not unlike semantic annotation, this task is essentially about defining the meaning of a word, by assigning the correct concept to it. Consider these examples:
Bush went to war in Iraq
Bush won over Dukakis
A bush under a tree in the forest
A tree in the forest
A hierarchical tree
It is clear that the bushes and trees in these sentences refer to different concepts. The idea is to link the bold words to the people or things they refer to. To complete this task, we are given:
- KB: a Wikipedia-derived knowledge base containing concepts, people, places, etc. Each concept has a title, text (the Wikipedia page’s content), some metadata (infobox properties), etc.
- DOC: the (context) document in which the entity we are trying to link occurs (in the case of the examples: the sentence)
- Q: The entity-mention query in the document (the word)
The goal is to identify whether the entity referred to by Q is in the KB: if it isn’t, it should be considered a ‘new’ entity. All the new entities should be clustered; that means that when two documents refer to the same ‘new’ entity, this must be represented by assigning the same new ID to both words.
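To illustrate the ID assignment, here is a minimal sketch that clusters NIL mentions by exact (case-insensitive) string match; real systems use fuzzier similarity measures or topic models, so this is only an illustrative assumption:

```python
# Registry of 'new' entities discovered so far, keyed by normalized mention.
nil_ids = {}

def nil_id(mention):
    """Assign a stable NIL id: identical (normalized) mentions share one id."""
    key = mention.lower().strip()
    if key not in nil_ids:
        nil_ids[key] = f"NIL{len(nil_ids) + 1:04d}"
    return nil_ids[key]
```

With this scheme, “Tank Johnson” and “tank johnson” in two different documents receive the same new ID, while a different unseen mention gets a fresh one.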
A common approach
1. Query Expansion: This means finding more surface forms that refer to entity Q. Two approaches:
I: This can be derived from the document itself, using for example ‘coreference resolution’. In this case you try to identify all strings referring to the same entity. In the Bush example, you might find “President Bush” or “George W. Bush” somewhere in the document.
II: This can also be done by using external information sources, such as looking up the word in a database of Wikipedia anchor texts, page titles, redirect strings or disambiguation pages. Using Wikipedia to expand ‘Q=cheese’ could lead to:
Title: Cheese. Redirects: Home cheesemaking, Cheeses, CHEESE, Cheeze, Chees, Chese, Coagulated milk curd. Anchors: cheese, Cheese, cheeses, Cheeses, maturation, CHEESE, 450 varieties of cheese, Queso, Soft cheese, aging process of cheese, chees, cheese factor, cheese wheel, cheis, double and triple cream cheese, formaggi, fromage, hard cheeses, kebbuck, masification, semi-hard to hard, soft cheese, soft-ripened, wheel of cheese, Fromage, cheese making, cheese-making, cheesy, coagulated, curds, grated cheese, hard, lyres, washed rind, wheels.
2. Candidate Generation: For each surface form of Q, try to find KB entries that could be referred to. Simple approaches are searching for Wikipedia titles that contain the surface form, looking through anchor link texts (titles used to refer to a specific Wikipedia page in another Wikipedia page), expanding acronyms (if you find a string containing only uppercase-letters, try to find a matching word sequence).
3. Candidate Ranking: The final step would be selecting the most probable candidate from the previous step. Simple approaches can be comparing the similarity of the context document to each candidate document (Wikipedia page), more advanced approaches involve measuring semantic similarity on higher levels: e.g. by finding ‘related’ entities in the context document.
4. NIL Clustering: Whenever no candidate can be found (or only candidates with a low probability of being the right one, measured in any way), it could be decided that the entity referred to is not in the KB. In this case, the job is to assign a new ID to the entity, and if it is ever referred to in a later document, attach this same ID. This is a matter of (unsupervised) clustering. Successful approaches include simple string similarity (same ‘new’ entities being referred to by the same word), document similarity (using simple comparisons) or more advanced clustering approaches such as LDA/HDP.
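The four steps above can be strung together in a toy end-to-end pipeline like this. The KB, the trivial expansion, the word-overlap scorer and the threshold are all illustrative assumptions, far simpler than any competitive TAC system:

```python
KB = {
    "George W. Bush": "43rd president of the united states iraq war",
    "George H. W. Bush": "41st president dukakis 1988 election",
    "Bush (shrub)": "a small woody plant in the forest under a tree",
}

def expand(query):
    # step 1: query expansion (real systems use coreference resolution
    # and Wikipedia redirects/anchors instead of trivial case variants)
    return {query, query.lower(), query.title()}

def generate(surface_forms):
    # step 2: candidate generation by substring match on KB titles
    return [t for t in KB
            if any(sf.lower() in t.lower() for sf in surface_forms)]

def rank(candidates, doc):
    # step 3: score each candidate by word overlap with the context document
    doc_words = set(doc.lower().split())
    scored = [(len(doc_words & set(KB[c].split())), c) for c in candidates]
    return max(scored) if scored else (0, None)

def link(query, doc, threshold=2):
    score, best = rank(generate(expand(query)), doc)
    # step 4: fall back to NIL when confidence is too low
    return best if best and score >= threshold else "NIL"
```

For example, `link("Bush", "bush won over dukakis in the 1988 election")` resolves to the elder Bush, while a query in an unrelated document falls through to NIL.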
Now this is just a general introduction. If you are interested in the technical side of some common approaches, take a look at the TAC Proceedings, particularly the Overview of the TAC2011 Knowledge Base Population Track [PDF] and the Proceedings Papers. If you are unsure why Wikipedia is an awesome resource for entity linking (or any form of extracting structured information from unstructured text), I’d recommend you read ‘Mining Meaning from Wikipedia’ (Medelyan et al., 2009).
Next post will hopefully be about ILPS’ award-winning entity linking system, so stay tuned ;-).