“Dynamic Collective Entity Representations for Entity Ranking” paper accepted at WSDM2016

📅 October 12, 2015 • 🕐 21:56 • 🏷 Research
Download the poster (PDF, 3.4 MB)


Our paper “Dynamic Collective Entity Representations for Entity Ranking,” with Manos Tsagkias, Wouter Weerkamp, Edgar Meij and Maarten de Rijke was accepted at The 9th ACM International Conference on Web Search and Data Mining (WSDM2016). Read the extended one-page abstract (submitted to DIR 2015) here (PDF, 200kb).

Abstract: Entity ranking, i.e., successfully positioning a relevant entity at the top of the ranking for a given query, is inherently difficult due to the potential mismatch between the entity’s description in a knowledge base, and the way people refer to the entity when searching for it. To counter this issue we propose a method for constructing dynamic collective entity representations. We collect entity descriptions from a variety of sources and combine them into a single entity representation by learning to weight the content from different sources that is associated with an entity for optimal retrieval effectiveness. Our method is able to add new descriptions in real time, and learn the best representation at set time intervals as time evolves so as to capture the dynamics in how people search entities. Incorporating dynamic description sources into dynamic collective entity representations improves retrieval effectiveness by 7% over a state-of-the-art learning to rank baseline. Periodic retraining of the ranker enables higher ranking effectiveness for dynamic collective entity representations.

I will post a pre-print here soon.

Update: Cool! Our paper has been selected for presentation as a long talk at the conference.

Update 2: The extended abstract of this paper has been accepted for poster + oral presentation at the 14th Dutch-Belgian Information Retrieval Workshop (DIR 2015). I’ve uploaded the slides of my DIR talk here.

Understanding Email Traffic: Social Network Analysis Meets Language Modeling

📅 May 11, 2014 • 🕐 14:00 • 🏷 Blog and Research

In our paper “Recipient recommendation in enterprises using communication graphs and email content” we study email traffic by looking into recipient recommendation: given an email without recipients, can we predict to whom it should be sent? Successfully predicting this helps in understanding the underlying mechanics and structure of an email network. To model this prediction task we consider the email traffic as a network, or graph, where each unique email account (user) corresponds to a node, and edges correspond to emails sent between users (see, e.g., Telecommunications network on Wikipedia).

Google does recipient recommendation (in Gmail) by considering a user’s so-called ego-network, i.e., a single user’s previously sent and received emails. When you frequently email Alan and Bob jointly, Gmail might suggest that you include Alan when you compose a new message to Bob. This approach only considers previous interactions between you and others (restricted to the ego-network), and ignores signals such as the content of an email. This means that Gmail can only start recommending users once you’ve addressed at least one recipient (technically, this isn’t recipient recommendation, but rather “CC prediction”).

We decided to see what we can do if we consider all information available in the network, i.e., both the full communication graph (beyond the user’s ego-network), and the content of all emails present in the network (intuition: if you write a message with personal topics, the intended recipient is more likely to be a friend than a coworker). In short, this comes down to combining:

  1. Social network analysis (SNA), to estimate how “close” two emailers are in the network. We hypothesize that the closer two people are, the more likely they are to email each other.
  2. Language modeling (LM), to estimate how strongly an email is associated with a recipient. We estimate this by generating personal language models for each user in the network. A language model is a statistical model that estimates, for a set of documents (in our case, a user’s sent and received emails), the probability of observing a word: words that you frequently use receive high probabilities, and words that you never use receive low probabilities. In effect, this language model corresponds to a user’s “language profile”. Representing each user through language models (that represent their communication) allows us to compare users, but also to do more fancy stuff which I’ll get into later.

We model the task of recommending recipients as that of ranking users. Or, given a sender (you) and an email (the email you wrote) the task is to rank highest those users in the network that are most likely to receive your email. This ranking should happen in a streaming setting, where we update all models (language and network) for each new email that is being sent (so that we do not use “future emails” in predicting the recipients). This means that the network and language models change over time, and adapt to changes in language use, topics being discussed, but also the ‘distance’ between users in the network.

Generative model

We use a generative model to rank recipients, by estimating the probability of observing a recipient (R), given an email (E) and sender (S):

[latex]P(R|S,E)[/latex]

If you don’t get this, don’t worry. In human language this reads as: the probability (P) of observing recipient R, given sender S and email E. We compute this probability for each pair of users in the network, and rank the resulting probabilities to find the most likely sender & recipient pair.

In this ranking function, we consider three components to estimate this probability (see our paper for how we use Bayes’ Theorem to end up with this final ranking function). One corresponds to the email content, the other two correspond to the SNA properties:

[latex]P(R|S,E) \propto P(E|R,S) \cdot P(S|R) \cdot P(R)[/latex]

Email content

The first component ([latex]P(E|R,S)[/latex], reads: probability of observing Email E, given sender S and recipient R) leverages email content, and corresponds to the email likelihood (i.e., how likely it is for email E to be generated by the interpersonal language model (explained below) of S and R). For each user in the network we generate language models, which allows us to compare and combine communication between users in different ways. We thus model, e.g.:

  1. Each user’s incoming email-LM, modeled by taking all the emails that are sent to the user. This corresponds to “how people talk to the user”
  2. Each user’s outgoing email-LM, modeled by taking all the emails that the user has sent. This corresponds to “how the user talks to others”
  3. Each user’s joint (incoming+outgoing) LM, which is the combination of the above two.

Finally, using these different language models, we model interpersonal language models, or the communication between two users (taking all email traffic between user A and user B). See the picture below for an illustration of these different language models.


This way of modeling email communication can be applied to more cool things that we didn’t fully explore for this paper, e.g., finding users that use significantly different language from the rest, by comparing how much a user’s incoming, outgoing or joint LM differs from the corpus LM. Or comparing the interpersonal LMs that are associated with a single user, to identify a significantly different one (imagine comparing your emails with coworkers to those with your boyfriend/girlfriend/spouse). Future work! (?)
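To make the different language models from this section a bit more concrete, here is a minimal sketch in Python. It assumes plain maximum-likelihood unigram models over whitespace-separated words; the toy `EMAILS` list and all function names are made up for illustration, and the paper’s actual estimation (e.g., smoothing) may differ.

```python
from collections import Counter

# Hypothetical toy mailbox: (sender, recipients, text).
EMAILS = [
    ("alice", ["bob"], "budget meeting tomorrow"),
    ("bob", ["alice"], "budget approved"),
    ("alice", ["carol"], "dinner tomorrow"),
]


def unigram_lm(texts):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def incoming_lm(user, emails):
    """'How people talk to the user': all emails sent to the user."""
    return unigram_lm(text for _, recipients, text in emails if user in recipients)


def outgoing_lm(user, emails):
    """'How the user talks to others': all emails the user has sent."""
    return unigram_lm(text for sender, _, text in emails if sender == user)


def joint_lm(user, emails):
    """Incoming and outgoing emails combined."""
    return unigram_lm(
        text for sender, recipients, text in emails
        if sender == user or user in recipients
    )


def interpersonal_lm(a, b, emails):
    """All email traffic between user a and user b, in either direction."""
    return unigram_lm(
        text for sender, recipients, text in emails
        if (sender == a and b in recipients) or (sender == b and a in recipients)
    )
```

For example, `interpersonal_lm("alice", "bob", EMAILS)` only looks at the two emails exchanged between alice and bob, while `outgoing_lm("alice", EMAILS)` also includes her email to carol.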

Communication graph

The second component ([latex]P(S|R)[/latex], reads: probability of observing sender S given recipient R) corresponds to the closeness of sender S and candidate recipient R, in SNA terms. We explore two approaches to estimating this closeness: (1) how many times S and R co-occur in an email (i.e., are addressed together), and (2) the number of emails sent between S and R.

The third and final component ([latex]P(R)[/latex], reads: probability of observing recipient R) corresponds to the prior probability of observing candidate recipient R (i.e., how likely is it for R to receive any email at all?). We model this by (1) counting the number of emails R has received, and (2) the PageRank score of R (favoring ‘important’ recipients).
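Putting the three components together, a rough sketch of the ranking function could look like the code below. This is not the implementation from the paper: the add-one smoothing, the normalisations, and all names (`EMAILS`, `rank_recipients`, and so on) are assumptions for illustration. It uses the “number of emails sent between S and R” variant for [latex]P(S|R)[/latex] and the “number of emails received” variant for [latex]P(R)[/latex], and it works in log space so that the product becomes a sum.

```python
import math
from collections import Counter

# Hypothetical toy mailbox: (sender, recipients, text), oldest first.
EMAILS = [
    ("alice", ["bob"], "budget meeting tomorrow"),
    ("alice", ["bob"], "budget numbers attached"),
    ("alice", ["carol"], "dinner tomorrow"),
    ("bob", ["carol"], "lunch today"),
]


def users_in(emails):
    return {s for s, _, _ in emails} | {r for _, rs, _ in emails for r in rs}


def between(a, b, emails):
    """Emails exchanged between a and b, in either direction."""
    return [
        (s, rs, text) for s, rs, text in emails
        if (s == a and b in rs) or (s == b and a in rs)
    ]


def log_email_likelihood(words, sender, recipient, emails):
    """log P(E|R,S) under an add-one smoothed interpersonal unigram LM."""
    vocab = {w for _, _, text in emails for w in text.lower().split()}
    counts = Counter(
        w for _, _, text in between(sender, recipient, emails)
        for w in text.lower().split()
    )
    total = sum(counts.values())
    # The extra +1 in the denominator reserves some mass for unseen words.
    return sum(math.log((counts[w] + 1) / (total + len(vocab) + 1)) for w in words)


def log_sender_given_recipient(sender, recipient, emails):
    """log P(S|R): share of R's email traffic that is exchanged with S."""
    involving_r = [e for e in emails if e[0] == recipient or recipient in e[1]]
    shared = between(sender, recipient, emails)
    return math.log((len(shared) + 1) / (len(involving_r) + len(users_in(emails))))


def log_prior(recipient, emails):
    """log P(R): share of all delivered emails that were received by R."""
    received = sum(recipient in rs for _, rs, _ in emails)
    delivered = sum(len(rs) for _, rs, _ in emails)
    return math.log((received + 1) / (delivered + len(users_in(emails))))


def rank_recipients(sender, text, emails):
    """Rank all other users by log P(E|R,S) + log P(S|R) + log P(R)."""
    words = text.lower().split()
    scores = {
        r: log_email_likelihood(words, sender, r, emails)
        + log_sender_given_recipient(sender, r, emails)
        + log_prior(r, emails)
        for r in users_in(emails) if r != sender
    }
    return sorted(scores, key=scores.get, reverse=True)
```

In the streaming setting described earlier, `emails` would only contain the emails sent before the one we are predicting recipients for. The PageRank prior is left out to keep the sketch short.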


We use the notorious Enron email corpus to find the best methods to estimate our components. Then, we use the very new, soon-to-be-released Avocado corpus to evaluate our model. I won’t go into the details of our experiments (see the paper for those!), but suffice it to say that we compare the effectiveness of the email content (LM) component and the social network analysis (SNA) components. There are several findings worth mentioning:

  • Combining both components (content & social network) is favorable.
  • For highly active users (i.e., those that send LOTS of emails) the LM approach does comparatively worse. We argue that the reason is that highly active users have a larger number of highly ranked candidate recipients (because there are more highly scoring interpersonal LMs), making it more difficult to pick the right one.
  • As time progresses, the SNA component does comparatively worse. We argue that this is because the network “settles in”. Consider a scenario where two users mail actively with each other for months, but then one of them disappears from the network (e.g., is fired from or leaves the enterprise). In such a case, our SNA component will continue to rank this user highly.
  • The LM component improves as time progresses (as it has more data to work with).

The solution for the two ‘issues’ (2nd and 3rd bullet) is to incorporate time in our models, e.g. by introducing a decay in the language modeling (older emails become less important), and edge weights in the SNA components (older interactions count less than recent ones).
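As a sketch of what such a decay could look like (this is future work, so everything here is an assumption: the exponential form, the 30-day half-life, and the function names are hypothetical), each email can be given a weight that halves every fixed number of days. The same weights could then replace raw counts in both the edge weights and the word counts of the language models.

```python
from collections import defaultdict


def decayed_weight(age_days, half_life_days=30.0):
    """Weight of an interaction that happened `age_days` ago."""
    return 0.5 ** (age_days / half_life_days)


def decayed_edge_weights(emails, now_day, half_life_days=30.0):
    """Sum decayed weights per (sender, recipient) edge.

    `emails` is a list of (sender, recipient, day) tuples, where `day`
    is the day number on which the email was sent.
    """
    weights = defaultdict(float)
    for sender, recipient, day in emails:
        weights[(sender, recipient)] += decayed_weight(now_day - day, half_life_days)
    return dict(weights)
```

With this weighting, a user who stopped emailing months ago slowly fades from the ranking instead of staying on top forever.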

Got it? Read the paper for the full story! (PDF here)

Information Retrieval at LegalTech 2014

📅 February 6, 2014 • 🕐 04:04 • 🏷 Blog and Research

Thanks to the kind lady at the registration desk I had the unexpected honor of representing the beautiful former Caribbean country of the Netherlands Antilles at LegalTech 2014, the self-proclaimed largest and most important legal technology event of the year.

David Graus at LegalTech

LegalTech is an “industry conference” where attorneys, lawyers, and IT people meet up and discuss the current and future state of law and IT. Product vendors show their software and tools aimed at making the life of the modern-day attorney easier. As I work on semantic search in eDiscovery, my reasons to attend (being generously invited by Jason Baron) were:

  1. To get a better overview and understanding of eDiscovery (in the US).
  2. To see what people consider the ‘future’ or important topics within eDiscovery.
  3. To understand what the current state of the art is in tools and applications.
  4. (To plug semantic search)

Indeed, in summary, to retrieve information! (As an IR researcher does). The conference included keynotes, conference tracks, panel discussions and a huge exhibitor show where over 100 vendors of eDiscovery-related software present their products. All this fits on just three floors of the beautiful Hilton Midtown Hotel in the middle of New York.

To get a feel for the topics and themes, track titles included, among others: eDiscovery, Transforming eDiscovery, Big Data, Information Governance, Advanced IT, Technology in Practice, Technology and Trends Transforming the Legal World, and Corporate Legal IT.


LegalTech is a playground for attorneys and lawyers, not so much for PhD students who work on information extraction and semantic search. Needless to say I was far from the typical attendee (possibly the most atypical). But LegalTech proved to be an informative and valuable crash course in eDiscovery for me (I think I can tick the boxes of all 4 of the aforementioned reasons for attending).


The keynotes allowed me to get a better understanding of eDiscovery (a.o., through hearing some of the founders of the eDiscovery world), the panel discussions were very useful in getting an understanding of the open problems, challenges and future directions, and finally the trade show allowed me to get a very complete overview of what is being built and used right now in terms of eDiscovery-supporting software.

I had varying success in talking to vendors about the stuff I was interested in: the technology and algorithms behind the tools, and the choices for including or excluding certain features and functionalities. More often than not, an innocently nerdy question on my part was turned into a software sales pitch. To be fair, these people were here to sell, or at least show, so this is hardly unexpected.

The tracks: my observations

During the different tracks and panel discussions I attended, I noticed a couple of things. This is by no means a complete overview of the current things that matter in eDiscovery, but just a personal report of the things I found interesting or noteworthy:

Some of the “open door” recurring themes revolved around the “man vs machine” debate, trust in algorithms, the balance between computer-assisted review and manual review, the intricacies of measuring algorithm performance, and where Moore’s law will bring the legal world in 5-10 years. These are highly relevant issues for attorneys, lawyers and eDiscovery vendors, but things that I take for granted and consider the starting point (default win for algorithms!). However, this debate does not seem to be settled in this domain yet. And while everyone accepts computer-assisted review as the unavoidable future, it is still unclear what exactly that future will look like.

On multiple occasions I heard video and image retrieval being mentioned as important future directions for eDiscovery (good news for some colleagues at the University of Amsterdam down the hall). Also, the challenges of privacy and data ownership in a mobile world, where enterprise and personal data are mixed and spread out across iPads, smartphones, laptops and clouds, were identified as major future hurdles.

Finally, in the session titled “Have we Reached a “John Henry” Moment in Evidentiary Search?” the panelists (who included Jason Baron and Ralph Losey) touched upon using eDiscovery tools and algorithms for information governance. Currently, methods are being developed to detect, reconstruct, classify or find events of interest after the fact. Couldn’t these be used in a predictive setting instead of a retrospective one, learning to predict bad stuff before it happens? Interesting stuff.

The tradeshow: metadata-heavy


What I noticed particularly at the trade show was that there was a large overlap both in tools’ functionality and features and their looks and designs. But what I found more striking is the heavy focus on metadata. The tools typically use metadata such as timestamps, authors, and document types to allow users to drill down through a dataset, filtering for time periods, keywords, authors, or a combination of all of these.

Visualizations aplenty, with the most frequent ones being Google Ngrams-ish keyword histograms, and networks (graphs) of interactions between people. What was shocking for an IR/IE person like myself is that typically, once a user is done drilling down to a subset of documents, they are relegated to prehistoric keyword search to explore and understand the content of that set. Oh no!

But for someone who’s spending 4 years of his life on enabling semantic search in this domain, this isn’t worrying but rather promising! After talking to vendors I learned that plenty of them are interested in these kinds of features and functionalities, so there is definitely room for innovation here. (However, to be fair, whether the target users agree might be another question.)


Anyway, this ‘metadata heaviness’ is obviously a gross oversimplification and generalization, and there were definitely some interesting companies that stood out for me. Here’s a small, incomplete, and biased summary:

  • I had some nice talks with the folks at CatalystSecure, whose senior applied research scientist and former IR academic (dr. Jeremy Pickens) was the ideal companion to be unashamedly nerdy with, talking about classification performance metrics, challenges in evaluating the “whole package” of the eDiscovery process, and awesome datasets.
  • RedOwl Analytics do some very impressive stuff with behavioural analytics, where they collect statistics for each ‘author’ in their data (such as number of emails sent and received, ‘time to respond’, number of times cc’ed), to get an ‘average baseline’ of a single dataset (enterprise), which they can use to recognize individuals who deviate from this average. The impressive part was that they were then able to map these deviations to behavioural traits (such as ‘probability of an employee leaving a company’, or on the other side of the spectrum identifying the ‘top employees’ that otherwise remain under the radar). How that works under the hood remains a mystery to me, but the types of questions they were able to answer in the demo were impressive.
  • Recommind’s CORE platform seems to rely heavily on topic modeling, and was able to infer topics from datasets. In doing so, Recommind shows we can indeed move beyond keyword search in a real product (and outside of academic papers ;-) ). This doesn’t come as a surprise, seeing that Recommind’s CTO dr. Jan Puzicha is of probabilistic latent semantic indexing (/analysis) fame.

What’s next?

As I hinted at before, I’m missing some more content-heavy functionalities, e.g., (temporal) entity and relation extraction, identity normalization, and maybe (multi-document) summarization. Conveniently, this is exactly what my group and I are working on! I suppose the eDiscovery world just doesn’t know what they’re missing yet ;-).

Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams

📅 December 3, 2013 • 🕐 13:57 • 🏷 Blog and Research

Our paper “Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams” with Manos Tsagkias, Lars Buitinck, and Maarten de Rijke got accepted as a full paper to ECIR 2014! See a preprint here:

  • [PDF] [DOI] D. Graus, M. Tsagkias, L. Buitinck, and M. de Rijke, “Generating pseudo-ground truth for predicting new concepts in social streams,” in Advances in information retrieval, Cham, 2014, p. 286–298.

Layman explanation

This blog post is intended as a high level overview of what we did. Remember my last post on entity linking? In this paper we want to do entity linking on entities that are not (yet) on Wikipedia, or:

Recognizing (finding) and classifying (determining their type: persons, locations or organizations) unknown (not in the knowledge base) entities on Twitter (this is where we want to find them)

These entities might be unknown because they are newly surfacing (e.g. a new popstar that breaks through), or because they are so-called ‘long tail’ entities (i.e. very infrequently occurring entities).


To detect these entities, we generate training data to train a supervised named-entity recognizer and classifier (NERC). Training data is hard to come by: it is expensive to have people manually label Tweets, and you need enough of these labels to make it work. We automate this process by using the output of an entity linker to label Tweets. The advantage is that this makes it cheap and easy to create a large set of training data. The disadvantage is that there might be more noise: wrong labels, or bad tweets that do not contain enough information to learn patterns to recognize the types of entities we are looking for.

To address this latter obstacle, we apply several methods to filter Tweets which we deem ‘nice’. One of these methods involves scoring Tweets based on their noise. We applied very simple features to determine this ‘noise-level’ of a tweet; amongst others how many mentions (@’s), hashtags (#’s) and URLs it contains, but also the ratio of upper-case to lower-case letters, the average word length, the tweet’s length, etc. An example of this Twitter-noise-score is below (these are Tweets from the TREC 2011 Microblog corpus we used):

Top 5 quality Tweets

  1. Watching the History channel, Hitler’s Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity.
  2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction.
  3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c)
  4. These late spectacles were about as representative of the real West as porn movies are of the pizza delivery business Que LOL
  5. Sudan’s split and emergence of an independent nation has politico-strategic significance. No African watcher should ignore this.

Top 5 noisy Tweets

  1. Toni Braxton ~ He Wasnt Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_
  2. tell me what u think The GetMore Girls, Part One _URL_
  3. this girl better not go off on me rt
  4. you done know its funky! — Bill Withers “Kissing My Love” _URL_ via _Mention_
  5. This is great: _URL_ via _URL_
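The noise features mentioned above are easy to compute. The sketch below extracts them for a raw tweet. How the features are combined into a single score is not described in this post, so the `noise_score` function (share of tokens that are mentions, hashtags or URLs) is only a made-up stand-in, and all names are hypothetical.

```python
def noise_features(tweet):
    """Simple surface features that hint at how 'noisy' a tweet is."""
    tokens = tweet.split()
    # Words are all tokens that are not mentions, hashtags or URLs.
    words = [t for t in tokens if not t.startswith(("@", "#", "http://", "https://"))]
    upper = sum(c.isupper() for c in tweet)
    lower = sum(c.islower() for c in tweet)
    return {
        "mentions": sum(t.startswith("@") for t in tokens),
        "hashtags": sum(t.startswith("#") for t in tokens),
        "urls": sum(t.startswith(("http://", "https://")) for t in tokens),
        "upper_lower_ratio": upper / lower if lower else 0.0,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "length": len(tweet),
    }


def noise_score(tweet):
    """Hypothetical score: fraction of tokens that are @, # or URL markers."""
    tokens = tweet.split()
    if not tokens:
        return 1.0
    f = noise_features(tweet)
    return (f["mentions"] + f["hashtags"] + f["urls"]) / len(tokens)
```

A tweet that consists mostly of links and mentions scores close to 1, while a plain sentence scores 0.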

In addition, we filter Tweets based on the confidence score of the entity linker, so as not to include Tweets that contain unlikely labels.

Experimental Setup

It is difficult to measure how well we do in finding entities that do not exist on Wikipedia, since we need some sort of ground truth to determine whether we did well or not. As we cannot manually check for 80,000 Tweets whether the identified entities are in or out of Wikipedia, we take a slightly theoretical approach.

If I were to put it in a picture (and I did, conveniently), it’d look like this:


In brief, we take small ‘samples’ of Wikipedia: each such sample represents the “present KB”, the initial state of the KB. The samples are created by removing X% of the Wikipedia pages (from 10% to 90%, in steps of 10). We then label Tweets using the full KB (100%) to create the ground truth: this full KB represents the “future KB”. Our “present KB” then labels the Tweets it knows, and uses the Tweets it cannot link as sources for new entities. If the NERC (trained on the Tweets labeled by the present KB) then manages to identify entities in the set of “unlinkable” Tweets, we can compare the predictions to the ground truth, and measure performance.
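In code, the sampling setup looks roughly like the sketch below. It assumes the KB is simply a set of entity titles and that each tweet comes with the entity the full KB linked it to. The function names and the fixed random seed are hypothetical, and training the NERC itself is left out.

```python
import random


def sample_present_kb(full_kb, removed_fraction, seed=0):
    """Simulate the 'present KB' by removing a fraction of the full KB."""
    rng = random.Random(seed)
    entities = sorted(full_kb)  # sort first so the sample is reproducible
    n_keep = round(len(entities) * (1 - removed_fraction))
    return set(rng.sample(entities, n_keep))


def split_tweets(tweets, present_kb):
    """Split (text, entity) pairs labeled by the full ('future') KB.

    Linkable tweets can be labeled by the present KB and serve as
    training data; unlinkable tweets mention entities the present KB
    does not know, and serve as ground truth for new entities.
    """
    linkable = [(text, entity) for text, entity in tweets if entity in present_kb]
    unlinkable = [(text, entity) for text, entity in tweets if entity not in present_kb]
    return linkable, unlinkable
```

Running this for `removed_fraction` values from 0.1 to 0.9 gives the nine KB samples described above.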

Results & Findings

We report on standard metrics: Precision & Recall, on two levels: entity and mention level. However, I won’t go into any details here, because I encourage you to read the results and findings in the paper.


Several Dutch media have picked up our work:

Original press release:

Press coverage:


Slides of my talk at #ECIR2014 are now up on Slideshare;

[slideshare id=33633324&doc=20140401ecir14newconceptssocialstreams-140417021202-phpapp01]

yourHistory – Entity linking for a personalized timeline of historic events

📅 September 14, 2013 • 🕐 13:31 • 🏷 Blog and Research

Download a pre-print of Graus, D., Peetz, M-H., Odijk, D., de Rooij, Ork., de Rijke, M. “yourHistory — Semantic linking for a personalized timeline of historic events,” in CEUR Workshop Proceedings, 2014.

Update #1

I presented yourHistory at ICT.OPEN 2013:

The slides of my talk are up on SlideShare:

yourHistory – entity linking for a personalized timeline of historic events from David Graus

And we got nominated for the “Innovation & Entrepreneurship Award” there! (sadly, didn’t win though ;) ).


Original Post

yourHistory - OKConference poster

For the LinkedUp Challenge Veni competition at the Open Knowledge Conference (OKCon), we (Maria-Hendrike Peetz, me, Daan Odijk, Ork de Rooij and Maarten de Rijke) created yourHistory; a Facebook app that uses entity linking for personalized historic timeline generation (using d3.js). Our app got shortlisted (top 8 out of 22 submissions) and is in the running for the first prize of 2000 euro!

Read a small abstract here:

In history we often study dates and events that have little to do with our own life. We make history tangible by showing historic events that are personal and based on your own interests (your Facebook profile). Often, those events are small-scale and escape history books. By linking personal historic events with global events, we link your life with global history: writing your own personal history book.

Read the full story here;

And try out the app here!

It’s currently still a little rough around the edges. There’s an extensive to-do list, but if you have any feedback or remarks, don’t hesitate to leave me a message below!

How many things took place between 1900 and today? DBPedia knows

📅 June 7, 2013 • 🕐 16:42 • 🏷 Blog and Research

For a top-secret project, I am looking at retrieving all entities that represent a ‘(historic) event’, from DBPedia.

Now I could rant about how horrible it is to actually formulate a ‘simple’ query like this, using the structured anarchistic Linked Data format, so I will: the request “give me all entities that represent ‘events’ from DBPedia” takes me 3 SPARQL queries, since different predicates represent the same thing, and I probably need a lot more to get a proper subset of the entities I’m looking for. Currently, I filter for entities that have a dbpedia-owl:date property, entities that have a dbprop:date property (yes, these predicates express the exact same property), and entities that belong to the Event class.
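For the curious, the three filters could be written down as something like the SPARQL below, wrapped in a bit of Python that only builds the query strings. These are not the exact queries I ran: the variable names and the `build_query` helper are made up for illustration, and sending the queries to a SPARQL endpoint is left out.

```python
# Namespace prefixes as commonly used for DBPedia.
PREFIXES = """\
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbprop: <http://dbpedia.org/property/>
"""

# One graph pattern per filter described above.
PATTERNS = {
    "ontology_date": "?event dbpedia-owl:date ?date .",
    "property_date": "?event dbprop:date ?date .",
    "event_class": "?event a dbpedia-owl:Event .",
}


def build_query(pattern):
    """Wrap a single graph pattern in a SELECT query."""
    return f"{PREFIXES}SELECT DISTINCT ?event WHERE {{ {pattern} }}"


QUERIES = {name: build_query(pattern) for name, pattern in PATTERNS.items()}
```

The union of the three result sets then gives the candidate ‘event’ entities.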

Anyway, if we count for each year how many event entities there are, we get the following graph:


Which is interesting, because it shows how there are loads of events in the near past, and around WWII, and around WWI. I could now say something about how interesting it is that our collective memory is focused on the near past, but then I looked at the events and saw loads of sports events, so I won’t, but rather say that back in the days we were terrible at organizing sports events. Still, the knowledge that between 1900 and today a total of 16,589 events happened seems significant to me.

We won the WoLE2013 Challenge

📅 May 14, 2013 • 🕐 14:07 • 🏷 Blog and Research

With our SemanticTED demo, we (Daan Odijk, Edgar Meij, Tom Kenter and me) won the Web of Linked Entities 2013 Workshop’s “Doing Good by Linking Entities” Developers Challenge (at WWW2013).

Read the paper of our submission here:

  • [PDF] D. Odijk, E. Meij, D. Graus, and T. Kenter, “Multilingual semantic linking for video streams: making “ideas worth sharing” more accessible,” in Proceedings of the 2nd international workshop on web of linked entities (wole 2013), 2013.

Now we get to share an iPad.



Context-based Entity Linking

📅 February 2, 2013 • 🕐 15:52 • 🏷 Blog and Research

The goal of this post is to make the research I’m doing understandable to the general public. You know, to explain what I’m doing in a way not my peers, but my parents would understand. In part because the majority of the returning visitors of my blog are composed of my parents, in part because lots of people think it’s a good idea for scientists to blog about their work, and in part because I like blogging. And finally, I suppose, because this research is made possible by people who pay their taxes ;-).

In this post I’ll try to explain the paper ‘Context-Based Entity Linking – University Of Amsterdam at TAC 2012’ I wrote with Edgar Meij, Tom Kenter, Marc Bron and Maarten de Rijke. It will also hopefully provide some basic understanding of machine learning.

Paper: ‘Context-Based Entity Linking – University Of Amsterdam at TAC 2012’ (131.24 KB)
Poster: Here

Entity Linking

Entity linking is the task of linking a word in a piece of text, to an ‘entity’ or ‘concept’ from a knowledge base (think: Wikipedia). Why would we want to? Because it allows us to automatically detect what is being talked about in a document, as opposed to seeing what words it is composed of. It allows us to generate extra context, it allows us to generate metadata which can improve searching and archiving. It moves one step beyond simple word-based analysis. We want that.

entity linking example

The Text Analysis Conference is a yearly ‘benchmark event’, where a dataset is provided (lots of documents, a knowledge base, and a list of queries, words or ‘entity mentions’ that occur in the documents). I describe the task in more detail here. We participated in this track, by building and modifying a system that was created for entity-linking tweets.

We start by taking our query and searching the knowledge base for entities that match it. Let’s take an example:

Query: Tank
Reference document:

Chicago Bears defensive tackle Tank Johnson was sentenced to jail on Thursday for violating probation on a 2005 weapons conviction. A cook county Judge gave Johnson a prison sentence that the Chicago Tribune reported on its website to be 120 days. According to the report, Johnson also was given 84 days of home confinement and fined 2,500 dollars. Johnson, who has a history of gun violations, faced up to one year in prison. “We continue our support of Tank and he will remain a member of our football team,” the Bears said in a statement. “Tank has made many positive changes to better his life. We believe he will continue on this path at the conclusion of his sentence.” A 2004 second-round pick, Johnson pleaded guilty to a misdemeanor gun possession charge in November 2005 and was placed on 18 months probation. Johnson was arrested December 14 when police raided his home and found three handguns, three rifles and more than 500 rounds of ammunition. He pleaded guilty on January 10 to 10 misdemeanor charges stemming from the raid. Two days after his arrest, Johnson was at a Chicago nightclub when his bodyguard and housemate, Willie B. Posey, was killed by gunfire. Johnson required special permission from the court to travel out of state to play in the Super Bowl in Miami on February 4.

For the sake of simplicity, let’s assume we find two candidate entities in our knowledge base: Tank (Military vehicle) and Tank Johnson (football player). For each query-candidate pair, we calculate different statistics, or features. For example, some features of our initial ‘microblog post entity linker’ could be:

  1. How similar is the query to the title of the candidate?
  2. Does the query occur in the candidate title?
  3. Does the candidate title occur in the query?
  4. How frequently does the query occur in the text of the candidate?

For our example, this would be:

Tank Johnson

  1. Tank and Tank Johnson share 4 letters, and differ 7 (7)
  2. Yes (1)
  3. No (0)
  4. 3 times (6)


Tank

  1. Tank and Tank share 4 letters, and differ none (0)
  2. Yes (1)
  3. Yes (1)
  4. 20 times (20)

Given these features, we generate vectors of their values like so:

| Query | Candidate    | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
|-------|--------------|-----------|-----------|-----------|-----------|
| Tank  | Tank Johnson | 7         | 1         | 0         | 6         |
| Tank  | Tank         | 0         | 1         | 1         | 20        |
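If you like code better than tables, the four example features can be sketched as follows. This is an illustration, not our actual feature extractor: I use edit distance on the letters only as a stand-in for “how similar is the query to the title”, and the function names are made up.

```python
import re


def letters(text):
    """Lowercase the text and keep only the letters a-z."""
    return re.sub(r"[^a-z]", "", text.lower())


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def features(query, title, candidate_text):
    """Feature vector for one query-candidate pair."""
    q, t = query.lower(), title.lower()
    return [
        edit_distance(letters(query), letters(title)),  # 1. how different are they?
        int(q in t),                                    # 2. query occurs in title
        int(t in q),                                    # 3. title occurs in query
        candidate_text.lower().count(q),                # 4. query frequency in text
    ]
```

For the query Tank and the candidate Tank Johnson, the first three values come out as 7, 1 and 0, just like in the list above. The fourth depends on the text of the candidate’s page.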

Given these features (of a multitude of examples, as opposed to just 2), we ‘train’ a machine learning algorithm. Such an algorithm aims to learn patterns from the data it receives, allowing it to predict new data. To train this algorithm, we label our examples, by assigning classes to them.

In our case we have two classes: correct examples (class: 1) and incorrect examples (class: 0). This means that for a machine learning approach, we need ground truth to train our algorithm with: examples of query-candidate pairs where we know which candidate is correct. Typically, we use data from previous years of the same task to train our system on. In our example case, we know the correct entity is the 1st, so we label this as ‘correct’. The other entity(ies) is labelled ‘incorrect’.

So, that’s the general approach, but what’s new, in our approach?

Add Context

We extended our initial approach with two methods that use information from the reference document (as you might have noticed, the previous features were mostly about the query, and the candidate, ignoring the reference document). In this post, I’ll talk about one of those two.

Hyperlinks = related entities


This approach takes advantage of the ‘structure’ of our knowledge base, in our case: hyperlinks between entities. For example: the text of Tank Johnson contains links to other entities like Chicago Bears (the football team Tank played for), Gary, Indiana (Tank’s place of birth), Excalibur nightclub (the place where Tank was arrested), etc. The page for the military vehicle Tank contains links to main gun, battlefield, Tanks in World War I, etc.

We assume there exists some semantic relationship between these entities and Tank Johnson (we don’t care about how exactly they are related), and we try to take advantage of this by seeing whether these ‘related entities’ occur in the reference document. The assumption is that if we find a lot of related entities for a candidate entity, it is likely to be the correct entity.

We generate features that are about these related entities in the reference document. For example, how many related entities do we find? What’s the proportion of these entities over the total amount of related entities of the candidate? We do so by searching the reference documents for surface forms of the related entities: titles of related entities, but also anchor texts (the text in blue, above), which allows us to calculate statistics to approximate the likelihood of a surface form actually linking to the entity we assume it links to. For our example, this would result in the following discovered related entities:
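A stripped-down sketch of these related-entity features is shown below. It assumes we already have, for each related entity of a candidate, a list of surface forms (titles and anchor texts), and it matches them with naive lowercase substring search. The function name and the toy data in the usage note are made up, and the anchor-text statistics are left out.

```python
def related_entity_features(document, related_entities):
    """Count which related entities of a candidate occur in the document.

    `related_entities` maps an entity title to its known surface forms.
    """
    text = document.lower()
    found = {
        entity for entity, forms in related_entities.items()
        if any(form.lower() in text for form in forms)
    }
    total = len(related_entities)
    return {
        "found": sorted(found),
        "n_found": len(found),
        "proportion_found": len(found) / total if total else 0.0,
    }
```

For instance, with the related entities `{"Chicago Bears": ["Chicago Bears", "Bears"], "Gary, Indiana": ["Gary, Indiana"]}` and the first sentence of the reference document, only Chicago Bears is found, which gives a proportion of one half.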

Screen Shot 2013-02-02 at 15.14.51 PM

The document is clearly about Tank Johnson. However, in this example, we see plenty of surface forms that support Tank (in green). Since Tank was convicted for gun possession, we find lots of references to weapons and arms. In this case Tank might look like a correct link too.

However, this is where the machine learning comes in. It’s not about which entity has the most related entities (even though this is an intuition behind the approach), but it’s about patterns that emerge after having had enough examples. Patterns that might not be directly obvious for us, mere humans. Remember that in a typical approach there are lots of examples (for our submission we had on average around 300 entity candidates per query). And there are similarly lots of features (again, for our submission we calculated around 75 features per query-entity pair). Either way, to cut a long story short, our context-based approach managed to correctly decide that Tank Johnson is the correct entity in this example case!