Read a pre-print of our paper below:
In our latest paper we study the problem of entity ranking. In search engines, people often search for entities; real-life “things” (people, places, companies, movies, etc.). Google, Bing, Yahoo, DuckDuckGo, all big web search engines cater to this type of information need by displaying knowledge panels (they go by many names; but little snippets that show a summary of information related to an entity). You’ve seen this before, but if you haven’t, see the picture below;
One challenge in giving people the entities they search for is that of vocabulary mismatch; people use many different ways to search for entities. Well-formed queries like “Kendrick Lamar” may be a large chunk, but just as well, you’ll find people searching for “k.dot,” or even more abstract/descriptive queries when users do not exactly remember the name of who they are looking for.
Another example is when events unfold in the real world, e.g., Michael Brown being killed by cops in Ferguson. As soon as this happens, and news media starts reporting it, people may start looking for relevant entities (Ferguson) by searching for previously unassociated words, e.g., “police shooting missouri.”
A final example (also in our paper) is shown below. The entity Anthropornis has a small and matter-of-factual description on Wikipedia (it is a stub);
But on Twitter, Brody Brooks refers to this particular species of penguin in the following way;
Baddest motherfucking penguin there ever was. http://t.co/WBACdddL
— Brody Brooks (@BrodyBr) December 8, 2012
While putting profanity in research papers is not greatly appreciated, this tweet does illustrate our point: people do refer to entities in different (and rich!) ways. The underlying idea of our method is to leverage this for free, to close the gap between the vocabulary of people, and the (formal) language of the Knowledge Base. More specifically, the idea is to enable search engines to automagically incorporate changes in search behavior for entities (“police shooting + ferguson”), and different ways in how people refer to entities (bad penguins).
So how? We propose to “expand” entity descriptions by mining content from the web. I mean add words to documents to make it easier to find the documents. We collect these words from tweets, social tags, web anchors (links on webpages), and search engine queries, all of which are somehow associated with entities. So in the case of our Anthropornis-example, the next time someone were to search for the “baddest penguin there ever was,” Anthropornis will get ranked higher.
These type of methods (document expansion) have been studied before, but what sets our setting apart from previous work are two things;
- We study our method in a dynamic scenario, i.e., we want to see how external descriptions affect the rankings in (near) real-time; what happens if people start tweeting away about an entity? How do you make sure the entity description doesn’t get swamped with additional content? Next, we;
- Combine a large number of different description sources. Which allows us to study differences between signals (tags, tweets, queries, web anchors). Are different sources complementary? Is there’s redundancy across sources? Which type of source is more effective? etc.
As usual, I won’t go into the nitty gritty details of our experimental setup, modeling and results in this post. Read the paper for that (actually, the experimental setup details are quite nitty and gritty in this case). Let’s cut to the chase: adding external descriptions to your entity representation improves entity ranking effectiveness (badum-tss)!
Furthermore, it is important to assign individual weights to the different sources, as the sources vary a lot in terms of content (tweets and queries differ in length, quality, etc.). The expansions also vary across different entities (popular entities may receive many expansions, where less popular entities may not). To balance this, we inform the ranker of the number of expansions a certain entity has received. We address all the above issues by proposing different features for our machine learning model. Finally, we show that in our dynamic scenario, it is a good idea to (periodically) retrain your ranker to re-assess these weights.
What I find attractive about our method is that it’s relatively “cheap” and simple; you simply add content (= words) to your entity representation (= document) and retrieval improves! Even if you omit the fancy machine learning re-training (detailed in our paper). Anyway, for the full details, and more pretty plots like this one, do read our paper!