ECIR 2014 Press Release

Together with the UvA press office, we wrote a press release announcing our upcoming conference. Check it out below.

English (translated through UvA) follows Dutch (original).

Nieuwe inzichten en ontwikkelingen in zoekmachinetechnologie

European Conference on Information Retrieval

Wat kan een zoekmachine – op basis van wat je zoekt en waar je op klikt – afleiden over je identiteit en gedrag? Hoe kan ‘gamification’ ingezet worden om zoekmachines te verbeteren? En welke rol speelt het verzamelen en toegankelijk maken van verschillende datastromen in de stad van de toekomst? Deze en andere vragen worden beantwoord tijdens de 36e ‘European Conference on Information Retrieval’ (ECIR ‘14).

De conferentie, die dit jaar plaatsvindt van 13 tot en met 16 april in Amsterdam, brengt de internationale top van onderzoekers op het terrein van information retrieval (zoekmachinetechnologie) samen. Aan bod komen onderwerpen als personalisatie van zoekresultaten, recommender systems (aanbevelingssystemen), datamining in sociale media, en eCommerce en product search. Eugene Agichtein (Emory University, VS) opent ECIR ’14 met een keynote waarin hij ingaat op het afleiden van intenties en gedrag van internetgebruikers uit hun interacties met zoekmachines.

Technologische innovaties

Toegang tot (big) data en kostbare infrastructuren worden steeds belangrijker in de information retrieval. In een paneldiscussie belichten prominenten uit zowel het bedrijfsleven als de wetenschap de huidige stand van zaken en toekomstige ontwikkelingen in het onderzoeksveld. De industry day op woensdag 16 april wordt geopend met een keynote door Gilad Mishne (hoofd van het zoekteam van Twitter) over real-time zoeken op Twitter. Vervolgens presenteren (inter)nationale bedrijven, waaronder Yahoo! en eBay, hun laatste technologische innovaties.

Nederland is één van de meest vooraanstaande landen als het gaat om wetenschappelijk onderzoek in de information retrieval. De organisatie van ECIR ‘14 ligt in handen van het Intelligent Systems Lab Amsterdam (ISLA) van de Universiteit van Amsterdam, met ondersteuning van onder meer zoekgiganten als Microsoft, Yahoo!, Yandex en Google.

Locatie

Hotel Casa 400
Eerste Ringdijkstraat 4
1097 BC Amsterdam

New insights and developments in search engine technology

European Conference on Information Retrieval

What can a search engine deduce about your identity and habits based on the topics you search and select? How can gamification be used to improve search engines? And what role will the collection and provision of access to diverse data flows play in the city of the future? These are just a few of the questions to be addressed during the 36th European Conference on Information Retrieval (ECIR 14).

Set to take place on 13-16 April, the conference will bring international frontrunners in the field of information retrieval (search engine technology) together in Amsterdam. Topics to be covered include the personalisation of search results, recommender systems, data mining in social media, and eCommerce and product search. ECIR 14 will open with a keynote address by Eugene Agichtein (Emory University, USA) explaining how the intentions and habits of Internet users can be deduced from their search engine interactions.

Technological innovations

Access to big data and high-cost infrastructures is becoming an increasingly important factor in information retrieval. In a panel discussion, leading names in business and science will shed light on the current state of play and what research in this field has in store. The special industry day on Wednesday, 16 April will open with a keynote address by Gilad Mishne (head of the Twitter search team) on real-time search on Twitter. This will be followed by presentations by various Dutch and international companies, including Yahoo! and eBay, about their own latest technologies.

The Netherlands is one of the pioneers in worldwide scientific research into information retrieval. The ECIR 14 is being organised by the University of Amsterdam’s Intelligent Systems Lab Amsterdam (ISLA) with support from search engine giants such as Microsoft, Yahoo!, Yandex and Google.

Time and location

Time: 09:00 Sunday, 13 April – 17:00 Wednesday, 16 April
Location: Hotel Casa 400, Eerste Ringdijkstraat 4, Amsterdam

Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams

Title: Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams
Author: David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
Publication type: Full paper
Conference name: 36th European Conference on Information Retrieval (ECIR ’14)
Conference location: Amsterdam, The Netherlands
Abstract: The manual curation of knowledge bases is a bottleneck in fast paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show how our method significantly outperforms a lexical-matching baseline, by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and textual quality of input documents.
Full paper: PDF [256 KB]

Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams

*Update*

Several Dutch media have picked up our work.

Slides of my talk at #ECIR2014 are now up on Slideshare.

*Original post*

Our paper “Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams” with Manos Tsagkias, Lars Buitinck, and Maarten de Rijke got accepted as a full paper to ECIR 2014!

Download a pre-print: Graus, D., Tsagkias, E., Buitinck, L., & de Rijke, M., “Generating pseudo-ground truth for predicting new concepts in social streams,” in 36th European Conference on Information Retrieval (ECIR’14), 2014. [PDF, 258KB]

Layman explanation

This blog post is intended as a high-level overview of what we did. Remember my last post on entity linking? In this paper we want to do entity linking on entities that are not (yet) on Wikipedia, or:

Recognizing (finding) and classifying (determining their type: persons, locations or organizations) unknown (not in the knowledge base) entities on Twitter (this is where we want to find them)

These entities might be unknown because they are newly surfacing (e.g. a new popstar that breaks through), or because they are so-called ‘long tail’ entities (i.e. very infrequently occurring entities).

Method

To detect these entities, we generate training data to train a supervised named-entity recognizer and classifier (NERC). Training data is hard to come by: it is expensive to have people manually label Tweets, and you need enough of these labels to make it work. We automate this process by using the output of an entity linker to label Tweets. The advantage is that this makes it very cheap and easy to create a large set of training data. The disadvantage is that there might be more noise: wrong labels, or bad Tweets that do not contain enough information to learn patterns to recognize the types of entities we are looking for.
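To make the labeling step a bit more concrete, here is a minimal sketch of how entity linker output could be turned into token-level training labels. It assumes the linker returns token spans with an entity type; the `LinkedMention` structure, its field names and the BIO tagging scheme are illustrative assumptions, not necessarily the format used in the paper.

```python
from dataclasses import dataclass


@dataclass
class LinkedMention:
    """Hypothetical entity linker output for a single mention."""
    start: int        # index of the first token of the mention
    end: int          # index one past the last token of the mention
    entity_type: str  # e.g. "PER", "LOC" or "ORG"


def to_bio_tags(tokens, mentions):
    """Convert linked mentions into per-token BIO labels for NERC training."""
    tags = ["O"] * len(tokens)
    for mention in mentions:
        tags[mention.start] = "B-" + mention.entity_type
        for i in range(mention.start + 1, mention.end):
            tags[i] = "I-" + mention.entity_type
    return tags


# Made-up example Tweet and linker output, for illustration only.
tokens = "Toni Braxton plays Amsterdam tonight".split()
mentions = [LinkedMention(0, 2, "PER"), LinkedMention(3, 4, "LOC")]
tags = to_bio_tags(tokens, mentions)
print(list(zip(tokens, tags)))
```

Every token the linker does not cover gets the "O" (outside) label, which is exactly where missing labels creep in when the linker does not know an entity.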

To address this latter obstacle, we apply several methods to select Tweets that we deem ‘nice’. One of these methods involves scoring Tweets based on their noise. We used very simple features to determine this ‘noise level’ of a Tweet: among others, how many mentions (@’s), hashtags (#’s) and URLs it contains, but also the ratio of upper-case to lower-case letters, the average word length, the Tweet’s length, etc. An example of this Twitter noise score is below (these are Tweets from the TREC 2011 Microblog corpus we used):

Top 5 quality Tweets

  1. Watching the History channel, Hitler’s Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity.
  2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction.
  3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c)
  4. These late spectacles were about as representative of the real West as porn movies are of the pizza delivery business Que LOL
  5. Sudan’s split and emergence of an independent nation has politico-strategic significance. No African watcher should ignore this.

Top 5 noisy Tweets

  1. Toni Braxton ~ He Wasnt Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_
  2. tell me what u think The GetMore Girls, Part One _URL_
  3. this girl better not go off on me rt
  4. you done know its funky! — Bill Withers “Kissing My Love” _URL_ via _Mention_
  5. This is great: _URL_ via _URL_
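The noise features mentioned above are simple enough to sketch in a few lines. The snippet below only extracts the features named in this post; the function name, the regular expressions and the dictionary keys are my own assumptions, and how the features are combined into a single score is deliberately left out, since this post does not spell that out.

```python
import re


def noise_features(tweet):
    """Extract simple surface features that hint at how noisy a Tweet is."""
    words = tweet.split()
    upper = sum(c.isupper() for c in tweet)
    lower = sum(c.islower() for c in tweet)
    return {
        "mentions": len(re.findall(r"@\w+", tweet)),
        "hashtags": len(re.findall(r"#\w+", tweet)),
        "urls": len(re.findall(r"https?://\S+", tweet)),
        # Guard against division by zero for Tweets without lower-case letters.
        "upper_lower_ratio": upper / lower if lower else 0.0,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "length": len(tweet),
    }


# Made-up example Tweet, for illustration only.
example = "RT @user check this #cool #nice http://t.co/abc"
features = noise_features(example)
print(features)
```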

In addition, we filter Tweets based on the confidence score of the entity linker, so as not to include Tweets that contain unlikely labels.
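A rough sketch of such a confidence filter is shown here. The threshold of 0.5, the rule that every mention in a Tweet must pass it, and the confidence values in the example are all assumptions made for illustration.

```python
def filter_by_confidence(labeled_tweets, threshold=0.5):
    """Keep Tweets in which every linked mention meets the confidence threshold."""
    kept = []
    for text, mentions in labeled_tweets:
        # mentions is a list of (surface form, linker confidence) pairs.
        if mentions and all(conf >= threshold for _, conf in mentions):
            kept.append((text, mentions))
    return kept


# Confidence scores below are invented for the example.
labeled_tweets = [
    ("Sudan's split has strategic significance", [("Sudan", 0.92)]),
    ("this girl better not go off on me rt", [("girl", 0.12)]),
    ("This is great", []),
]
kept = filter_by_confidence(labeled_tweets)
print([text for text, _ in kept])
```

Tweets without any linked mention are dropped here as well, since they carry no label to learn from.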

Experimental Setup

It is difficult to measure how well we do in finding entities that do not exist on Wikipedia, since we need some sort of ground truth to determine whether we did well or not. As we cannot manually check for 80,000 Tweets whether the identified entities are in or out of Wikipedia, we take a slightly theoretical approach.

If I were to put it in a picture (and I did, conveniently), it’d look like this:

[Figure: method]

In brief, we take small ‘samples’ of Wikipedia: one such sample represents the “present KB”, the initial state of the KB. The samples are created by removing X% of the Wikipedia pages (from 10% to 90%, in steps of 10). We then label Tweets using the full KB (100%) to create the ground truth: this full KB represents the “future KB”. Our “present KB” then labels the Tweets it knows, and uses the Tweets it cannot link as sources for new entities. If then the NERC (trained on the Tweets labeled by the present KB) manages to identify entities in the set of “unlinkable” Tweets, we can compare the predictions to the ground truth, and measure performance.
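In code, the sampling setup could look roughly like the sketch below. It is a simplification under stated assumptions: the KB is just a set of page titles, a Tweet counts as "unlinkable" as soon as it mentions an entity missing from the present KB, and the function names, the fixed random seed and the toy data are made up for the example.

```python
import random


def sample_present_kb(full_kb, removed_percent, seed=0):
    """Return the 'present KB': the full KB with removed_percent% of its pages removed."""
    rng = random.Random(seed)
    pages = sorted(full_kb)  # sort so the sample is reproducible
    n_keep = len(pages) * (100 - removed_percent) // 100
    return set(rng.sample(pages, n_keep))


def split_tweets(ground_truth, present_kb):
    """Split Tweets into those the present KB can link and those it cannot.

    ground_truth maps Tweet text to the entities linked with the full ('future') KB.
    For unlinkable Tweets we keep only the entities the NERC should discover.
    """
    linkable, unlinkable = {}, {}
    for text, entities in ground_truth.items():
        if entities <= present_kb:
            linkable[text] = entities
        else:
            unlinkable[text] = entities - present_kb
    return linkable, unlinkable


# Toy KB and ground truth, for illustration only.
full_kb = {"Amsterdam", "Twitter", "Toni Braxton", "Bill Withers", "Sudan",
           "Hitler", "Yahoo", "eBay", "Google", "Yandex"}
ground_truth = {
    "Bill Withers live in Amsterdam": {"Bill Withers", "Amsterdam"},
    "Sudan votes": {"Sudan"},
}
present_kb = sample_present_kb(full_kb, removed_percent=30)
linkable, unlinkable = split_tweets(ground_truth, present_kb)
print(len(present_kb), sorted(linkable), sorted(unlinkable))
```

Running this for `removed_percent` in `range(10, 100, 10)` would give the nine present KBs described above.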

Results & Findings

We report on standard metrics: Precision & Recall, on two levels: entity and mention level. However, I won’t go into any details here, because I encourage you to read the results and findings in the paper.
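For readers who want a quick reminder of how these metrics work, here is a small sketch. It assumes one plausible reading of the two levels: at mention level every (Tweet, entity) occurrence counts, while at entity level each unique entity counts once. The data is made up, and the paper remains the reference for the exact definitions.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of predictions against a set of gold items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Mentions are (tweet id, entity) pairs; the values are invented for the example.
gold_mentions = {(1, "A"), (2, "A"), (3, "B")}
predicted_mentions = {(1, "A"), (3, "C")}

mention_level = precision_recall(predicted_mentions, gold_mentions)

# Entity level: collapse mentions to the unique entities they refer to.
gold_entities = {entity for _, entity in gold_mentions}
predicted_entities = {entity for _, entity in predicted_mentions}
entity_level = precision_recall(predicted_entities, gold_entities)

print(mention_level, entity_level)
```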

ECIR 2014 in Amsterdam!

ECIR 2014 Logo, design by Rutger de Vries/Perongeluk

The Intelligent Systems Lab Amsterdam (ISLA) at the University of Amsterdam has been awarded the hosting of the European Conference on Information Retrieval (ECIR) in 2014. The ‘local’ organization of this big conference (tasks such as arranging a venue, keeping an eye on finances, arranging a social event, arranging accommodation for conference attendees, etc.) will be largely in the hands of me and my fellow PhD candidate-colleagues. In my opinion, this is quite an awesome way of getting some experience in all the aspects involved in organizing such an event.

Within this ECIR2014 ‘local team’, I am entrusted with PR/Communication tasks. The first task I’ve done is getting a website up. It lives here: ecir2014.org. The site is designed by Rutger de Vries/Perongeluk and subsequently ever so subtly destroyed into usability by me and colleagues. Another PR task is Twitter, as we have full confidence in Twitter remaining the number one micro-blogging platform in 2014 ;-). So, you know what’s left to do: follow @ECIR2014, like ECIR 2014 & visit ECIR2014.org. And put it down in your agenda: April 13th to 17th, 2014. See you there!