David's blog

PhD Candidate Semantic Search in eDiscovery

ECIR 2014 Press Release

Thursday, April 3, 2014

Together with the press office (persvoorlichting) of the UvA, we wrote a press release announcing our upcoming conference; check it out below.

The English version (translated by the UvA from the Dutch original) is below.

New insights and developments in search engine technology (link)

European Conference on Information Retrieval

What can a search engine deduce about your identity and habits based on the topics you search and select? How can gamification be used to improve search engines? And what role will the collection and provision of access to diverse data flows play in the city of the future? These are just a few of the questions to be addressed during the 36th European Conference on Information Retrieval (ECIR 14).

Set to take place on 13-16 April, the conference will bring international frontrunners in the field of information retrieval (search engine technology) together in Amsterdam. Topics to be covered include: the personalisation of search results, recommender systems, product search and data mining in social media and eCommerce. Opening the ECIR 14 will be Eugene Agichtein (Emory University, USA) with a keynote address explaining how the intentions and habits of Internet users can be deduced from their search engine interactions.

Technological innovations

Access to big data and high-cost infrastructures is becoming an increasingly important factor in information retrieval. In a panel discussion, leading names in business and science will shed light on the current state of play and what research in this field has in store. The special industry day on Wednesday, 16 April will open with a keynote address by Gilad Mishne (head of the Twitter search team) on real-time search on Twitter. This will be followed by presentations by various Dutch and international companies, including Yahoo! and eBay, about their own latest technologies.

The Netherlands is one of the pioneers in worldwide scientific research into information retrieval. The ECIR 14 is being organised by the University of Amsterdam’s Intelligent Systems Lab Amsterdam (ISLA) with support from search engine giants such as Microsoft, Yahoo!, Yandex and Google.

Time and location

Time: 09:00 Sunday, 13 April – 17:00 Wednesday, 16 April
Location: Hotel Casa 400, Eerste Ringdijkstraat 4, Amsterdam

Information Retrieval at LegalTech 2014

Thursday, February 6, 2014

Thanks to the kind lady at the registration desk, I had the unexpected honor of representing the beautiful former Caribbean country of the Netherlands Antilles at LegalTech 2014, the self-proclaimed largest and most important legal technology event of the year.

David Graus at LegalTech

LegalTech is an “industry conference” where attorneys, lawyers, and IT people meet up to discuss the current and future state of law and IT. Product vendors show their software and tools aimed at making the life of the modern-day attorney easier. As I work on semantic search in eDiscovery, my reasons to attend (I was generously invited by Jason Baron) were:

  1. To get a better overview and understanding of eDiscovery (in the US).
  2. To see what people consider the ‘future’ or important topics within eDiscovery.
  3. To understand what the current state of the art is in tools and applications.
  4. (To plug semantic search)

Indeed, in summary, to retrieve information! (As an IR researcher does). The conference included keynotes, conference tracks, panel discussions and a huge exhibitor show where over 100 vendors of eDiscovery-related software present their products. All this fits on just three floors of the beautiful Hilton Midtown Hotel in the middle of New York.

To get a feel for the topics and themes: track titles included, among others, eDiscovery, Transforming eDiscovery, Big Data, Information Governance, Advanced IT, Technology in Practice, Technology and Trends Transforming the Legal World, and Corporate Legal IT.


LegalTech is a playground for attorneys and lawyers, not so much for PhD students who work on information extraction and semantic search. Needless to say, I was far from the typical attendee (possibly the most atypical). But LegalTech proved to be an informative and valuable crash course in eDiscovery for me (I think I can tick the boxes of all four of the aforementioned reasons for attending).


The keynotes gave me a better understanding of eDiscovery (among others by hearing some of the founders of the eDiscovery world), the panel discussions were very useful for understanding the open problems, challenges and future directions, and finally the trade show gave me a fairly complete overview of what is being built and used right now in terms of eDiscovery-supporting software.

I had varying success talking to vendors about the things I was interested in: the technology and algorithms behind the tools, and the choices for including or excluding certain features and functionalities. More often than not, an innocently nerdy question on my part was turned into a software sales pitch. To be fair, these people were there to sell, or at least show, so this was hardly unexpected.

The tracks: my observations

During the different tracks and panel discussions I attended, I noticed a couple of things. This is by no means a complete overview of what currently matters in eDiscovery, but a personal report of the things I found interesting or noteworthy:

Some of the recurring “open door” themes revolved around the man-versus-machine debate, trust in algorithms, the balance between computer-assisted and manual review, the intricacies of measuring algorithm performance, and where Moore’s law will take the legal world in 5-10 years. These are highly relevant issues for attorneys, lawyers and eDiscovery vendors, but things that I take for granted and consider the starting point (a default win for algorithms!). However, this debate is clearly not yet settled in this domain: while everyone accepts computer-assisted review as the unavoidable future, it remains unclear what exactly that future will look like.

On multiple occasions I heard video and image retrieval mentioned as important future directions for eDiscovery (good news for some colleagues at the University of Amsterdam down the hall). Also, the challenge of privacy and data ownership in a mobile world, where enterprise and personal data are mixed and spread across iPads, smartphones, laptops and clouds, was identified as a major future hurdle.

Finally, in the session titled “Have we Reached a ‘John Henry’ Moment in Evidentiary Search?”, the panelists (who included Jason Baron and Ralph Losey) touched upon using eDiscovery tools and algorithms for information governance. Currently, methods are being developed to detect, reconstruct, classify or find events of interest after the fact. Couldn’t these be used in a predictive setting instead of a retrospective one, learning to predict bad stuff before it happens? Interesting stuff.

The tradeshow: metadata-heavy


What I noticed particularly at the trade show was the large overlap in both the tools’ functionality and features and their looks and designs. What I found more striking, though, was the heavy focus on metadata. The tools typically use metadata such as timestamps, authors, and document types to let users drill down through a dataset, filtering for time periods, keywords, authors, or a combination of all of these.

Visualizations aplenty, the most frequent ones being Google-Ngrams-ish keyword histograms and networks (graphs) of interactions between people. What was shocking for an IR/IE person like myself is that typically, once users are done drilling down to a subset of documents, they are relegated to prehistoric keyword search to explore and understand the content of that set. Oh no!

But for someone who is spending four years of his life enabling semantic search in this domain, this isn’t worrying but rather promising! After talking to vendors I learned that plenty of them are interested in these kinds of features and functionalities, so there is definitely room for innovation here. (To be fair, whether the target users agree might be another question.)


Anyway, this ‘metadata heaviness’ is obviously a gross oversimplification and generalization, and there were definitely some interesting companies that stood out for me. Here’s a small, incomplete, and biased summary:

  • I had some nice talks with the folks at CatalystSecure, whose senior applied research scientist and former IR academic (Dr. Jeremy Pickens) was the ideal companion to be unashamedly nerdy with, talking about classification performance metrics, challenges in evaluating the “whole package” of the eDiscovery process, and awesome datasets.
  • RedOwl Analytics do some very impressive stuff with behavioural analytics. They collect statistics for each ‘author’ in their data (such as the number of emails sent and received, ‘time to respond’, and the number of times cc’ed) to establish an ‘average baseline’ for a single dataset (enterprise), which they can then use to recognize individuals who deviate from this average. The impressive part was that they were able to map these deviations to behavioural traits (such as the probability of an employee leaving the company, or, on the other side of the spectrum, identifying the ‘top employees’ who otherwise remain under the radar). How that works under the hood remains a mystery to me, but the type of questions they were able to answer in the demo was impressive.
  • Recommind‘s CORE platform seems to rely heavily on topic modeling, and was able to infer topics from datasets. In doing so, Recommind shows we can indeed move beyond keyword search in a real product (and outside of academic papers ;-) ). This doesn’t come as a surprise, seeing that Recommind’s CTO dr. Jan Puzicha is of probabilistic latent semantic indexing (/analysis) fame.
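As a side note, the baseline-deviation idea behind RedOwl’s demo can be illustrated with a toy sketch (made-up author names and a single feature; a real system would combine many behavioural features per author):

```python
from statistics import mean, stdev

def deviant_authors(stats, threshold=2.0):
    """Flag authors whose value deviates from the dataset-wide average
    by more than `threshold` standard deviations."""
    values = list(stats.values())
    mu, sigma = mean(values), stdev(values)
    return {author for author, v in stats.items() if abs(v - mu) > threshold * sigma}

# Emails sent per author in a fictional enterprise dataset.
emails_sent = {"alice": 40, "bob": 45, "carol": 42, "dan": 41,
               "erin": 44, "frank": 43, "grace": 39, "dave": 300}
print(deviant_authors(emails_sent))  # {'dave'}
```

Mapping such deviations to behavioural traits, as RedOwl does, is of course the hard (and proprietary) part.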

What’s next?

As I hinted at before, I’m missing some more content-heavy functionalities, e.g., (temporal) entity and relation extraction, identity normalization, and maybe (multi-document) summarization? Conveniently, this is exactly what my group and I are working on! I suppose the eDiscovery world just doesn’t know what it’s missing, yet ;-).

Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams

Tuesday, December 3, 2013

Our paper “Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams” with Manos Tsagkias, Lars Buitinck, and Maarten de Rijke got accepted as a full paper to ECIR 2014!

Download a pre-print: Graus, D., Tsagkias, E., Buitinck, L., & de Rijke, M., “Generating pseudo-ground truth for predicting new concepts in social streams,” in 36th European Conference on Information Retrieval (ECIR ’14), 2014. [PDF, 258KB]


The manual curation of knowledge bases is a bottleneck in fast paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show how our method significantly outperforms a lexical-matching baseline, by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and textual quality of input documents.

Layman explanation

This blog post is intended as a high level overview of what we did. Remember my last post on entity linking? In this paper we want to do entity linking on entities that are not (yet) on Wikipedia, or:

Recognizing (finding) and classifying (determining their type: persons, locations or organizations) unknown (not in the knowledge base) entities on Twitter (this is where we want to find them)

These entities might be unknown because they are newly surfacing (e.g. a new popstar that breaks through), or because they are so-called ‘long tail’ entities (i.e. very infrequently occurring entities).


To detect these entities, we generate training data to train a supervised named-entity recognizer and classifier (NERC). Training data is hard to come by: it is expensive to have people manually label Tweets, and you need enough labels to make it work. We automate this process by using the output of an entity linker to label Tweets. The advantage is that this is a very cheap and easy way to create a large set of training data. The disadvantage is that there may be more noise: wrong labels, or bad Tweets that do not contain enough information to learn patterns for recognizing the types of entities we are looking for.
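As a minimal sketch of this labeling step (the toy “linker” dictionary, surface forms, and types below are hypothetical stand-ins for a real entity linker’s output):

```python
# Pseudo-ground-truth generation: use an entity linker's output to label
# tweets with entity types, yielding cheap training data for a NERC.
TOY_LINKER = {
    # surface form -> (knowledge base id, entity type)
    "amsterdam": ("KB:Amsterdam", "LOC"),
    "twitter": ("KB:Twitter", "ORG"),
}

def label_tweet(tokens):
    """Assign each token the entity type the linker found, or 'O' (outside)."""
    return [TOY_LINKER.get(tok.lower(), (None, "O"))[1] for tok in tokens]

print(label_tweet(["Watching", "a", "talk", "in", "Amsterdam"]))
# ['O', 'O', 'O', 'O', 'LOC']
```

A real pipeline would handle multi-word mentions and emit IOB-style tags, but the principle is the same: automatic labels instead of expensive manual annotation.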

To address this latter obstacle, we apply several methods to filter out Tweets we deem ‘nice’. One of these involves scoring Tweets based on their noise. We use very simple features to determine this ‘noise level’: among others, how many mentions (@’s), hashtags (#’s) and URLs a Tweet contains, but also the ratio of upper-case to lower-case letters, the average word length, the Tweet’s length, etc. Examples of this Twitter noise score are below (these are Tweets from the TREC 2011 Microblog corpus we used):

Top 5 quality Tweets

  1. Watching the History channel, Hitler’s Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity.
  2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction.
  3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c)
  4. These late spectacles were about as representative of the real West as porn movies are of the pizza delivery business Que LOL
  5. Sudan’s split and emergence of an independent nation has politico-strategic significance. No African watcher should ignore this.

Top 5 noisy Tweets

  1. Toni Braxton ~ He Wasnt Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_
  2. tell me what u think The GetMore Girls, Part One _URL_
  3. this girl better not go off on me rt
  4. you done know its funky! — Bill Withers “Kissing My Love” _URL_ via _Mention_
  5. This is great: _URL_ via _URL_
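A minimal sketch of such a noise score (the features mirror those listed above; the weighting is entirely made up for illustration):

```python
import re

def noise_score(tweet):
    """Crude 'noise level' of a tweet from simple surface features;
    higher means noisier. Weights are illustrative only."""
    tokens = tweet.split()
    n_mentions = sum(t.startswith("@") for t in tokens)
    n_hashtags = sum(t.startswith("#") for t in tokens)
    n_urls = len(re.findall(r"https?://\S+", tweet))
    letters = [c for c in tweet if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    avg_word_len = sum(len(t) for t in tokens) / max(len(tokens), 1)
    # Mention/hashtag/URL-heavy, shouty, short-worded tweets score as noisy.
    return n_mentions + n_hashtags + n_urls + upper_ratio - min(avg_word_len / 10, 1)

clean = "Sudan's split and emergence of an independent nation has strategic significance."
noisy = "RT @someone #wow #cool http://t.co/xyz"
assert noise_score(noisy) > noise_score(clean)
```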

In addition, we filter Tweets based on the confidence score of the entity linker, so as not to include Tweets that contain unlikely labels.

Experimental Setup

It is difficult to measure how well we do in finding entities that do not exist on Wikipedia, since we need some sort of ground truth to determine whether we did well or not. As we cannot manually check for 80,000 Tweets whether the identified entities are in or out of Wikipedia, we take a slightly theoretical approach.

If I were to put it in a picture (and I did, conveniently), it’d look like this:


In brief, we take small ‘samples’ of Wikipedia: one such sample represents the “present KB”, the initial state of the KB. The samples are created by removing X% of the Wikipedia pages (from 10% to 90%, in steps of 10). We then label Tweets using the full (100%) KB to create the ground truth: this full KB represents the “future KB”. Our “present KB” then labels the Tweets it knows, and uses the Tweets it cannot link as sources for new entities. If the NERC (trained on the Tweets labeled by the present KB) manages to identify entities in the set of “unlinkable” Tweets, we can compare its predictions to the ground truth and measure performance.
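The sampling setup above can be sketched roughly as follows (a toy KB with hypothetical entity names; the real experiment samples Wikipedia pages):

```python
import random

def make_present_kb(future_kb, removed_fraction, seed=42):
    """Simulate the 'present KB' by removing a fraction of entities from
    the full ('future') KB; the removed entities play the role of
    concepts that do not exist yet."""
    rng = random.Random(seed)
    entities = sorted(future_kb)
    removed = set(rng.sample(entities, int(len(entities) * removed_fraction)))
    return set(entities) - removed

future_kb = {f"entity_{i}" for i in range(100)}   # stand-in for full Wikipedia
present_kb = make_present_kb(future_kb, 0.3)      # 30% of pages "don't exist yet"
new_entities = future_kb - present_kb             # ground truth: future concepts
assert len(present_kb) == 70 and len(new_entities) == 30
```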

Results & Findings


We report on standard metrics, precision and recall, at two levels: entity level and mention level. I won’t go into any details here, because I encourage you to read the results and findings in the paper.

What to post next?

Monday, December 2, 2013

Hello, an awesome day to all of you. Adam is offering me a guest post. What to pick, what to pick?


Hope you are having an awesome day!

I would like to express my interest to submit a compelling  guest post.

All our articles are visually appealing & written with care and love and detailed research.

To get a feel for how I write, see the below posts I did for some quite authoritative sites:

- http://www.www.some_random_website.com/2013/09/17/apple-introduces-fresh-iphone-models/
- http://www.some_random_website.com/social-media/2013/09/17/how-to-optimize-your-landing-pages-for-facebook-traffic/
- http://www.some_random_website.com/internet-marketing/url-shorteners.html

Here are the articles I have available as of now.


??? This sheet is updated daily at 8am so you can check back anytime and request more articles ANYTIME ???

Which one would you be most interested in?


Adam Prattler

Disclaimer & Important rules:

All content remains the sole property of Adam’s website and is only lend to you on condition of our link being placed in either the author bio or body.

My reply:

Thanks for expressing your interest in submitting a compelling guest post. It has been duly noted!

You have an awesome day too!


SEO just got more scary.

yourHistory – Entity linking for a personalized timeline of historic events

Saturday, September 14, 2013

Download a pre-print of Graus, D., Peetz, M-H., Odijk, D., de Rooij, O., & de Rijke, M., “yourHistory — Semantic linking for a personalized timeline of historic events,” in CEUR Workshop Proceedings, 2014.

Update #1

I presented yourHistory at ICT.OPEN 2013:


The slides of my talk are up on SlideShare:

yourHistory – entity linking for a personalized timeline of historic events from David Graus

And we got nominated for the “Innovation & Entrepreneurship Award” there! (sadly, didn’t win though ;) ).


Original Post

yourHistory - OKConference poster

For the LinkedUp Challenge Veni competition at the Open Knowledge Conference (OKCon), we (Maria-Hendrike Peetz, me, Daan Odijk, Ork de Rooij and Maarten de Rijke) created yourHistory: a Facebook app that uses entity linking to generate a personalized timeline of historic events (using d3.js). Our app got shortlisted (top 8 out of 22 submissions) and is in the running for the first prize of 2,000 euro!

Read a small abstract here:

In history we often study dates and events that have little to do with our own life. We make history tangible by showing historic events that are personal and based on your own interests (your Facebook profile). Often, those events are small-scale and escape the history books. By linking personal historic events to global events, we link your life with global history: writing your own personal history book.

Read the full story here.

And try out the app here!

It’s currently still a little rough around the edges. There’s an extensive to-do list, but if you have any feedback or remarks, don’t hesitate to leave me a message below!

Please vote for “cyclodrivers at work”

Saturday, June 22, 2013


#ilps wordle

Wednesday, June 19, 2013

Top 50 most frequently used words in our group’s private IRC channel, collected between 2013-01-28 and today. The chat has been filtered for nicknames and common stopwords in both Dutch and English.

To be expected: a mix of Dutch and English, the words lunch, coffee, paper, people, uva… did anyone say beer?


How many things took place between 1900 and today? DBPedia knows

Friday, June 7, 2013

For a top-secret project, I am looking at retrieving all entities that represent a ‘(historic) event’, from DBPedia.

Now I could rant about how horrible it is to formulate a ‘simple’ query like this using the structured yet anarchistic Linked Data format, so I will: the request “give me all entities that represent ‘events’ from DBPedia” takes me three SPARQL queries, since different predicates express the same thing, and I probably need a lot more to get a proper subset of the entities I’m looking for. Currently, I filter for entities that have a dbpedia-owl:date property, entities that have a dbprop:date property (yes, these predicates express the exact same thing), and entities that belong to the Event class.
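For the curious, the three filters could in principle be folded into one query along these lines (a hedged sketch using the prefixes as exposed on the DBpedia endpoint; I have not verified it against a live endpoint, so treat it as illustrative):

```sparql
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbprop:      <http://dbpedia.org/property/>

SELECT DISTINCT ?event WHERE {
  { ?event dbpedia-owl:date ?d }     # entities with an ontology date
  UNION
  { ?event dbprop:date ?d }          # same property, different predicate
  UNION
  { ?event a dbpedia-owl:Event }     # members of the Event class
}
```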

Anyway, if we count for each year how many event entities there are, we get the following graph:


Which is interesting, because it shows how there are loads of events in the near past, around WWII, and around WWI. I could now say something about how interesting it is that our collective memory is focused on the near past, but then I looked at the events and saw loads of sports events, so I won’t; instead I’ll say that back in the day we were terrible at organizing sports events. Still, the knowledge that between 1900 and today a total of 16,589 events happened seems significant to me.


Monday, June 3, 2013


We won the WoLE2013 Challenge

Tuesday, May 14, 2013

With our SemanticTED demo, we (Daan Odijk, Edgar Meij, Tom Kenter and me) won the Web of Linked Entities 2013 Workshop’s “Doing Good by Linking Entities” Developers Challenge (at WWW2013).

Read the paper of our submission here:

Now we get to share an iPad.