My PhD thesis, Entities of Interest — Discovery in Digital Traces, is now available for download. Click on the cover below to head to graus.nu/entities-of-interest and grab your electronic copy of the little booklet that took me 4+ years to write!

Our paper,
@inproceedings{graus2016analyzing,
author = {Graus, David and Bennett, Paul N. and White, Ryen W. and Horvitz, Eric},
title = {Analyzing and Predicting Task Reminders},
year = {2016},
isbn = {9781450343688},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2930238.2930239},
doi = {10.1145/2930238.2930239},
booktitle = {Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization},
pages = {7--15},
numpages = {9},
keywords = {prospective memory, reminders, log studies, intelligent assistant},
location = {Halifax, Nova Scotia, Canada},
series = {UMAP '16}
}
was awarded best student paper at UMAP 2016!
Update (16/07): This paper was awarded the James Chen Best Student Paper Award at UMAP!
Automated personal assistants such as Google Now, Microsoft Cortana, Siri, M and Echo aid users in productivity-related tasks, e.g., planning, scheduling and reminding tasks or activities. In this paper we study one such feature of Microsoft Cortana: user-created reminders. Reminders are particularly interesting as they represent the tasks that people are likely to forget. Analyzing and better understanding the nature of these tasks could prove useful in inferring the user's availability, and could aid in developing systems to automatically terminate ongoing tasks, allocate time for task completion, or pro-actively suggest (follow-up) tasks.
Studying the things that people tend to forget has a rich history in the field of social psychology. This type of memory is called “prospective memory” (or, more poetically: “Remembrance of Things Future”). One challenge in studying PM is that it’s hard to simulate in a lab study (the hammer of choice for social psychologists). For this reason, most studies of PM have been restricted to “event-based” PM, i.e., memories triggered by an event, modeled in the lab by having subjects perform a mundane task and do a special action when triggered by an event. Furthermore, the focus in these studies has largely been on retention and retrieval performance of “artificial” memories: subjects were typically given an artificial task to perform. Little is known about the type and nature of actual, real-world, “self-generated” tasks.
Enter Cortana. The user logs we study in this paper represent a rich collection of real-life, actual, self-generated, time-based PM instances, collected in the wild. Studying them in aggregate allows us to better understand the type of tasks that people remind themselves about.
(Yes, sorry, that heading really says big data…)
As the loyal reader may have guessed, this paper is the result of my internship at Microsoft Research last summer, and one of the (many) advantages of working at Microsoft Research is the access to big, beautiful (and restricted) data. In this paper we analyze 576,080 reminders, issued by 92,264 people over a period of two months (and we later run prediction experiments on 1.5M+ reminders over a six-month period). Note that this is a filtered set of reminders (among others, we restrict to a smaller geographic area and remove users that issued only a few reminders). Furthermore, when analyzing particular patterns, we filter the data to patterns commonly observed across multiple users, to study behavior in aggregate and further preserve user privacy: we are not looking at users’ behavior at the individual level, but across a large population, to uncover broad and more general patterns. So what do we do with these reminders? The paper consists of three main parts:
1. Task type taxonomy: First, we aim to identify common types of tasks that underlie reminder setting, by studying the most common reminders found in the logs. This analysis is partly data-driven and partly qualitative; as we are interested in ‘global usage patterns,’ we extract common reminders, defined as reminders that are seen across many users and contain a common ‘action’ or verb. We do so by identifying the most common verb phrases (and find 52 verbs that cover ~61% of the reminders in our logs), and proceed by manually labeling them into categories.
2. Temporal patterns: Next, we study temporal patterns of reminders, by looking at correlations between reminder creation and notification, and at temporal patterns for the terms in the reminder descriptions. We study two aspects of these temporal patterns: patterns in when we create and execute reminders (as a proxy to when people typically tend to think about/execute certain tasks), and the duration of the delay between the reminder’s creation and notification (as a proxy to how “far in advance” we tend to plan different things).
3. Predict! Finally, we show how the patterns we identify above generalize, by addressing the task of predicting the day at which a reminder is likely to trigger, given its creation time and the reminder description (i.e., terms). Understanding when people tend to perform certain tasks could be useful for better supporting users in the reminder process, including allocating time for task completion, or pro-actively suggesting reminder notification times, but also for understanding behavior at scale by looking at patterns in reminder types.
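To make that last part concrete, here is a minimal sketch of what such a prediction setup could look like: a classifier that predicts the weekday on which a reminder will trigger, from the reminder’s terms plus a couple of creation-time features. The data layout and feature set are hypothetical illustrations, not the exact setup from the paper.

# A minimal sketch (hypothetical data, not the paper's setup) of predicting the
# weekday a reminder triggers, from its description terms and creation time.
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def featurize(reminders, vectorizer, fit=False):
    """reminders: list of (description, created_at, notify_at) with datetime fields."""
    texts = [desc for desc, _, _ in reminders]
    X_text = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    # Creation-time features: weekday and hour at which the reminder was created.
    X_time = csr_matrix([[created.weekday(), created.hour]
                         for _, created, _ in reminders])
    return hstack([X_text, X_time])

def train_day_predictor(reminders):
    vectorizer = TfidfVectorizer(lowercase=True)
    X = featurize(reminders, vectorizer, fit=True)
    y = [notify.weekday() for _, _, notify in reminders]  # target: notification weekday
    model = LogisticRegression(max_iter=1000)
    return model.fit(X, y), vectorizer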
As always, no exhaustive summary of the paper point-by-point here, straight into some of our findings (there’s much more in the paper):
Want to know more? See the taxonomy? See more pretty plots? Look at some equations? Learn how this could improve intelligent assistants? Read the paper!
@inproceedings{graus2016analyzing,
author = {Graus, David and Bennett, Paul N. and White, Ryen W. and Horvitz, Eric},
title = {Analyzing and Predicting Task Reminders},
year = {2016},
isbn = {9781450343688},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2930238.2930239},
doi = {10.1145/2930238.2930239},
booktitle = {Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization},
pages = {7--15},
numpages = {9},
keywords = {prospective memory, reminders, log studies, intelligent assistant},
location = {Halifax, Nova Scotia, Canada},
series = {UMAP '16}
}
Read a pre-print of our paper below:
@inproceedings{graus2016dynamic,
author = {Graus, David and Tsagkias, Manos and Weerkamp, Wouter and Meij, Edgar and de Rijke, Maarten},
title = {Dynamic Collective Entity Representations for Entity Ranking},
year = {2016},
isbn = {9781450337168},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2835776.2835819},
doi = {10.1145/2835776.2835819},
booktitle = {Proceedings of the Ninth ACM International Conference on Web Search and Data Mining},
pages = {595--604},
numpages = {10},
keywords = {fielded retrieval, entity retrieval, entity ranking, content representation},
location = {San Francisco, California, USA},
series = {WSDM '16}
}
In our latest paper we study the problem of entity ranking. In search engines, people often search for entities: real-life “things” (people, places, companies, movies, etc.). Google, Bing, Yahoo, DuckDuckGo: all big web search engines cater to this type of information need by displaying knowledge panels (they go by many names, but essentially they are little snippets that show a summary of information related to an entity). You’ve seen this before, but if you haven’t, see the picture below;
One challenge in giving people the entities they search for is that of vocabulary mismatch; people use many different ways to search for entities. Well-formed queries like “Kendrick Lamar” may be a large chunk, but just as well, you’ll find people searching for “k.dot,” or even more abstract/descriptive queries when users do not exactly remember the name of who they are looking for.
Another example is when events unfold in the real world, e.g., Michael Brown being killed by cops in Ferguson. As soon as this happens, and news media starts reporting it, people may start looking for relevant entities (Ferguson) by searching for previously unassociated words, e.g., “police shooting missouri.”
A final example (also in our paper) is shown below. The entity Anthropornis has a small and matter-of-fact description on Wikipedia (it is a stub);
But on Twitter, Brody Brooks refers to this particular species of penguin in the following way;
Baddest motherfucking penguin there ever was. http://t.co/WBACdddL
— Brody Brooks (@BrodyBr) December 8, 2012
While putting profanity in research papers is not greatly appreciated, this tweet does illustrate our point: people do refer to entities in different (and rich!) ways. The underlying idea of our method is to leverage this for free, to close the gap between the vocabulary of people, and the (formal) language of the Knowledge Base. More specifically, the idea is to enable search engines to automagically incorporate changes in search behavior for entities (“police shooting + ferguson”), and different ways in how people refer to entities (bad penguins).
So how? We propose to “expand” entity descriptions by mining content from the web; that is, we add words to documents to make the documents easier to find. We collect these words from tweets, social tags, web anchors (links on webpages), and search engine queries, all of which are somehow associated with entities. So in the case of our Anthropornis example, the next time someone searches for the “baddest penguin there ever was,” Anthropornis will be ranked higher.
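To illustrate the idea (with hypothetical field names, not the paper’s actual implementation), expanding an entity boils down to appending externally mined terms to a fielded document representation, one field per external source:

# A minimal sketch of expanding a fielded entity representation with externally
# mined descriptions; field names and term splitting are illustrative only.
from collections import defaultdict

def expand_entity(entity_doc, expansions):
    """entity_doc: dict mapping field name -> list of terms.
    expansions: list of (source, text) pairs, e.g. ('tweets', '...')."""
    expanded = defaultdict(list, {field: list(terms) for field, terms in entity_doc.items()})
    for source, text in expansions:
        expanded[source].extend(text.lower().split())  # one field per external source
    return dict(expanded)

anthropornis = {"title": ["anthropornis"], "body": ["a", "genus", "of", "giant", "penguin"]}
expanded = expand_entity(anthropornis, [("tweets", "baddest penguin there ever was")])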
These types of methods (document expansion) have been studied before, but two things set our setting apart from previous work;
As usual, I won’t go into the nitty-gritty details of our experimental setup, modeling and results in this post. Read the paper for that (actually, the experimental setup details are quite nitty and gritty in this case). Let’s cut to the chase: adding external descriptions to your entity representation improves entity ranking effectiveness (badum-tss)!
Furthermore, it is important to assign individual weights to the different sources, as the sources vary a lot in terms of content (tweets and queries differ in length, quality, etc.). The expansions also vary across entities (popular entities may receive many expansions, whereas less popular entities may receive few or none). To balance this, we inform the ranker of the number of expansions an entity has received. We address all of the above by proposing different features for our machine learning model. Finally, we show that in our dynamic scenario it is a good idea to (periodically) retrain your ranker to re-assess these weights.
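As a rough illustration of what such features could look like (hypothetical feature names, not the feature set from the paper), one could compute, per expansion field, a simple query/field match score plus the amount of expansion content the entity has received:

# A minimal sketch of per-source features for a learning-to-rank model, using the
# fielded entity representation from the sketch above.
def ranking_features(query_terms, expanded_entity):
    """expanded_entity: dict mapping field name -> list of terms."""
    features = {}
    for field, terms in expanded_entity.items():
        features["overlap_" + field] = sum(terms.count(t) for t in query_terms)
        features["n_expansion_terms_" + field] = len(terms)
    return features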
What I find attractive about our method is that it’s relatively “cheap” and simple: you just add content (= words) to your entity representation (= document) and retrieval improves! Even if you omit the fancy machine learning re-training (detailed in our paper). Anyway, for the full details and more pretty plots, do read our paper!
Additionally, you can check out the slides of a talk I gave on this paper at DIR 2015, or check out the poster I presented there.
Our paper “Dynamic Collective Entity Representations for Entity Ranking,” with Manos Tsagkias, Wouter Weerkamp, Edgar Meij and Maarten de Rijke was accepted at The 9th ACM International Conference on Web Search and Data Mining (WSDM2016). Read the extended one-page abstract (submitted to DIR 2015) here (PDF, 200kb).
Abstract: Entity ranking, i.e., successfully positioning a relevant entity at the top of the ranking for a given query, is inherently difficult due to the potential mismatch between the entity's description in a knowledge base, and the way people refer to the entity when searching for it. To counter this issue we propose a method for constructing dynamic collective entity representations. We collect entity descriptions from a variety of sources and combine them into a single entity representation by learning to weight the content from different sources that is associated with an entity for optimal retrieval effectiveness. Our method is able to add new descriptions in real time, and learn the best representation at set time intervals as time evolves so as to capture the dynamics in how people search entities. Incorporating dynamic description sources into dynamic collective entity representations improves retrieval effectiveness by 7% over a state-of-the-art learning to rank baseline. Periodic retraining of the ranker enables higher ranking effectiveness for dynamic collective entity representations.
I will post a pre-print here soon.
Update: Cool! Our paper has been selected for presentation as a long talk at the conference.
Update 2: The extended abstract of this paper has been accepted for poster + oral presentation at the 14th Dutch-Belgian Information Retrieval Workshop (DIR 2015). I’ve uploaded the slides of my DIR talk here.
On Friday December 12th I’ll be giving a talk on our Understanding Email Traffic work at the Frontiers of Forensic Science Lecture Series.
When? Friday December 12th, 15:00 – 18:00
Where? Science Park 904, C0.05
Click the flyer for more information
In our paper “Recipient recommendation in enterprises using communication graphs and email content” we study email traffic by looking into recipient recommendation, or: given an email without recipients, can we predict to whom it should be sent? Successfully predicting this helps in understanding the underlying mechanics and structure of an email network. To model this prediction task we consider the email traffic as a network, or graph, where each unique email account (user) corresponds to a node, and edges correspond to emails sent between users (see e.g., Telecommunications network on Wikipedia).
Google does recipient recommendation (in Gmail) by considering a user’s so-called egonetwork, i.e., a single user’s previously sent and received emails. When you frequently email Alan and Bob jointly, Gmail might suggest including Alan when you compose a new message to Bob. This approach only considers previous interactions between you and others (restricted to the egonetwork), and ignores signals such as the content of an email. This means that Gmail can only start recommending users once you’ve addressed at least one recipient (technically, this isn’t recipient recommendation, but rather “CC prediction”).
We decided to see what we can do if we consider all information available in the network, i.e., both the full communication graph (beyond the user’s ego-network), and the content of all emails present in the network (intuition: if you write a message about personal topics, the intended recipient is more likely to be a friend than a coworker). In short, this comes down to combining social network analysis (how closely connected are two users in the communication graph?) with language modeling (how well does the email’s content match the language of previous communication between two users?).
We model the task of recommending recipients as that of ranking users. Or: given a sender (you) and an email (the email you wrote), the task is to rank highest those users in the network that are most likely to receive your email. This ranking happens in a streaming setting, where we update all models (language and network) for each new email that is sent (so that we do not use “future emails” in predicting the recipients). This means that the network and language models change over time, and adapt to changes in language use and topics being discussed, but also to the ‘distance’ between users in the network.
We use a generative model to rank recipients, by estimating the probability of observing a recipient (R), given an email (E) and sender (S):
P(R | S, E) ∝ P(E | S, R) · P(S | R) · P(R)
If you don’t get this, don’t worry: in human language this reads as the probability (P) of observing recipient R, given sender S and email E. We compute this probability for each pair of users in the network, and rank the resulting probabilities to find the most likely sender and recipient pairs.
In this ranking function, we consider three components to estimate this probability (see our paper for how we use Bayes’ Theorem to end up with this final ranking function). One corresponds to the email content, the other two correspond to social network analysis (SNA) properties;
The first component (P(E | S, R), reads: probability of observing email E, given sender S and recipient R) leverages email content, and corresponds to the email likelihood (i.e., how likely it is for email E to be generated by the interpersonal language model (explained below) of S and R). For each user in the network we generate language models, which allows us to compare and combine communication between users in different ways. We thus model, e.g., a user’s outgoing language model (all emails sent), incoming language model (all emails received), and joint language model, next to a corpus-wide language model.
Finally, using these different language models, we model interpersonal language models, or the communication between two users (taking all email traffic between user A and user B). See the picture below for an illustration of these different language models.
This way of modeling email communication can be applied to more cool things that we didn’t fully explore for this paper, e.g., finding users that use significantly different language from the rest, by comparing how much a user’s incoming, outgoing or joint LM differs from the corpus LM. Or comparing the interpersonal LMs that are associated with a single user, to identify a significantly different one (imagine comparing your emails with coworkers to those with your boyfriend/girlfriend/spouse). Future work! (?)
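As a rough illustration of the interpersonal language models mentioned above (assuming a hypothetical layout of the email data, and plain maximum-likelihood estimation rather than the smoothed models from the paper), such a model can be built from the traffic between two users like this:

# A minimal sketch: a unigram language model over all emails exchanged between
# two users, in either direction.
from collections import Counter

def interpersonal_lm(emails, user_a, user_b):
    """emails: iterable of (sender, recipients, text) tuples."""
    counts = Counter()
    for sender, recipients, text in emails:
        if (sender == user_a and user_b in recipients) or \
           (sender == user_b and user_a in recipients):
            counts.update(text.lower().split())
    total = sum(counts.values())
    return {term: freq / total for term, freq in counts.items()} if total else {}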
The second component (P(S | R), reads: probability of observing sender S given recipient R) corresponds to the closeness of sender S and candidate recipient R, in SNA terms. We explore two approaches to estimating this closeness: (1) how many times S and R co-occur in an email (i.e., are addressed together), and (2) the number of emails sent between S and R.
The third and final component (P(R), reads: probability of observing recipient R) corresponds to the prior probability of observing candidate recipient R (i.e., how likely is it for R to receive any email at all?). We model this by (1) counting the number of emails R has received, and (2) the PageRank score of R (favoring ‘important’ recipients).
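Putting the three components together, ranking candidate recipients then looks roughly like the sketch below. This is purely an illustration: the component estimators are passed in as placeholder functions, not the estimators from the paper.

# A minimal sketch of ranking candidate recipients by P(E|S,R) * P(S|R) * P(R),
# computed in log space; assumes the component estimates are non-zero (smoothed).
import math

def rank_recipients(sender, email_terms, candidates, lm_fn, closeness_fn, prior_fn):
    """lm_fn(s, r): interpersonal LM as dict term -> probability (used for P(E|S,R));
    closeness_fn(s, r): estimate of P(S|R); prior_fn(r): estimate of P(R)."""
    scores = {}
    for r in candidates:
        lm = lm_fn(sender, r)
        # Email likelihood under the interpersonal LM; the small floor stands in
        # for proper smoothing of unseen terms.
        log_p_email = sum(math.log(lm.get(term, 1e-9)) for term in email_terms)
        scores[r] = (log_p_email
                     + math.log(closeness_fn(sender, r))
                     + math.log(prior_fn(r)))
    return sorted(candidates, key=lambda r: scores[r], reverse=True)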
We use the notorious Enron email corpus to find the best methods to estimate our components. Then, we use the very new, and soon-to-be-released, Avocado corpus to evaluate our model. I won’t go into detail about our experiments (see the paper for those!), but suffice it to say that we compare the effectiveness of the email content (LM) component and the social network analysis (SNA) components. There are several findings worth mentioning:
The solution for the two ‘issues’ (2nd and 3rd bullet) is to incorporate time in our models, e.g. by introducing a decay in the language modeling (older emails become less important), and edge weights in the SNA components (older interactions count less than recent ones).
Got it? Read the paper for the full story! (PDF here)
Thanks to the kind lady at the registration desk I had the unexpected honor of representing the beautiful former Caribbean country of the Netherlands Antilles at LegalTech 2014, the self-proclaimed largest and most important legal technology event of the year.
LegalTech is an “industry conference” where attorneys, lawyers, and IT people meet up and discuss the current and future state of law and IT. Product vendors show their software and tools aimed at making the life of the modern-day attorney easier. As I work on semantic search in eDiscovery, my reasons to attend (being generously invited by Jason Baron) were;
Indeed, in summary, to retrieve information! (As an IR researcher does). The conference included keynotes, conference tracks, panel discussions and a huge exhibitor show where over 100 vendors of eDiscovery-related software present their products. All this fits on just three floors of the beautiful Hilton Midtown Hotel in the middle of New York.
To get a feel for the topics and themes: track titles included, among others, eDiscovery, Transforming eDiscovery, Big Data, Information Governance, Advanced IT, Technology in Practice, Technology and Trends Transforming the Legal World, and Corporate Legal IT.
LegalTech is a playground for attorneys and lawyers, not so much for PhD students who work on information extraction and semantic search. Needless to say I was far from the typical attendee (possibly the most atypical one). But LegalTech proved to be an informative and valuable crash course in eDiscovery for me (I think I can tick the boxes of all 4 of the aforementioned reasons for attending).
The keynotes allowed me to get a better understanding of eDiscovery (among others by hearing some of the founders of the eDiscovery field), the panel discussions were very useful for getting an understanding of the open problems, challenges and future directions, and finally the trade show allowed me to get a very complete overview of what is being built and used right now in terms of eDiscovery-supporting software.
I had varying success talking to vendors about the stuff I was interested in: the technology and algorithms behind their tools, and the choices for including or excluding certain features and functionalities. More often than not, an innocently nerdy question on my part would be turned into a software sales pitch. To be fair, these people were there to sell, or at least show, so this is hardly unexpected.
During the different tracks and panel discussions I attended, I noticed a couple of things. This is by no means a complete overview of the current things that matter in eDiscovery, but just a personal report of the things I found interesting or noteworthy;
Some of the “open door” recurring themes revolved around the “man vs machine” debate, trust in algorithms, the balance between computer-assisted review and manual review, the intricacies of measuring algorithm performance, and where Moore’s law will bring the legal world in 5-10 years. Highly relevant issues for attorneys, lawyers and eDiscovery vendors, but things that I take for granted and consider the starting point (default win for algorithms!). However, this debate does not seem settled in this domain yet: while everyone accepts computer-assisted review as the unavoidable future, it still seems unclear what exactly this unavoidable future will look like.
On multiple occasions I heard video and image retrieval being mentioned as important future directions for eDiscovery (good news for some colleagues at the University of Amsterdam down the hall). Also, the challenges of privacy and data ownership in a mobile world, where enterprise and personal data are mixed and spread out across iPads, smartphones, laptops and clouds, were identified as major future hurdles.
Finally, in the session titled “Have we Reached a ‘John Henry’ Moment in Evidentiary Search?” the panelists (who included Jason Baron and Ralph Losey) touched upon using eDiscovery tools and algorithms for information governance. Currently, methods are being developed to detect, reconstruct, classify or find events of interest after the fact. Couldn’t these be used in a predictive setting instead of a retrospective one, learning to predict bad stuff before it happens? Interesting stuff.
What I noticed particularly at the trade show was the large overlap between tools, both in functionality and features and in looks and design. But what I found more striking is the heavy focus on metadata. The tools typically use metadata such as timestamps, authors, and document types to allow users to drill down through a dataset, filtering for time periods, keywords, authors, or a combination of all of these.
Visualizations aplenty, with the most frequent ones being Google Ngrams-ish keyword histograms, and networks (graphs) of interactions between people. What was shocking for an IR/IE person like myself is that typically, once a user is done drilling down to a subset of documents, he is relegated to prehistoric keyword search to explore and understand the content of that set of documents. Oh no!
But for someone who’s spending 4 years of his life on enabling semantic search in this domain, this isn’t worrying, but rather promising! After talking to vendors I learned that plenty of them are interested in these kinds of features and functionalities, so there is definitely room for innovation here. (To be fair, whether the target users agree might be another question.)
Anyway, this ‘metadata heaviness’ is obviously a gross oversimplification and generalization, and there were definitely some interesting companies that stood out for me. Here’s a small, incomplete, and biased summary;
As I hinted at before, I’m missing some more content-heavy functionalities, e.g., (temporal) entity and relation extraction, identity normalization, maybe (multi-document) summarization? Conveniently, this is exactly what my group and I are working on! I suppose the eDiscovery world just doesn’t know what it’s missing, yet ;-).
Our paper “Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams” with Manos Tsagkias, Lars Buitinck, and Maarten de Rijke got accepted as a full paper to ECIR 2014! See a preprint here:
@inproceedings{graus2014generating,
author={Graus, David and Tsagkias, Manos and Buitinck, Lars and de Rijke, Maarten},
title={Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams},
booktitle={Advances in Information Retrieval},
year={2014},
publisher={Springer International Publishing},
address={Cham},
pages={286--298},
url={https://doi.org/10.1007/978-3-319-06028-6_24},
doi={10.1007/978-3-319-06028-6_24},
series = {ECIR '14}
}
This blog post is intended as a high level overview of what we did. Remember my last post on entity linking? In this paper we want to do entity linking on entities that are not (yet) on Wikipedia, or:
Recognizing (finding) and classifying (determining their type: persons, locations or organizations) unknown (not in the knowledge base) entities on Twitter (this is where we want to find them)
These entities might be unknown because they are newly surfacing (e.g. a new popstar that breaks through), or because they are so-called ‘long tail’ entities (i.e. very infrequently occurring entities).
To detect these entities, we generate training data to train a supervised named-entity recognizer and classifier (NERC). Training data is hard to come by: it is expensive to have people manually label Tweets, and you need enough of these labels to make it work. We automate this process by using the output of an entity linker to label Tweets. The advantage is that this is a very cheap and easy way to create a large set of training data. The disadvantage is that there might be more noise: wrong labels, or bad Tweets that do not contain enough information to learn patterns for recognizing the types of entities we are looking for.
To address this latter obstacle, we apply several methods to filter the Tweets, keeping those we deem ‘nice’. One of these methods involves scoring Tweets based on their noise. We apply very simple features to determine this ‘noise level’ of a Tweet, amongst others how many mentions (@’s), hashtags (#’s) and URLs it contains, but also the ratio of upper-case to lower-case letters, the average word length, the Tweet’s length, etc. An example of this Twitter noise score is below (these are Tweets from the TREC 2011 Microblog corpus we used):
Top 5 quality Tweets
Top 5 noisy Tweets
In addition, we filter Tweets based on the confidence score of the entity linker, so as not to include Tweets that contain unlikely labels.
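For illustration, a minimal sketch of such a surface-level noise score could look as follows; the features loosely follow the ones listed above, but the exact definitions and the equal weighting are hypothetical, not the scoring we used:

# A minimal sketch of a simple Tweet noise score: more mentions, hashtags and URLs,
# a higher upper-case ratio and shorter words all push the score up.
import re

def noise_score(tweet):
    tokens = tweet.split()
    n_mentions = sum(t.startswith("@") for t in tokens)
    n_hashtags = sum(t.startswith("#") for t in tokens)
    n_urls = len(re.findall(r"https?://\S+", tweet))
    letters = [c for c in tweet if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    avg_word_len = sum(len(t) for t in tokens) / max(len(tokens), 1)
    # Equal weights for illustration only; a real score would be tuned.
    return n_mentions + n_hashtags + n_urls + upper_ratio + 1.0 / max(avg_word_len, 1.0)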
It is difficult to measure how well we do in finding entities that do not exist on Wikipedia, since we need some sort of ground truth to determine whether we did well or not. As we cannot manually check for 80,000 Tweets whether the identified entities are in or out of Wikipedia, we take a slightly theoretical approach.
If I were to put it in a picture (and I did, conveniently), it’d look like this:
In brief, we take small ‘samples’ of Wikipedia: one such sample represents the “present KB”, the initial state of the KB. The samples are created by removing X% of the Wikipedia pages (from 10% to 90%, in steps of 10). We then label Tweets using the full KB (100%) to create the ground truth: this full KB represents the “future KB”. Our “present KB” then labels the Tweets it knows, and uses the Tweets it cannot link as sources for new entities. If the NERC (trained on the Tweets labeled by the present KB) then manages to identify entities in the set of “unlinkable” Tweets, we can compare the predictions to the ground truth and measure performance.
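A minimal sketch of this evaluation setup (with hypothetical data structures; entity identifiers standing in for Wikipedia pages) could look like this:

# A minimal sketch: sample a "present" KB by removing a fraction of entities, and
# score predicted new entities against the ground truth from the full "future" KB.
import random

def make_present_kb(full_kb, removed_fraction, seed=42):
    entities = sorted(full_kb)
    random.Random(seed).shuffle(entities)
    return set(entities[int(len(entities) * removed_fraction):])

def evaluate_new_entities(predicted, ground_truth, present_kb):
    new_truth = {e for e in ground_truth if e not in present_kb}  # truly "new" entities
    hits = len(set(predicted) & new_truth)
    precision = hits / max(len(set(predicted)), 1)
    recall = hits / max(len(new_truth), 1)
    return precision, recall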
We report standard metrics, precision and recall, on two levels: entity level and mention level. However, I won’t go into any details here; I encourage you to read the results and findings in the paper.
Several Dutch media have picked up our work:
Original press release:
Press coverage:
Slides of my talk at #ECIR2014 are now up on Slideshare;
Download a pre-print of Graus, D., Peetz, M-H., Odijk, D., de Rooij, O., de Rijke, M. “yourHistory — Semantic linking for a personalized timeline of historic events,” in CEUR Workshop Proceedings, 2014.
I presented yourHistory at ICT.OPEN 2013:
@dvdgrs Linking entities in personalized events timeline http://t.co/M58mhNkRDX #ICTOPEN2013 @UvA_Amsterdam @mdr pic.twitter.com/ebIVKe4VXQ
— Lora Aroyo (@laroyo) November 27, 2013
The slides of my talk are up on SlideShare:
And we got nominated for the “Innovation & Entrepreneurship Award” there! (Sadly, we didn’t win, though.)
For the LinkedUp Challenge Veni competition at the Open Knowledge Conference (OKCon), we (Maria-Hendrike Peetz, me, Daan Odijk, Ork de Rooij and Maarten de Rijke) created yourHistory: a Facebook app that uses entity linking for personalized historic timeline generation (using d3.js). Our app got shortlisted (top 8 out of 22 submissions) and is in the running for the first prize of 2000 euro!
Read a small abstract here:
In history we often study dates and events that have little to do with our own life. We make history tangible by showing historic events that are personal and based on your own interests (your Facebook profile). Often, those events are small-scale and escape the history books. By linking personal historic events with global events, we aim to link your life with global history: writing your own personal history book.
Read the full story here;
And try out the app here!
It’s currently still a little rough around the edges. There’s an extensive to-do list, but if you have any feedback or remarks, don’t hesitate to leave me a message below!