yourHistory – Entity linking for a personalized timeline of historic events

Download a pre-print of Graus, D., Peetz, M-H., Odijk, D., de Rooij, Ork., de Rijke, M. “yourHistory — Semantic linking for a personalized timeline of historic events,” in CEUR Workshop Proceedings, 2014.

Update #1

I presented yourHistory at ICT.OPEN 2013:

The slides of my talk are up on SlideShare:

And we got nominated for the “Innovation & Entrepreneurship Award” there! (sadly, didn’t win though ;) ).


Original Post

yourHistory - OKConference poster

For the LinkedUp Challenge Veni competition at the Open Knowledge Conference (OKCon), we (Maria-Hendrike Peetz, me, Daan Odijk, Ork de Rooij and Maarten de Rijke) created yourHistory: a Facebook app that uses entity linking for personalized historic timeline generation (using d3.js). Our app got shortlisted (top 8 out of 22 submissions) and is in the running for the first prize of 2000 euro!

Read a small abstract here:

In history we often study dates and events that have little to do with our own lives. We make history tangible by showing historic events that are personal and based on your own interests (your Facebook profile). Often, those events are small-scale and escape the history books. By linking personal historic events with global events, we aim to link your life with global history: writing your own personal history book.

Read the full story here,

And try out the app here!

It’s currently still a little rough around the edges. There’s an extensive to-do list, but if you have any feedback or remarks, don’t hesitate to leave me a message below!

How many things took place between 1900 and today? DBPedia knows

For a top-secret project, I am looking at retrieving from DBPedia all entities that represent a ‘(historic) event’.

Now I could rant about how horrible it is to formulate a ‘simple’ query like this in the structured-but-anarchistic Linked Data format, so I will: the request “give me all entities that represent ‘events’ from DBPedia” takes me 3 SPARQL queries, since different predicates represent the same thing, and I probably need a lot more to get a proper subset of the entities I’m looking for. Currently, I filter for entities that have a dbpedia-owl:date property, entities that have a dbprop:date property (yes, these predicates express the exact same thing), and entities that belong to the Event class.
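The three filters could look roughly like the query strings below, with the deduplication happening client-side. This is a minimal sketch: the query strings and the `merge_results` helper are mine for illustration, not the actual project code, and actually running the queries against the endpoint is left out.

```python
# Three overlapping SPARQL queries for 'event' entities (sketch), plus
# a client-side union that drops the duplicates between their results.
QUERIES = [
    # entities with a dbpedia-owl:date property
    "SELECT DISTINCT ?e WHERE { ?e dbpedia-owl:date ?d }",
    # entities with a dbprop:date property (same meaning, different predicate)
    "SELECT DISTINCT ?e WHERE { ?e dbprop:date ?d }",
    # entities typed as Event
    "SELECT DISTINCT ?e WHERE { ?e rdf:type dbpedia-owl:Event }",
]

def merge_results(*result_sets):
    """Union the entity URIs returned by each query, without duplicates."""
    events = set()
    for uris in result_sets:
        events.update(uris)
    return sorted(events)
```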

Anyway, if we count for each year how many event entities there are, we get the following graph:

events

Which is interesting, because it shows how there are loads of events in the near past, around WWII, and around WWI. I could now say something about how interesting it is that our collective memory is focused on the near past, but then I looked at the events and saw loads of sports events, so I won’t, and will rather say that back in the day we were terrible at organizing sports events. Still, the knowledge that between 1900 and today a total of 16,589 events happened seems significant to me.

Graduation pt. 4: What’s next

Just a quick update to let myself know what’s going to happen next: it’s time to produce some results! While I was getting quite stuck figuring out the best – or rather, most practical – way to extract keywords from a text (and not just any text, mind you, but notes of biologists), my supervisor told me it’s time to get some results. Hard figures. I decided to scrap POS-tagging the notes to extract valuable phrases, after I noticed the accuracy of the default NLTK POS-tagger was way too low for practical use. Not too surprising, considering the default NLTK tagger was probably not trained on biologists’ notes.

Anyway, we came up with the following tests:

I use two sources:

  1. The first being the biologist’s notes (the Cyttron DB).
  2. The second being specific Wikipedia pages on recurring topics of the Cyttron DB:
    Alzheimer, Apoptosis, Tau protein & Zebrafish.

From these two sources, I will use five different methods of representing the information:

  1. Literal representation (using each word, no edit)
  2. Simple keyword extraction (using word frequency after removing English stopwords)
  3. Bigram collocations
  4. Trigram collocations
  5. Keyword combo (word frequency + bigrams + trigrams)

Each of these ways of representing the source information can then be ‘boosted’ by using WordNet to generate synonyms, doubling the number of representations (2×5=10!).
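Methods 2 and 3 could be sketched with only the standard library. This is an illustrative simplification: the tiny stopword set stands in for NLTK’s English stopword list, and plain adjacent-pair counts stand in for NLTK’s collocation measures; the function names are mine.

```python
from collections import Counter

# Tiny stopword list standing in for NLTK's English stopwords (assumption).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def keywords(text, n=5):
    """Method 2: word frequency after removing stopwords."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(n)]

def bigrams(text, n=5):
    """Method 3 (simplified): most frequent adjacent word pairs."""
    words = text.lower().split()
    return [p for p, _ in Counter(zip(words, words[1:])).most_common(n)]
```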

With these 10 different representations of each of the two datasources (2×10), I will use 3 ways to try to determine the subject:

  1. Literal label matching using two different sets of ontologies:
    1. Cyttron-set: Gene Ontology, Human Disease Ontology, Mouse Pathology & National Cancer Institute Thesaurus
    2. DBPedia ontology
  2. Matching the sources to descriptions of ontology terms, using the same two sets of ontologies.
  3. If I manage: matching the datasources to the ‘context’ of ontology terms.
    I started working on a method that takes a term in an ontology and explores its surrounding nodes. I will collect all ‘literals’ attached to a node and throw them into a big pile of text, which I will then use as a bag of words to match to the datasources.
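The ‘context’ idea from step 3 could be sketched like this; the toy triples and the subject-test used to tell resources from literals are both simplifications of mine, not how a real RDF store works (and a real version would need to guard against cycles).

```python
# Toy triples standing in for the RDF store (all names made up).
TRIPLES = [
    ("Alzheimer", "label", "Alzheimer's disease"),
    ("Alzheimer", "comment", "a neurodegenerative disease"),
    ("Alzheimer", "relatedTo", "Tauopathy"),
    ("Tauopathy", "label", "tau protein aggregation"),
]

def context_bag(node):
    """Pool the literals attached to a node and its neighbours into one
    big bag of words."""
    bag = []
    for s, p, o in TRIPLES:
        if s == node:
            if any(t[0] == o for t in TRIPLES):
                # o is a resource: follow the edge, collect its literals too
                bag.extend(context_bag(o))
            else:
                # o is a literal: add its words to the pile
                bag.extend(o.lower().split())
    return bag
```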

This brings the total number of tests to 120:

  2 sources (wiki/cyttron)
 10 representations of these sources
  3 methods (literal/desc/context)
  2 ontologies (cyttron-set/dbpedia)
 -------------------------------------
  2 * 10 * 3 * 2 = 120
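The count is easy to sanity-check with itertools.product; the configuration labels below are just illustrative names, not the actual test identifiers.

```python
from itertools import product

sources = ["wiki", "cyttron"]
representations = [
    "literal", "keywords", "bigrams", "trigrams", "combo",
    "literal+wn", "keywords+wn", "bigrams+wn", "trigrams+wn", "combo+wn",
]
methods = ["literal", "description", "context"]
ontologies = ["cyttron-set", "dbpedia"]

# One test per combination of source, representation, method and ontology.
tests = list(product(sources, representations, methods, ontologies))
print(len(tests))  # 2 * 10 * 3 * 2 = 120
```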

And in-between I also have Lowlands 2011 and Vollt to attend. Oh gosh…

[read all my thesis-related posts]

DBPedia Twitterbot: Introducing @grausPi!

12/12/12 update: since @sem_web moved to live in my Raspberry Pi, I’ve renamed him @grausPi

The last couple of days I’ve spent working on my graduation project by working on a side-project: @sem_web, a Twitter bot that queries DBPedia [Wikipedia’s Linked Data equivalent] for knowledge.

@sem_web is able to recognize 249 concepts, defined by the DBPedia ontology, and sends SPARQL queries to the DBPedia endpoint to retrieve more specific information about them. Currently, this means that @sem_web can check an incoming tweet (mention) for known concepts, and then return an instance (example) of the concept, along with a property of this instance and the value of that property. An example of @sem_web’s output:

[findConcept] findConcept('video game')
[findConcept] Looking for concept: video game
 [u'http://dbpedia.org/class/yago/ComputerGame100458890', 
'video game']

[findInst] Seed: [u'http://dbpedia.org/class/yago/ComputerGame100458890', 
'video game']
[findInst] Has 367 instances.
[findInst] Instance: Fight Night Round 3

[findProp] Has 11 properties.
[findProp] [u'http://dbpedia.org/property/platforms', u'platforms']

[findVal] Property: platforms (has 1 values)
[findVal] Value: Xbox 360, Xbox, PSP, PS2, PS3
[findVal] Domain: [u'Thing', u'work', u'software']
[findVal] We're talking about a thing...
Fight Night Round 3 is a video game. Its platforms is Xbox 360, Xbox, 
PSP, PS2, PS3.

This is how it works:

  1. Look for words occurring in the tweet that match a given concept’s label.
  2. If found (concept): send a SPARQL query to retrieve an instance of the concept (an object with rdf:type concept).
  3. If not found: send a SPARQL query to retrieve a subClass of the concept. Go to step 1 with subClass as concept.
  4. If found (instance): send SPARQL queries to retrieve a property, value and domain of the instance. The domain is used to determine whether @sem_web is talking about a human or a thing.
  5. If no property with a value is found after several tries: Go to step 2 to retrieve a new instance.
  6. Compose a sentence (currently @sem_web has 4 different sentences) with the information (concept, instance, property, value).
  7. Tweet!
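The steps above could be sketched roughly as follows. The in-memory dicts are a made-up stand-in for the SPARQL queries of steps 2–4, the subClass fallback and retry loop are omitted, and all names are mine, not the bot’s actual code.

```python
import random

# Toy stand-in for the DBPedia endpoint: concept -> instances,
# instance -> (property, value) pairs.
INSTANCES = {"video game": ["Fight Night Round 3"]}
PROPERTIES = {"Fight Night Round 3": [("platforms", "Xbox 360, Xbox, PSP, PS2, PS3")]}

def reply(tweet, concepts):
    # Step 1: match a known concept label in the tweet.
    concept = next((c for c in concepts if c in tweet.lower()), None)
    if concept is None:
        return None
    # Step 2: pick an instance of the concept.
    instance = random.choice(INSTANCES[concept])
    # Step 4: pick a property/value pair of the instance.
    prop, value = random.choice(PROPERTIES[instance])
    # Step 6: compose a sentence from the pieces.
    return "%s is a %s. Its %s is %s." % (instance, concept, prop, value)
```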

Besides that, @sem_web posts a random tweet once an hour, picking a random concept from the DBPedia ontology. Working on @sem_web allows me to get to grips with both the SPARQL query language and programming in Python (which is still something I haven’t done before in a larger-than-20-lines-of-code way).

Comparing concepts

What I’m working on next is a method to compare multiple concepts when @sem_web detects more than one in a tweet. Currently, this works by taking each concept and querying for all its superClasses. I store the path from the seed to the topClass (Entity) in a list, repeat the process for the next concept, and then compare both paths to the top to identify a common parent-Class.
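The path comparison can be sketched with a toy subClass hierarchy; the class names follow the YAGO chains from the log output further down (minus the URIs), and the helper names are mine.

```python
# Toy superclass hierarchy (child -> parent), standing in for the
# rdfs:subClassOf chains queried from DBPedia.
SUPER = {
    "ChangeOfLocation": "Movement", "Movement": "Happening",
    "Happening": "Event", "Locomotion": "Motion", "Motion": "Change",
    "Change": "Action", "Action": "Act", "Act": "Event",
    "Event": "PsychologicalFeature",
    "PsychologicalFeature": "Abstraction", "Abstraction": "Entity",
}

def path_to_top(concept):
    """Follow superClass links from the seed up to the top class."""
    path = [concept]
    while path[-1] in SUPER:
        path.append(SUPER[path[-1]])
    return path

def common_parent(a, b):
    """First class shared by both paths, with the hop count from each seed."""
    path_b = path_to_top(b)
    for hops_a, cls in enumerate(path_to_top(a)):
        if cls in path_b:
            return cls, hops_a, path_b.index(cls)
    return None
```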

This is relevant for my graduation project as well, because a large part of determining the right subject for a text will be determining the ‘proximity’ or similarity of the different concepts in it. Still, that specific task is a much bigger thing; finding common superClasses is just a tiny step towards it. There are other interesting relationships to explore, for example partOf/sameAs relations. I’m curious to see what kind of information I will gather with this from larger texts.

An example of the concept comparison in action. From the following tweet:

>>> randomFriend()
Picked mendicot: @offbeattravel .. FYI, my Twitter bot 
@vagabot found you by parsing (and attempting to answer) 
travel questions off the Twitter firehose ..

I received the following concepts:

5 concepts found.
[u'http://dbpedia.org/class/yago/Bot102311879',
u'http://dbpedia.org/class/yago/ChangeOfLocation107311115',
u'http://dbpedia.org/class/yago/FYI(TVSeries)',
u'http://dbpedia.org/class/yago/Locomotion100283127',
u'http://dbpedia.org/class/yago/Travel100295701']

The findCommonParent function takes two URIs and processes them, appending the superClasses of each URI to a new list. This way I can track all the ‘hops’ made by counting the list index. As soon as the function has processed both URIs, it starts comparing the path lists to determine the first common parent.

>>> findCommonParents(found[1],found[3])

[findParents]	http://dbpedia.org/class/yago/ChangeOfLocation107311115
[findParents]	Hop | Path:
[findParents]	0   | [u'http://dbpedia.org/class/yago/ChangeOfLocation107311115']
[findParents]	1   | [u'http://dbpedia.org/class/yago/Movement107309781']
[findParents]	2   | [u'http://dbpedia.org/class/yago/Happening107283608']
[findParents]	3   | [u'http://dbpedia.org/class/yago/Event100029378']
[findParents]	4   | [u'http://dbpedia.org/class/yago/PsychologicalFeature100023100']
[findParents]	5   | [u'http://dbpedia.org/class/yago/Abstraction100002137']
[findParents]	6   | [u'http://dbpedia.org/class/yago/Entity100001740']
[findCommonP]	1st URI processed

[findParents]	http://dbpedia.org/class/yago/Locomotion100283127
[findParents]	Hop | Path:
[findParents]	0   | [u'http://dbpedia.org/class/yago/Locomotion100283127']
[findParents]	1   | [u'http://dbpedia.org/class/yago/Motion100279835']
[findParents]	2   | [u'http://dbpedia.org/class/yago/Change100191142']
[findParents]	3   | [u'http://dbpedia.org/class/yago/Action100037396']
[findParents]	4   | [u'http://dbpedia.org/class/yago/Act100030358']
[findParents]	5   | [u'http://dbpedia.org/class/yago/Event100029378']
[findParents]	6   | [u'http://dbpedia.org/class/yago/PsychologicalFeature100023100']
[findParents]	7   | [u'http://dbpedia.org/class/yago/Abstraction100002137']
[findParents]	8   | [u'http://dbpedia.org/class/yago/Entity100001740']
[findCommonP]	2nd URI processed

[findCommonP]	CommonParent found!
[findCommonP]	result1[3][0] matches with result2[5][0]
[findCommonP]	http://dbpedia.org/class/yago/Event100029378
[findCommonP]	http://dbpedia.org/class/yago/Event100029378

Here you can see the first common parentClass is ‘Event’: 3 hops away from ‘ChangeOfLocation’ and 5 hops away from ‘Locomotion’. If the function finds multiple superClasses, it processes multiple URIs at the same time (in one list). Anyway, this is just the basic stuff; there’s plenty more on my to-do list…

While the major part of the functionality I’m building for @sem_web will be directly usable for my thesis project, I haven’t been sitting still on more directly thesis-related things either. I’ve set up a local RDF store (a Sesame store) on my laptop with all the needed bio-ontologies; RDFLib’s in-memory stores were clearly not up to the large ontologies I had to load each time. This also means I have to structure my queries better, as not all information is available at any given time. I also – unfortunately – learned that one of my initial plans, finding the shortest path between two nodes in an RDF store to determine ‘proximity’, is actually quite a complicated task. Next I will focus on improving the concept comparison, taking more properties into account than only rdfs:subClassOf, and I’ll also work on extracting keywords (for which I still need to arrange testing data)… Till next time!
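On a plain adjacency dict the shortest-path idea is just textbook breadth-first search, as in the sketch below; the complication in practice is that an RDF store doesn’t hand you such a dict, so every hop costs a query. The graph shape and names here are illustrative, not from the actual store.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency dict; returns the node path
    from start to goal, or None if they are not connected."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None
```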

But mostly, the last weeks I’ve been learning SPARQL, improving my Python skills, and getting a better and more concrete idea of the possible approaches for my thesis project by working on sem_web.

[All thesis-related posts]