My Raspberry Pi is Alive

@grausPi

That’s a tiny ($35, a.k.a. €45) full-fledged computer, hooked up to my home internet connection 24/7 and running a headless webserver at http://pi.graus.nu (if it’s down, grausPi is off or my internet is down).

It also runs a revived @sem_web (remember him? My linked-data Twitterbot — now called @grausPi), so my Raspberry Pi introduces random linked-data-fact noise into the tweetosphere on an hourly basis. So if you too want to stay up to date with mind-bogglingly interesting facts such as the one at the bottom of this post, check out @grausPi.

In the meantime, my Pi is waiting for a €5 WiFi USB dongle I ordered from DealExtreme, so it can free itself from half of the wires currently attached to it. Then I’ll think about some more projects to run on this Pi. I still have some servo motors, LEDs and LDRs (Light Dependent Resistors — simple light meters) left from Arduino days long gone, and apparently I can control the Pi’s GPIO pins through Python, as opposed to Arduino’s own language, which is cool.
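To give an idea, here’s a minimal blink sketch using the RPi.GPIO Python library; the pin number and LED wiring are assumptions, not an actual setup of mine:

```python
import time
import RPi.GPIO as GPIO  # standard GPIO library on Raspbian

GPIO.setmode(GPIO.BCM)    # use Broadcom pin numbering
GPIO.setup(18, GPIO.OUT)  # assume an LED (plus resistor) wired to GPIO 18

try:
    for _ in range(10):   # blink ten times
        GPIO.output(18, GPIO.HIGH)
        time.sleep(0.5)
        GPIO.output(18, GPIO.LOW)
        time.sleep(0.5)
finally:
    GPIO.cleanup()        # release the pins on exit
```

Now for the take-home message: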

ECIR 2014 in Amsterdam!

The Intelligent Systems Lab Amsterdam (ISLA) at the University of Amsterdam has been awarded the hosting of the European Conference on Information Retrieval (ECIR) in 2014. The ‘local’ organization of this big conference, covering tasks such as arranging a venue, keeping an eye on finances, organizing a social event, and arranging accommodation for conference attendees, will be largely in the hands of me and my fellow PhD candidates. Quite an awesome way, in my opinion, to get some experience in all the aspects involved in organizing such an event.

Within this ECIR2014 ‘local team’, I am entrusted with PR/communication tasks. The first task I’ve completed is getting a website up. It lives here: ecir2014.org. The site was designed by Rutger de Vries/Perongeluk and subsequently ever so subtly destroyed into usability by me and my colleagues. Another PR task is Twitter, as we have full confidence in Twitter remaining the number one micro-blogging platform in 2014 ;-). So, you know what’s left to do: follow @ECIR2014, like ECIR 2014 & visit ECIR2014.org. And put it in your agenda: April 13th to 17th, 2014. See you there!

Entity Linking – From Words to Concepts

The last couple of weeks I’ve been diving into the task of entity linking, in the context of a submission to the Text Analysis Conference Knowledge Base Population track (that’s quite a mouthful – TAC KBP from now on): a ‘contest’ in Knowledge Base Population with a standardized task, dataset and evaluation. Before devoting my next post to our submission, let me first explain the task of entity linking in this post :-).

Knowledge Base Population

Knowledge Base Population is an information extraction task of generating knowledge bases from raw, unstructured text. A Knowledge Base is essentially a database which describes unique concepts (things) and contains information about these concepts. Wikipedia is a Knowledge Base: each article represents a unique concept, the article’s body contains information about the concept, the infobox provides ‘structured’ information, and links to other pages provide ‘semantic’ (or rather: relational) information.

Filling a Knowledge Base from raw textual data can be broadly split up into two subtasks: entity linking (identifying unique entities in a corpus of documents to add to the Knowledge Base) and slot filling (finding relations of and between these unique entities, and adding these to the concepts in the KB).

Entity Linking

Entity linking is the task of linking a concept (thing) to a mention (word/words) in a document. Not unlike semantic annotation, this task is essentially about defining the meaning of a word, by assigning the correct concept to it. Consider these examples:

Bush went to war in Iraq
Bush won over Dukakis
A bush under a tree in the forest
A tree in the forest
A hierarchical tree

It is clear that the bushes and trees in these sentences refer to different concepts. The idea is to link each mention (‘Bush’, ‘bush’, ‘tree’) to the person or thing it refers to. To complete this task, we are given:

  • KB: a Wikipedia-derived knowledge base containing concepts, people, places, etc. Each concept has a title, text (the Wikipedia page’s content), some metadata (infobox properties), etc.
  • DOC: the (context) document in which the entity we are trying to link occurs (in the case of the examples: the sentence)
  • Q: The entity-mention query in the document (the word)

The goal is to identify whether the entity referred to by Q is in the KB; if it isn’t, it should be considered a ‘new’ entity. All new entities should be clustered: when two documents refer to the same ‘new’ entity, this must be represented by assigning the same new ID to both mentions.

A common approach

1. Query Expansion: This means finding more surface forms that refer to entity Q. Two approaches:
I: This can be derived from the document itself, using for example ‘coreference resolution’, where you try to identify all strings referring to the same entity. In the Bush example, you might find “President Bush” or “George W. Bush” somewhere in the document.
II: This can also be done by using external information sources, such as looking up the word in a database of Wikipedia anchor texts, page titles, redirect strings or disambiguation pages. Using Wikipedia to expand ‘Q=cheese’ could lead to:
Title: Cheese. Redirects: Home cheesemaking, Cheeses, CHEESE, Cheeze, Chees, Chese, Coagulated milk curd. Anchors: cheese, Cheese, cheeses, Cheeses, maturation, CHEESE, 450 varieties of cheese, Queso, Soft cheese, aging process of cheese, chees, cheese factor, cheese wheel, cheis, double and triple cream cheese, formaggi, fromage, hard cheeses, kebbuck, masification, semi-hard to hard, soft cheese, soft-ripened, wheel of cheese, Fromage, cheese making, cheese-making, cheesy, coagulated, curds, grated cheese, hard, lyres, washed rind, wheels.

2. Candidate Generation: For each surface form of Q, try to find KB entries that it could refer to. Simple approaches include searching for Wikipedia titles that contain the surface form, looking through anchor texts (the link text used to refer to a specific Wikipedia page from another Wikipedia page), and expanding acronyms (if you find a string containing only uppercase letters, try to find a matching word sequence).

3. Candidate Ranking: The final step is selecting the most probable candidate from the previous step. Simple approaches compare the similarity of the context document to each candidate document (Wikipedia page); more advanced approaches measure semantic similarity at higher levels, e.g. by finding ‘related’ entities in the context document.

4. NIL Clustering: Whenever no candidate can be found (or only candidates with a low probability of being the right one, however that is measured), it can be decided that the entity referred to is not in the KB. In that case, the job is to assign a new ID to the entity, and if it is ever referred to in a later document, to attach this same ID. This is a matter of (unsupervised) clustering. Successful approaches include simple string similarity (the same ‘new’ entity being referred to by the same word), document similarity (using simple comparisons) and more advanced clustering approaches such as LDA/HDP.
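To make these four steps concrete, here is a toy, self-contained Python sketch of the whole pipeline. The mini ‘KB’, the redirect table, the string-overlap ranking and the threshold are all illustrative assumptions; a real system would use a full Wikipedia-derived KB, richer features and a trained ranker.

```python
import difflib
import itertools

KB = {
    "E1": {"title": "George W. Bush",
           "text": "43rd president of the United States; went to war in Iraq."},
    "E2": {"title": "George H. W. Bush",
           "text": "41st president of the United States; defeated Dukakis in 1988."},
}
REDIRECTS = {"George W. Bush": ["Bush", "President Bush", "Dubya"]}

nil_ids = {}                  # step 4: surface form -> stable NIL id
nil_counter = itertools.count(1)

def expand(mention):
    """Step 1: collect alternative surface forms for a mention."""
    forms = {mention}
    for title, alts in REDIRECTS.items():
        if mention == title or mention in alts:
            forms.update([title, *alts])
    return forms

def candidates(forms):
    """Step 2: KB entries whose title contains any surface form."""
    return [eid for eid, entry in KB.items()
            if any(f.lower() in entry["title"].lower() for f in forms)]

def rank(document, cand_ids):
    """Step 3: score candidates by (crude) context/text similarity."""
    scored = [(difflib.SequenceMatcher(None, document, KB[c]["text"]).ratio(), c)
              for c in cand_ids]
    return max(scored, default=(0.0, None))

def link(mention, document, threshold=0.15):
    """All four steps end to end: return a KB id, or a clustered NIL id."""
    score, best = rank(document, candidates(expand(mention)))
    if best is not None and score >= threshold:
        return best
    # Step 4: identical unseen surface forms share the same new id.
    if mention not in nil_ids:
        nil_ids[mention] = "NIL%03d" % next(nil_counter)
    return nil_ids[mention]

print(link("Bush", "Bush went to war in Iraq"))    # -> 'E1'
print(link("Dukakis", "Bush won over Dukakis"))    # -> 'NIL001'
```

Even this naive version shows the moving parts: expansion widens recall, generation narrows the search space, ranking picks a winner, and the NIL step keeps unknown entities consistent across documents.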

Read More

Now, this was just a general introduction. If you are interested in the technical side of some common approaches, take a look at the TAC Proceedings, particularly the Overview of the TAC2011 Knowledge Base Population Track [PDF] and the Proceedings Papers. If you are unsure why Wikipedia is an awesome resource for entity linking (or any form of extracting structured information from unstructured text), I’d recommend reading ‘Mining Meaning from Wikipedia’ (Medelyan et al., 2009).

Next post will hopefully be about ILPS’ award-winning entity linking system, so stay tuned ;-).

Researchblog – First post

As I’ve finished my MSc thesis and have started working at the Information and Language Processing Systems group (ILPS) at the Universiteit van Amsterdam as a PhD candidate, on a project on Semantic Search for e-Discovery, I figured it was time to leave my thesis blog behind and start a brand new research blog.

New project: dataminen.nl

I have started a Dutch blog on data mining, as I haven’t really come across one and figured the time is right, given the increasing interest in data journalism, dataviz, big data, etc. The idea is to provide a general, human-understandable overview of the (academic) field of data mining and its innovations.

I will still use this blog to keep the world informed of my personal endeavours ;-).

It’s a wrap!

Photo by Thijs Niks

This category can soon be archived ;)! Earlier this week I handed in my final paper, and yesterday was the day of my final presentation. It was a great day and I’m really excited about embarking on my next adventure. I will soon start as a PhD candidate at the University of Amsterdam, on a very exciting project in ‘Semantic Search in e-Discovery’ at the Information and Language Processing Systems group. Naturally, this blog will keep the world informed of my work and projects ;). Exciting times!

Paper

Download my paper: Automatic Annotation of Cyttron Entries using the NCIthesaurus [PDF – 328 KB] Download the supplementary data (graphs, tables and viz): Supplementary Data [PDF – 2.27 MB]

Demo

Similarity graph demo

Check out the D3.js-powered demo of a similarity graph (comparing expert & computer-generated annotations)


Subgraph Similarity Example

“(a) A sagittal reconstruction of a coronally acquired magnetic resonance imaging (MRI) scan, at the level on which the cingulate gyrus was measured. The area outlined represents the portion of the scan used to orient the operator to the landmarks of the cingulate. A box has been placed over the region of interest in one hemisphere. (b) A diagram of the cingulate gyrus divided into the rostral portion of the anterior cingulate (RAC), the caudal portion of the anterior cingulate (CAC), and the posterior cingulate (PC). Adjoining landmarks include the corpus callosum (CC), the lateral ventricle (Lat. Vent.), and the thalamus (Thal.). (c) The region of the cingulate gyrus measured in the present study, as delineated on the MRI scan of a control subject. […] ” (snippet)

» Check out the d3js demo here «

Two sets of annotations (Expert 1 & Expert 3) result in the following similarity graph:

Measure and Visualize Semantic Similarity Between Subgraphs

As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’: it ignores any semantic information, simply treats the concepts as words, and counts hits and misses between the two sets of concepts.

To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will measure the algorithms’ performance by measuring the semantic similarity between the two lists of concepts. The two lists (expert data & algorithm output) are treated as subgraphs within the main graph, the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, as the literature I found shows it consistently outperforming similar path-based metrics in the biomedical domain. Speaking of domains: this measure was originally designed for WordNet (as were many of the other metrics), but it has also been used and validated in the biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-).
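The measure itself is pleasantly simple: it takes the shortest path between two concepts and scales it by twice the maximum depth of the taxonomy. A minimal sketch, assuming the NCI Thesaurus has been loaded as an undirected NetworkX graph (the MAX_DEPTH value and names here are placeholders, not my actual code):

```python
import math
import networkx as nx

MAX_DEPTH = 16  # placeholder; in practice, the thesaurus' actual maximum depth

def lch_similarity(graph: nx.Graph, c1: str, c2: str) -> float:
    """Leacock & Chodorow: -log(path_length / (2 * MAX_DEPTH))."""
    # shortest_path_length counts edges; guard against log(0) when c1 == c2
    path_length = max(nx.shortest_path_length(graph, c1, c2), 1)
    return -math.log(path_length / (2.0 * MAX_DEPTH))
```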

Force-Directed Graphs: Playing around with D3.js

Update: Newer example of Force-Directed d3.js Graph here: Measure and Visualize Semantic Similarity Between Subgraphs

I recently replaced python-graph in my code with NetworkX, a slightly more sophisticated graph library for Python. Besides some more advanced algorithms for graph analysis (comparison, union, etc.), which can prove useful when analyzing data (comparing human data with mine, for example), it also lets me easily export my graphs to all kinds of formats, for example to JSON. As I was getting a bit tired of GraphViz’s stubborn methods and its far-from-dynamic approach, I decided to start playing around with the excellent Data-Driven Documents JavaScript library, better known as D3.js, the successor to Protovis. Actually, I had planned this quite a while ago, simply because I was impressed with the Force-Directed Graph example on their website. I figured that for coolness’ sake, I should implement it, instead of using the crummy GraphViz graphs.
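As an example of how painless that export is: getting a NetworkX graph into the node/link JSON shape that D3’s force-directed examples consume takes only a few lines (the toy graph below is illustrative):

```python
import json
import networkx as nx
from networkx.readwrite import json_graph

g = nx.Graph()
g.add_edge("Cheese", "Milk", relation="madeFrom")
g.add_edge("Cheese", "France", relation="origin")

# node_link_data produces a dict with 'nodes' and 'links' lists,
# which d3's force layout can read more or less directly.
with open("graph.json", "w") as f:
    json.dump(json_graph.node_link_data(g), f)
```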

So after a night and day of tinkering with the D3 code (starting from the Graph example included in the release, modifying stuff as I went) I came to this:

Click to play!

The red nodes are the concepts taken from the texts (either literal: filled red circles, or resulting from text classification: red donuts). The orange nodes are LCS-nodes (Lowest Common Subsumers), aka ‘parent’ nodes, and all the grey ones are simply in-between nodes (either for shortest paths between nodes, or parent nodes).

I added the labels, implemented zoom and panning functionality (mouse wheel to zoom, click and drag to pan), and included some metadata (hover over nodes to see their URI, over edges to see the relation). I am really impressed with the flexibility of D3; it’s amazing that I can now load up any random graph produced by my script and instantly see results.

The bigger plan is to make a fully interactive graph, starting with the ‘semantic similarity’ graph (where only the red nodes are displayed), in which clicking on edges expands the graph by showing the relationship between two connected nodes. Semantic expansion at the click of a mouse ;)!

In other news

I’ve got a date for my graduation! If everything goes right, March 23rd is the day I’ll present my finished project. I’ll let you know once it’s final.

Ontology-based Text Visualization: Towards a Visual Language

The first step towards creating graph-based text visualizations is thinking about how I would like to visually represent the structure of the information I extract from texts. I have included a preliminary example of the output at the bottom of this page. I use three different methods of extracting concepts from a piece of text:

  1. Counting literal occurrences of concepts (from ontologies)
  2. Finding related concepts by textual comparison of the text to the concepts’ descriptions
  3. Finding related concepts by exploring the ontological structure (aka relating concepts within one ontology by finding paths and parents, and possibly relevant neighbours)

The primary distinction I want to make is that of relevance (aka ‘likeliness of topic’). In the case of 1, this is the frequency of the word (more occurrences = a more relevant concept); in the case of 2, it is the calculated similarity of the source text to the concept’s description (a number between 0 and 1). In the case of 3, it is the number of ‘connections’: the more concepts link to a concept I find by exploring an ontology’s structure, the more relevant that concept is. I want to model this distinction by node size: the more relevant a concept is, the bigger I draw it in the graph.

The second distinction is that of literal versus non-literal (1 being literal, 2 and 3 being non-literal). I want to model this distinction by style: literal concepts are drawn as filled circles, non-literal concepts as outlined circles.

The third distinction is that of the concept’s source: from which ontology does a concept originate? This distinction is modeled by color: each ontology (of the six I use) has its own distinct color. Explored concepts (from step 3), such as parents and shared parents, are colored distinctly as well; since they are connected in the graph to the colored nodes, their source will implicitly be clear.

Since Gephi doesn’t fully support Graphviz’s DOT language, and the graph library I use in Python conveniently handles graphs in DOT, I use Graphviz to render the results directly.

An issue I came across with the scaling (used to represent relevance) is that I’m working with multiple measures: the frequency of literal words (1), the percentage of text similarity (2), and the degree count (3). In an effort to roughly equalize the scaling factors, I decided to use static increments. Each node gets an initial size of 0.5 (0.4 for shared parents, and 0.3 for parents). For each occurrence of a literal word, I add 0.05. For text similarity, I add that fraction of 0.5 (26% similarity = 0.5 + (0.5*0.26)). For the degree, I add 0.1 for each in-link the node receives. This is an initial attempt at unifying the results; anyway, these are just settings I’m playing around with.
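In code, this sizing scheme boils down to something like the sketch below; the numbers are the ones above, while the function and argument names are just illustration:

```python
def node_size(kind="concept", occurrences=0, similarity=0.0, in_links=0):
    """Combine the three relevance measures into one node size."""
    base = {"concept": 0.5, "shared_parent": 0.4, "parent": 0.3}[kind]
    return (base
            + 0.05 * occurrences   # method 1: literal word frequency
            + 0.5 * similarity     # method 2: text similarity in [0, 1]
            + 0.1 * in_links)      # method 3: in-link degree

print(node_size(similarity=0.26))  # 26% similarity -> 0.5 + 0.5*0.26 = 0.63
```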

Examples


These are very rough examples, because:
  • I use the literal representation of the input text (as I have not yet determined the most suitable keyword extraction method)
  • I haven’t determined a proper ‘cut-off’ for the text similarity measure; currently it includes every concept it finds with similarity ≥ 25%.
  • It doesn’t yet fully incorporate step 3 (it includes parents, but not yet paths between nodes)
  • It doesn’t scale nodes according to in-links
  • There is no filtering applied yet (removing obsolete classes, for example).

Geomapping the Bible and Herman Melville’s Moby Dick

For a small dataviz experiment, I wanted to create maps of books by extracting locations (cities, countries, continents, whatever is mentioned in the text) and drawing these on a map. I used the Stanford Named Entity Recognizer to extract the locations from two books: the Bible and Herman Melville’s Moby Dick. I then wrote a small Python script to retrieve the latitude and longitude of each location using the Google Geocoding API, throw it all in a CSV file, and draw it on a map using GeoCommons. I also attached an ascending date to each location, to allow an animated visualization of the extracted locations in GeoCommons.
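The geocoding step is the only non-obvious part, so here is a minimal sketch of it, assuming the locations have already been extracted by the NER. The sample locations, file name and API key are placeholders, and the JSON fields follow the Geocoding API’s documented response format:

```python
import csv
import requests

locations = ["Nantucket", "Jerusalem", "Nineveh"]  # sample NER output

with open("locations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["location", "latitude", "longitude"])
    for name in locations:
        resp = requests.get(
            "https://maps.googleapis.com/maps/api/geocode/json",
            params={"address": name, "key": "YOUR_API_KEY"},  # placeholder key
        ).json()
        if resp.get("results"):
            loc = resp["results"][0]["geometry"]["location"]
            writer.writerow([name, loc["lat"], loc["lng"]])
```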

The darker a circle, the more mentions it got (I set the circles’ opacity to 10%, so overlapping circles automatically darken). There were some issues with false positives (Stanford NER identifying persons as locations). And while I didn’t really know what to expect, I was glad to see that the major clusters in both maps did seem to make sense (Nantucket in Moby Dick, the area around Jerusalem in the Bible). The Bible geomap shows that a lot of places (particularly in the United States) seem to be named after Biblical locations and names; the cluster on the West Coast of the US seems as big as the Middle Eastern cluster, but once you zoom in, it becomes clear that it is less tightly packed. Moby Dick’s geomap shows a lot of locations around coastal areas, which makes sense, as the book also mentions a lot of oceans and seas.


Affective computing project

Project with René Coenen and Ronald Bezemer, for the elective ‘Seminar Affective Computing’ at the Delft University of Technology.

Title: Long Term Affect Expression in English Text to Speech Synthesis

Abstract: “Is it possible to improve the perceived naturalness of computer-generated speech by adding the expression of long-term affect? In order to answer this question we built a system which is able to analyse texts for affective content, interpret this affective content with an emotional model (using different initial emotional states), and express the interpreted affect using an affective Text-to-Speech engine. Our findings indicate that adding this extra form of expressiveness to Text-to-Speech engines does in fact improve perceived naturalness.”

For this project, I created a script to extract affective content from texts using the ANEW (Affective Norms for English Words) lexicon. The affective content is represented as PAD values (pleasure, arousal, dominance). We then implemented several ’emotion models’, or personas, which each processed the raw affective content differently. An example of the raw PAD values for the Bible can be found @ graus.nu/project/bible.
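As an illustration, extracting raw PAD values with an ANEW-style lexicon boils down to a lookup and an average. A minimal sketch, with made-up sample norms (the real lexicon values are distributed separately):

```python
import re

# word -> (pleasure, arousal, dominance); sample values, not real ANEW norms
ANEW = {
    "love": (8.7, 6.4, 6.0),
    "war":  (2.1, 7.5, 4.7),
}

def pad_score(text):
    """Average the PAD values of all lexicon words found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [ANEW[w] for w in words if w in ANEW]
    if not hits:
        return None
    return tuple(sum(dim) / len(hits) for dim in zip(*hits))

print(pad_score("Make love, not war"))  # -> (5.4, 6.95, 5.35)
```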

#OccupyAmsterdam wordle

A Wordle of the 200 most frequent words in tweets with the hashtag #OccupyAmsterdam. Made from 5,239 tweets posted between Saturday, October 8, 09:55 and October 16, 15:50.
Manually filtered for nicknames and meaningless words. Here is the list of the 1,000 most frequent words: OccupyAmsterdam-woorden.