Not dead (yet)

While I haven’t been as active and hard-working on my graduation project as I would have liked, I am not dead (and neither is the project). Earlier this week I presented my project to the Bio-Imaging Group of Leiden University, which helped me a lot. I was able to present my project pretty much as-is, since I’m mostly done with the technical parts. I received valuable feedback and got good insights into what I should explain more thoroughly in the presentation.

Graduation project

Currently I am working on my final project for the Media Technology MSc programme of Leiden University. To structure my thoughts and process so far, and because I promised to on Twitter, I decided to write a small and simple summary of what my project is about, how I got here and what I’m expecting to do in the next 2-3 months. If you want to jump ahead to what my project is about, skip to “So, what am I actually going to do?”.

A short history of my Media Technology graduation project

The idea of a graduation project for this particular master’s programme is to come up with your own inspiration for a small autonomous research project. As Media Technology is part of the Leiden Institute of Advanced Computer Science, using ‘computer science’ as a tool in your research is not uncommon.

After finishing the last couple of courses, I started looking for inspiration for a research project. In a previous course I had come into contact with (low-level) text analysis tasks, using the Python programming language and NLTK (Natural Language Toolkit, a very cool, free and open-source text analysis ‘Swiss army knife’). I became interested in the possibilities of (statistical) text analysis. I liked the idea of using simple tools to perform research on the web, so I started looking at the features of NLTK and at different Natural Language Processing techniques to include semantics in “web-research”. Having found these starting points, it was time to formulate research proposals.

My initial proposal was not very well fleshed out. It was more of a way to let the Media Technology board know what I was looking at, and basically to receive a go for the actual work (which to me still meant defining my actual project). The proposal involved crawling lots of blogs to perform small-scale analyses on, using low-level NLP techniques to go beyond simple statistics and word-frequency-type research and to include meaning and semantics. The board decided my proposals were concrete enough to approve.

Another part of sending in my proposals and going ahead with the project was finding a supervisor. From a course on AI I took last year I remembered a PhD student at Leiden University who was interested in semantics and the semantic web, so I figured he would be the right person to talk to. Soon after contacting him I understood he was only allowed to supervise me if my research contributed to what the Bio-Imaging Group was working on. This worried me at first, but after talking with Joris, I figured such a project could actually be close enough to what I wanted to do, with the added advantages that:

  • My research would actually contribute to something
  • My domain would be comfortably restricted

So, what am I actually going to do?

The short explanation: automatically analyzing and categorizing a large number of texts in order to determine their subjects. In my specific case the texts will be ‘free-form’, natural language descriptions of microscopic imagery from the Cyttron database. This database contains a large number of images, each accompanied by a small description (written by scientists) and a small list of tag words. That is, if either of these fields is filled in at all. Because of the inconsistent style and method of writing these descriptions, an automated system to properly categorize the images would be very useful.

To perform this analysis, the idea is to use biological ontologies. Ontologies are basically large ‘dictionaries’ containing very specific (biological) terms with their definitions. The ontologies contain not only these definitions, but also how the terms relate to each other. This basically provides me with a hierarchy of terms that says what is part of what, equal to what, etc.
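
To make the idea of a term hierarchy a bit more concrete, here is a minimal sketch in plain Python. The terms, the definitions and the `part_of` field are made up for illustration and are not taken from any real ontology; an actual ontology would be read from an OWL/RDF file and contains many more relation types.

```python
# Toy ontology: every term has a definition and an optional "part of" parent.
# All entries are illustrative assumptions, not data from a real ontology.
ONTOLOGY = {
    "cell": {"definition": "basic unit of life", "part_of": None},
    "nucleus": {"definition": "organelle containing the genome", "part_of": "cell"},
    "nucleolus": {"definition": "site of ribosome assembly", "part_of": "nucleus"},
}


def ancestors(term):
    """Walk up the part_of chain and return the parents, nearest first."""
    chain = []
    parent = ONTOLOGY[term]["part_of"]
    while parent is not None:
        chain.append(parent)
        parent = ONTOLOGY[parent]["part_of"]
    return chain


def depth(term):
    """Nesting depth of a term: 0 for a root term, 1 for its parts, and so on."""
    return len(ancestors(term))


print(ancestors("nucleolus"))
print(depth("nucleolus"))
```

A text that only mentions ‘nucleolus’ can then still be linked to ‘nucleus’ and ‘cell’ through the hierarchy, which is the kind of extra information the ontology is meant to provide.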

Using these ontologies to analyze the texts makes it possible not only to determine the subject of a text, but also to use the data in the ontology to say more about the subject than can be found in the text itself.

When I run into problems, I could at some point determine whether existing (biological) ontologies are missing data, or whether there are more fundamental issues with matching the human-produced data to the ontologies.

How am I going to do this?

This part is very much subject to change, as I am taking my first steps in the entire semantic web/OWL/RDF-world, but also in the Python/NLTK-programming world. My current idea is:


  • Python for text-processing
  • RDFLib to read ontologies
  • NLTK for the ‘language tasks’: stemming words, filtering for keywords, etc.


  1. Scanning the database for occurring ontology-terms (literal matches)
  2. Generating a list of keywords from both the free-form text and the ontology-term descriptions, to try to match those if no literal matches are found. I could try this using a bag-of-words-model, to remove all ‘common’ words, and keep the more specific/interesting ones. Another approach is to remove all stopwords from the texts and count the frequency of the remaining words.
  3. Possibly looking at keyphrase extraction instead of simple keywords [or maybe looking at word collocations/chunk extraction?]. 
  4. Applying fuzzy word matching to allow for typos in the texts.
  5. Performing a statistical analysis on the likelihood of the subject. My thought is that ‘more specific’ (i.e. more deeply nested) ontology terms should weigh heavier than more general terms. I might also find clusters of terms (a number of terms that are more related to each other than to the other terms found) to further specify the likelihood of the subject matter. But I can imagine that when I actually get to this point, new ideas might emerge.
  6. The idea is to acquire some (humanly-checked) training data so I can optimize the system and see what approaches work best.
And that’s about as far as I am right now. The real work, with new problems and approaches, will probably surface as soon as I get more into the material.
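
To give a rough feeling for steps 1, 2, 4 and 5, here is a small sketch using only the Python standard library. It is a stand-in under stated assumptions, not the actual system: the term list with its nesting depths, the tiny stopword list, the `cutoff` value and the ‘count times depth’ weighting are all made up for illustration, and `difflib` takes the place of whatever fuzzy matcher ends up being used.

```python
import difflib
import re
from collections import Counter

# Hypothetical ontology terms mapped to nesting depth (deeper = more specific).
TERM_DEPTH = {"cell": 0, "nucleus": 1, "nucleolus": 2, "mitochondrion": 1}

# Tiny illustrative stopword list; NLTK ships a much fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "with", "is", "are", "this", "shows"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keywords(text):
    """Step 2: remove stopwords and count the remaining words."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


def match_terms(text, cutoff=0.8):
    """Steps 1 and 4: literal matches first, fuzzy matches as a fallback."""
    matches = Counter()
    for word, count in keywords(text).items():
        if word in TERM_DEPTH:
            matches[word] += count
            continue
        close = difflib.get_close_matches(word, list(TERM_DEPTH), n=1, cutoff=cutoff)
        if close:
            matches[close[0]] += count
    return matches


def score_terms(text):
    """Step 5: weigh deeper (more specific) terms more heavily."""
    return {term: count * (1 + TERM_DEPTH[term])
            for term, count in match_terms(text).items()}


# The typo "nucleolis" is deliberate, to exercise the fuzzy matching.
description = "Image shows the nucleolis in the nucleus of a cell"
print(score_terms(description))
```

In this toy weighting the most specific term that is found ends up with the highest score, which is one simple way to express that deeper nested terms should count for more.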

And what if it works?

Even though this sounds far away currently, I will have to take this scenario into account :p. My idea is to apply the software I have written to other domains. Maybe even the domain I was thinking about earlier (using the web as a source for research: blogs, social media, news sites, wiki/dbpedia, etc.). I already came across the OpenCyc ontology – “hundreds of thousands of terms, along with millions of assertions relating the terms to each other, forming an ontology whose domain is all of human consensus reality”. Which sounds pretty damn awesome.

Some quick ideas I had were using this ontology to create some sort of ‘semantic recommender system’ (on what domain? News sites? Blogs?), or find some other way to extract meaning from large corpora of texts. Anyway, those are ideas for the future, but I hope that I’ll be able to play around a bit with different applications by the time I’ve finished what I’m supposed to do :).

CS Column 2: Evolution

Mobb Deep’s Vision on Evolution Theory

“Yo, yo
We livin’ this till the day that we die
Survival of the fit, only the strong survive”

Mobb Deep, Survival of the Fittest (1995)

While I seriously doubt Mobb Deep’s ‘Survival of the Fittest’ was intended to enlighten their audience about evolution theory, I’d like to use the song to discuss the famous “survival of the fittest” slogan. Because next to being a Mobb Deep song (from the album ‘The Infamous’), it’s also a famous, popular and punchy ‘summary’ of Darwin’s evolution theory. The phrase was coined by Herbert Spencer in 1864, after he had read Darwin’s revolutionary “On the Origin of Species” (1859); Darwin himself adopted it in the fifth edition of his book (1869).

In their song, Mobb Deep rap about living and surviving the harsh street life in Queens, New York City. Listening to this fine piece of East Coast rap made me wonder how scientifically valid this ‘street knowledge’ they provide us could be…

In the chorus Mobb Deep further elaborate on their title: ‘Survival of the fit, only the strong survive’. Shouldn’t that be ‘Survival of the fit, only the well adapted survive’? It might not sound as nice, but it would be more correct, at least from an evolution theory point of view. Darwin’s evolution theory does not imply that the strongest or most physically fit will survive. It implies that the individuals that fit best in their environment will! This misinterpretation of the word ‘fit’ in ‘survival of the fittest’ is (unfortunately) a very common one.

Darwin’s evolution theory is not about being strong; it is about adapting to the environment, surviving, and ultimately about reproducing to pass on genes. So, while Prodigy (one of the two rappers in Mobb Deep) raps “I’m goin’ out blastin’, takin’ my enemies with me / And if not, they scarred, so they will never forget me”, one could argue he’d be better off staying at home and reproducing (which, to be fair, is another recurring theme in Mobb Deep’s work).

But before we accuse Mobb Deep of misunderstanding the word ‘fit’, let’s consider a possible alternative explanation: the artists of Mobb Deep might completely disagree with evolution theory as Darwin formulated it. Rather, they might be strong advocates of the ideas of Herbert Spencer – the man who invented the slogan.

Spencer was a firm believer in Social Darwinism (before it was called Social Darwinism): the application of Darwin’s evolution theory to ideas about human society. It holds that in society, the strong will survive at the cost of the weak, and that man should not offer a helping hand to the weak in society, as that would go against the natural order of things.

A controversial philosophy, especially today, but could it make sense if we put it in the context of Mobb Deep? The rappers came from a poor life in the ghetto, worked their way up, sold millions of albums and eventually became wealthy through it. One could argue that Prodigy and Havoc are in fact the fittest to survive in contemporary human society!

Whatever the case, misinterpretation of a word or strong Social Darwinism, the fact remains that ‘survival of the fittest’ is a pretty strong and powerful slogan – one for which I personally do not mind whether it’s applied in scientifically correct ways or not!

CS Column 1: Emergence

Emerging chaos: The rules of Vietnamese traffic

When I took this picture in Ho Chi Minh City, Vietnam, I was awe-struck by the chaotic traffic. Dozens of “motobikes” buzz down the streets, seemingly not paying any attention to traffic lanes and rules, oncoming traffic or anything in their vicinity. Cars move through the thick clouds of bikes, and some brave souls even pedal their bicycles straight through it.

For an outsider such as myself, it initially looked like a totally random and chaotic event. Did these people just hope for the best when they were driving through their city? It was obvious all of this chaos would have to work out one way or another. Eventually – I assumed – everyone got where they were going. But how?

Soon I learned there is in fact a system at play, and there are plenty of unwritten rules involved in the apparent chaos. You learn this with the one confrontation you cannot avoid: crossing a road on foot (a very intimidating undertaking at first). The basic rule is simple: keep on moving – as long as you do, people manage to anticipate your path and will make sure not to crash into you. The next step is that of total immersion: hop on a bike and jump right into traffic.

Once you participate, you realize how simply it actually works. It felt like I was part of a flock – all neighboring motomen adjusted and maintained their speed based on mine and that of the other drivers directly around us. This was not at all obvious when I was observing the traffic from the sidewalk. Eventually I didn’t even worry about horrible fatal accidents anymore, a theme that was predominantly on my mind when I was only watching the traffic…

Even if it looks simple when you’re in traffic, there is still speeding and overtaking, and not everyone is heading to the same destination, so people are constantly moving in and out of the flock. The same principle applies, however: when you take a turn, all is fine as long as your movement is fluid. It’s not the turn signals that will save you here: clear and predictable movement will.

What at first seemed totally unnatural to me started feeling more natural, and eventually made sense to me. But it only started to make real sense once I was back home and started reading about swarm intelligence and flocking behavior. The same three rules flocking behavior dictates seem to apply in Vietnamese traffic: separation (avoiding neighbors), alignment (keeping roughly the same direction) and cohesion (sticking together). These simple rules are all you need to create a realistic computer model of a flock of birds, and indeed it’s also all you need to create what seems to be ordered chaos on the roads of a Vietnamese city – I followed the same rules when driving through Ho Chi Minh City on my rental motobike.
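
For the curious, the three rules can be sketched in a few lines of Python. This is a minimal toy version under my own assumptions, not a reference implementation of any particular flocking model: the function name `step`, the dict layout of a ‘boid’ and all the weights and distances are made up for illustration.

```python
import math


def step(boids, radius=5.0, sep_dist=1.0, w_sep=0.05, w_ali=0.05, w_coh=0.01):
    """Advance every boid one tick. Each boid is a dict with x, y, vx, vy."""
    updated = []
    for b in boids:
        # Only nearby boids influence this one, just like drivers around you.
        neighbors = [o for o in boids if o is not b
                     and math.hypot(o["x"] - b["x"], o["y"] - b["y"]) < radius]
        vx, vy = b["vx"], b["vy"]
        if neighbors:
            n = len(neighbors)
            # Cohesion: steer towards the average position of the neighbors.
            vx += w_coh * (sum(o["x"] for o in neighbors) / n - b["x"])
            vy += w_coh * (sum(o["y"] for o in neighbors) / n - b["y"])
            # Alignment: steer towards the average velocity of the neighbors.
            vx += w_ali * (sum(o["vx"] for o in neighbors) / n - b["vx"])
            vy += w_ali * (sum(o["vy"] for o in neighbors) / n - b["vy"])
            # Separation: move away from neighbors that are too close.
            for o in neighbors:
                if math.hypot(o["x"] - b["x"], o["y"] - b["y"]) < sep_dist:
                    vx += w_sep * (b["x"] - o["x"])
                    vy += w_sep * (b["y"] - o["y"])
        updated.append({"x": b["x"] + vx, "y": b["y"] + vy, "vx": vx, "vy": vy})
    return updated


# Two riders heading in different directions gradually line up.
flock = [{"x": 0.0, "y": 0.0, "vx": 1.0, "vy": 0.0},
         {"x": 2.0, "y": 0.0, "vx": 0.0, "vy": 1.0}]
for _ in range(3):
    flock = step(flock)
print(flock)
```

Nobody in this model knows where the flock as a whole is going; every rider only looks at the handful of riders directly around them, which is pretty much how the traffic felt from the saddle.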

As a matter of fact, when I came back home I had to re-adjust to the way traffic works in Holland. Traffic lights, zebra crossings, and the rules of the road were deciding for me where I was going. The Vietnamese traffic, which at first seemed unnatural, chaotic and most of all very scary, eventually felt natural, ordered and elegant in its simplicity.