For a small dataviz experiment I wanted to create maps of books, by extracting locations (cities, countries, continents, whatever is mentioned in the text) and drawing these on a map. I used the Stanford Named Entity Recognizer to extract the locations from two books: the Bible and Herman Melville’s Moby Dick. I then wrote a small script in Python to retrieve the latitude and longitude of the locations using the Google Geocoding API, throw it all in a CSV file, and draw it on a map using GeoCommons. I also attached an ascending date to each location, in order to allow an animated visualization of the extracted locations in GeoCommons.
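The pipeline from extracted location names to a GeoCommons-ready CSV can be sketched as follows. This is an illustrative reconstruction, not the original script: a stubbed lookup table stands in for live calls to the Google Geocoding API, and the date scheme is invented for the example.

```python
import csv

# Stub standing in for the Google Geocoding API; a real run would
# request maps.googleapis.com/maps/api/geocode/json per location.
STUB_GEOCODER = {
    "Jerusalem": (31.7683, 35.2137),
    "Nantucket": (41.2835, -70.0995),
}

def geocode(location):
    """Return (lat, lon) for a location name, or None if unknown."""
    return STUB_GEOCODER.get(location)

def locations_to_rows(locations):
    """Turn NER-extracted location names into CSV rows with an
    ascending date column, for animated playback in GeoCommons."""
    rows = []
    for day, name in enumerate(locations, start=1):
        coords = geocode(name)
        if coords is None:  # skip unresolvable names (e.g. NER false positives)
            continue
        lat, lon = coords
        rows.append({"location": name, "lat": lat, "lon": lon,
                     "date": "2010-01-%02d" % day})
    return rows

def write_csv(rows, path):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["location", "lat", "lon", "date"])
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the date tied to the position in the text is what lets GeoCommons animate the locations in reading order.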
The darker a circle, the more mentions it got (I set the circles’ opacity to 10%, so overlapping circles automatically darken). There were some issues with false positives (Stanford NER identifying persons as locations). And while I didn’t really know what to expect, I was glad to see that the major clusters in both maps did seem to make sense (Nantucket in Moby Dick, around Jerusalem in the Bible). The Bible geomap shows that a lot of places (particularly in the United States) seem to be named after Biblical locations and names. The cluster on the West Coast of the US seems as big as the Middle Eastern cluster, but once you zoom in it becomes clear that it is less tightly packed. Moby Dick’s geomap shows a lot of locations around coastal areas, which makes sense, as the book also mentions a lot of oceans and seas.
Title: Long Term Affect Expression in English Text to Speech Synthesis
Abstract: “Is it possible to improve the perceived naturalness of computer-generated speech by adding the expression of long-term affect? In order to answer this question we built a system which is able to analyse texts for affective content, interpret this affective content with an emotional model (using different initial emotional states), and express the interpreted affect using an affective Text-to-Speech engine. Our findings indicate that adding this extra form of expressiveness to Text-to-Speech engines does in fact improve perceived naturalness.”
For this project I created a script to extract affective content from texts, using the ANEW (Affective Norms for English Words) lexicon. The affective content was represented in PAD values (pleasure, arousal, dominance). We then implemented several ’emotion models’, or personas, which each processed the raw affective content differently. An example of the raw PAD values from the Bible can be found at graus.nu/project/bible.
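The lexicon-based extraction step can be sketched like this. The PAD values below are made up for illustration and are not the actual ANEW norms; the real script scored texts against the full lexicon.

```python
# Toy ANEW-style lexicon: word -> (pleasure, arousal, dominance).
# Values are invented for illustration, not the published ANEW norms.
ANEW_LIKE = {
    "love":  (8.7, 6.4, 6.0),
    "death": (1.6, 4.6, 3.3),
    "war":   (2.1, 7.5, 4.1),
}

def text_to_pad(text):
    """Average the PAD values of all lexicon words found in the text.
    Returns None if no affective words are present."""
    hits = [ANEW_LIKE[w] for w in text.lower().split() if w in ANEW_LIKE]
    if not hits:
        return None
    n = len(hits)
    return tuple(sum(dim) / n for dim in zip(*hits))
```

A persona (emotion model) would then transform this raw PAD signal, for instance by smoothing it over time or biasing it towards its initial emotional state.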
Complaints in the Cloud was the final project for ‘Creative Research’, a course by Maarten Lamers and Bas Haring, as part of the Media Technology MSc. programme’s curriculum, in 2009. Together with Barry Borsboom and René Coenen, I tried to find a correlation between complaining behavior on Twitter and a ‘real-world situation’.
Does Twitter represent the state of affairs in the real world? To research this, we created a dataset consisting of user-reported delays gathered from the social network Twitter, and information on delays acquired through de Nederlandse Spoorwegen’s RSS feed on delays during the first two weeks of November. Our approach is motivated by the key observation that when people get bored, they tend to grab their mobile phone to kill time. Certain Twitter search queries show there are a lot of people using Twitter in or around a train (station), usually a place where people are either waiting or traveling.
The analysis of our dataset reveals that, in general, the amount of Twitter complaints coincides with the duration and number of delays: where the value of one is high, the other generally is as well. Thus, based on the data at hand, we can conclude that there is in fact a correlation between the reported delays and online complaints on Twitter. Unfortunately, we didn’t succeed in pointing out a specific relation between individual trajectories and the amount of complaints, but this might well be because of the scope of our research.
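The correlation claim above can be made concrete with a Pearson coefficient over per-day counts. The numbers below are hypothetical stand-ins, not our actual dataset; the sketch only shows the shape of the computation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical daily counts: NS-reported delays vs. Twitter complaints.
delays     = [3, 7, 2, 9, 5, 1, 8]
complaints = [4, 9, 3, 11, 6, 2, 10]
```

A coefficient close to 1 would support the “where one is high, the other is too” observation; a per-trajectory breakdown would need the counts split by route, which our dataset was too small for.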
The Academic PDF Reader is a project I did together with Bertram Bourdrez for the Human Computer Interaction course, part of the Media Technology curriculum, in 2009. It is an exploration on a new way of displaying and interacting with PDF documents, specifically intended for scientific papers. For the project, I researched the specific process of reading a scientific paper. Based on this research I conceptualized some interaction and design principles.
The main finding is that academic papers are read primarily in a non-serial, scanning fashion, and that readers generally have an accurate mental model of a document’s structure. To better support this non-serial reading behaviour, we designed a PDF reader with two ways of presenting documents: on the one hand a horizontal, ‘zoomed-out’ view that provides a structured overview of the document, supporting the reader’s mental model of its structure; on the other hand a more classic serial ‘reading’ mode.
The project consisted of conceptualizing, researching, designing and developing a novel HCI application, and performing user tests to evaluate our prototype.
The abstract of our paper:
University students read a large number of scientific articles during their studies. Choosing to read digital texts directly from the computer screen, as opposed to printing them first, can be a time- and money-saving decision. Experienced readers of academic articles use a similar approach to reading academic documents: in a non-serial fashion, and by knowing the similar structure these articles generally share. We find current PDF readers insufficiently capable of supporting this method of reading. Our PDF reader aims to better support the reading of academic papers by translating beneficial properties of reading from physical paper to the display of digital texts. As the reading method applied by experienced readers is non-serial, it is important to be able to quickly navigate through pages and get a clear overview of the text’s structure. Our PDF reader offers an alternative method of displaying digital texts, to optimally support the reading of academic articles.
The Real Internet Globalizer is a concept for an internet browser widget, designed together with Barry Borsboom. The widget aims to actively contribute to a more globalized internet. It is designed with three principles in mind:
Creating awareness of the geographical size of the user’s ‘personal internet’
Actively contributing to expanding this geographically confined internet
Providing feedback to the user on the progress made in the geographical size of their internet
For the primary feature, The Real Internet Globalizer gathers geographical data on all the news websites the user visits. It displays the distribution of the different countries visited in an infographic on a map of the world, to provide instant insight into where the user surfs most frequently. For the second feature, TRIG will find and ‘suggest’ similar content to the user, from parts of the world where the user normally doesn’t surf. This allows the user to judge whether other countries provide other points of view. By following the suggestions, the user’s surfing behaviour will become more globalized.
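The primary feature boils down to tallying countries over a browsing history. The sketch below is a minimal illustration; the domain-to-country mapping is a hypothetical stand-in for a real GeoIP or whois lookup.

```python
from collections import Counter

# Hypothetical domain -> country mapping; a real widget would use a
# GeoIP database or whois data instead of a hard-coded table.
DOMAIN_COUNTRY = {
    "nytimes.com": "US",
    "bbc.co.uk": "GB",
    "lemonde.fr": "FR",
}

def country_distribution(visited_urls):
    """Tally the countries of the news sites in a browsing history,
    ready to be drawn as an infographic on a world map."""
    counts = Counter()
    for url in visited_urls:
        domain = url.split("/")[2]  # 'https://bbc.co.uk/news' -> 'bbc.co.uk'
        country = DOMAIN_COUNTRY.get(domain)
        if country:
            counts[country] += 1
    return counts
```

The resulting counts per country are exactly what the world-map infographic would visualize; the suggestion feature would then favor content from countries with low counts.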
The Real Internet Globalizer was inspired by a talk by Ethan Zuckerman during the Cloud Intelligence Symposium at the Ars Electronica Festival in 2009.
12/12/12 update: since @sem_web moved to live in my Raspberry Pi, I’ve renamed him @grausPi
The last couple of days I’ve spent working on my graduation project by working on a side-project: @sem_web, a Twitter bot that queries DBPedia (Wikipedia’s ‘linked data’ equivalent) for knowledge.
@sem_web is able to recognize 249 concepts, defined by the DBPedia ontology, and sends SPARQL queries to the DBPedia endpoint to retrieve more specific information about them. Currently, this means that @sem_web can check an incoming tweet (mention) for known concepts, and then return an instance (example) of the concept, along with a property of this instance, and the value for the property. An example of Sam’s output:
[findConcept] findConcept('video game')
[findConcept] Looking for concept: video game
[findInst] Seed: [u'http://dbpedia.org/class/yago/ComputerGame100458890',
[findInst] Has 367 instances.
[findInst] Instance: Fight Night Round 3
[findProp] Has 11 properties.
[findProp] [u'http://dbpedia.org/property/platforms', u'platforms']
[findVal] Property: platforms (has 1 values)
[findVal] Value: Xbox 360, Xbox, PSP, PS2, PS3
[findVal] Domain: [u'Thing', u'work', u'software']
[findVal] We're talking about a thing...
Fight Night Round 3 is a video game. Its platforms is Xbox 360, Xbox,
PSP, PS2, PS3.
This is how it works:
1. Look for words occurring in the tweet that match a given concept’s label.
2. If a concept is found: send a SPARQL query to retrieve an instance of the concept (an object with rdf:type concept).
3. If no concept is found: send a SPARQL query to retrieve a subClass of the concept, and go to step 1 with this subClass as the concept.
4. If an instance is found: send SPARQL queries to retrieve a property, value and domain of the instance. The domain is used to determine whether @sem_web is talking about a human or a thing.
5. If no property with a value is found after several tries: go to step 2 to retrieve a new instance.
6. Compose a sentence (currently @sem_web has 4 different sentences) with the information (concept, instance, property, value).
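The queries behind the instance and property steps can be sketched as below. These are illustrative reconstructions, not @sem_web’s exact queries; in practice they would be sent to the public DBPedia endpoint (dbpedia.org/sparql).

```python
RDF_PREFIX = "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>"

def instance_query(class_uri, limit=1):
    """Sketch of the instance-retrieval step: find an object
    with rdf:type class_uri."""
    return """%s
        SELECT ?instance WHERE {
          ?instance rdf:type <%s> .
        } LIMIT %d
    """ % (RDF_PREFIX, class_uri, limit)

def property_query(instance_uri):
    """Sketch of the property-retrieval step: list the instance's
    properties and their values."""
    return """
        SELECT ?property ?value WHERE {
          <%s> ?property ?value .
        }
    """ % instance_uri

query = instance_query("http://dbpedia.org/class/yago/ComputerGame100458890")
```

The class URI here is the ‘video game’ seed from the log output above; the bot would pick one instance from the result set and then fire the property query at it.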
Besides that, @sem_web posts a random tweet once an hour, by picking a random concept from the DBPedia ontology. Working on @sem_web allows me to get to grips with both the SPARQL query language and programming in Python (which, still, is something I haven’t done before in a larger-than-20-lines-of-code way).
What I’m working on next is a method to compare multiple concepts, when @sem_web detects more than one in a tweet. Currently, this works by taking each concept and querying for all the superClasses of the concept. I then store the path from the seed to the topClass (Entity) in a list, repeat the process for the next concept, and then compare both paths to the top, to identify a common parent-Class.
This is relevant for my graduation project as well, because a large part of determining the right subject for a text will be determining the ‘proximity’ or similarity of the different concepts in the text. Still, that specific task is a much bigger problem; finding common superClasses is just a tiny step towards it. There are other interesting relationships to explore, for example partOf/sameAs relations. I’m curious to see what kind of information I will gather with this from larger texts.
An example of the concept comparison in action. From the following tweet:
Picked mendicot: @offbeattravel .. FYI, my Twitter bot
@vagabot found you by parsing (and attempting to answer)
travel questions off the Twitter firehose ..
The findCommonParent function takes two URIs and processes them, appending a new list with the superClasses of each initial URI. This way I can track all the ‘hops’ made by counting the list indices. As soon as the function has processed both URIs, it starts comparing the path lists to determine the first common parent.
Here you can see the first common parentClass is ‘Event’: 3 hops away from ‘ChangeOfLocation’, and 5 hops away from ‘Locomotion’. If it finds multiple superClasses, it will process multiple URIs at the same time (in one list). Anyway, this is just the basic stuff. There’s plenty more on my to-do list…
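The idea behind findCommonParent can be sketched over a toy class hierarchy. The intermediate class names below are invented stand-ins for the DBPedia/YAGO tree, arranged so that the hop counts match the ‘Event’ example above; the function names and shapes are my reconstruction, not the bot’s actual code.

```python
# Toy superClass relation: class -> its direct superClass.
# Invented names, arranged to reproduce the example's hop counts:
# ChangeOfLocation is 3 hops from Event, Locomotion is 5.
SUPERCLASS = {
    "ChangeOfLocation": "Change",
    "Change": "Happening",
    "Happening": "Event",
    "Locomotion": "Motion",
    "Motion": "Movement",
    "Movement": "Act",
    "Act": "Action",
    "Action": "Event",
    "Event": "Entity",
}

def path_to_top(concept):
    """List of classes from concept up to the topClass (Entity)."""
    path = [concept]
    while path[-1] in SUPERCLASS:
        path.append(SUPERCLASS[path[-1]])
    return path

def find_common_parent(a, b):
    """First class on a's upward path that also lies on b's path."""
    path_b = set(path_to_top(b))
    for cls in path_to_top(a):
        if cls in path_b:
            return cls
    return None
```

Counting how far the common parent sits along each path gives the hop counts; in the real setting the superClass relation comes from rdfs:subClassOf triples in the store rather than a dictionary.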
While the major part of the functionality I’m building for @sem_web will be directly usable for my thesis project, I haven’t been sitting still on more directly thesis-related things either. I’ve set up a local RDF store (a Sesame store) on my laptop with all the needed bio-ontologies; RDFLib’s in-memory stores were clearly not up to the large ontologies I had to load each time. This also means I have to structure my queries better, as not all information is available at any given time. I also, unfortunately, learned that one of my initial plans, finding the shortest path between two nodes in an RDF store to determine ‘proximity’, is actually quite a complicated task. Next I will focus on improving the concept comparison, taking more properties into account than only rdfs:subClass, and I’ll also work on extracting keywords (which I haven’t arranged testing data for yet, but should have). Till next time!
But mostly, the last weeks I’ve been learning SPARQL, improving my Python skills, and getting a better and more concrete idea of the possible approaches for my thesis project by working on sem_web.
Project by Peter Curet & David Graus for the ‘Embodied Vision’ course by Joost Rekveld for the Media Technology MSc. Programme at Leiden University.
We track the movement in the webcam input (adding up all movement to the left and right, and up and down). This results in two numbers representing the total amount of movement since the start.
The turtle graphic system draws on the basis of character-input:
– ‘w’ makes it move forward
– ‘a’ makes it turn left (but doesn’t draw anything)
– ‘d’ makes it turn right (same)
– ‘s’ changes the thickness of the line
– ‘c’ changes the color
The turtle receives a number of random strings from the genetic algorithm. It calculates the amount and direction of movement each string results in. Then it compares all these numbers to the numbers of the webcam movement. The more alike they are, the fitter we consider the string. We select the fittest string out of the number of strings received, and make the turtle draw it. This string is the basis for the ‘next generation’ of strings: it is fed to the genetic algorithm, which evolves it into multiple other strings. The process repeats indefinitely. Since the webcam input is dynamic and ever-changing, the fitness of the strings will not gradually rise, but is an ever-changing value.
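The evaluation-and-selection loop described above can be sketched as follows. This is a minimal reconstruction under assumptions: the heading model, fitness measure, and mutation scheme are illustrative choices, not necessarily what our actual genetic algorithm did.

```python
import random

def net_movement(commands):
    """Net (x, y) displacement of a turtle command string,
    tracking the heading changed by 'a' (left) and 'd' (right)."""
    x = y = 0
    heading = 0  # 0=up, 1=right, 2=down, 3=left
    for c in commands:
        if c == "a":
            heading = (heading - 1) % 4
        elif c == "d":
            heading = (heading + 1) % 4
        elif c == "w":
            dx, dy = [(0, 1), (1, 0), (0, -1), (-1, 0)][heading]
            x, y = x + dx, y + dy
    return x, y

def fitness(commands, target):
    """Higher is fitter: negative distance to the webcam's
    accumulated (x, y) movement totals."""
    x, y = net_movement(commands)
    return -(abs(x - target[0]) + abs(y - target[1]))

def fittest(population, target):
    """Select the string whose movement best matches the webcam's."""
    return max(population, key=lambda s: fitness(s, target))

def mutate(commands, rate=0.1):
    """Breed a variant by randomly swapping characters (illustrative)."""
    alphabet = "wadsc"
    return "".join(random.choice(alphabet) if random.random() < rate else c
                   for c in commands)
```

Because the webcam totals keep changing between generations, the same string can score well one moment and poorly the next, which is why the fitness never settles.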
The goal of cheeseGame is to plug all the holes in the cheese (because otherwise the cheese sinks)! Each cheese hole has one cheese cork, which you have to drag onto it with your mouse. But watch out: before you know it, you’ll push your cheese corks out of the cheese holes! Especially the later levels are a HUGE CHALLENGE! The physics and graphics speak for themselves: