Bridge in Amsterdam
Samyang 8mm f3.5 fisheye. Skate flicks and hip-hop videos are now a real possibility.
In the colander-hack category, except not with a colander but with a coat rack, and quite different besides. This is what Jysk intended for the Ascot coat rack:
And this is how it hangs on my wall now:
First results are in! My computer spent about 20 hours retrieving and storing the neighboring concepts of over 10,000 concepts, which my breadth-first search algorithm passed through to find the shortest paths between 6 nodes. But here is the result: first the ‘before’ graph, which I showed earlier, containing all retrieved concepts with their parent relations. Below that is the new graph, which relates all concepts by finding their shortest paths (so far only for the orange concepts, from the Gene Ontology).
What were two separate clusters in the first graph are now one big fat cluster… which is cool!
Less cool is the time it took… but oh well, it looks like I’m going to have to prepare some examples as proofs of concept. Nowhere near realistic real-time performance so far… (although I got a big speed increase by moving my Sesame triple store from my ancient EeePC 900 to my desktop computer… goodbye, supercomputer). The good news is that all neighboring nodes I processed so far are cached in a local SQLite database, so those 20 hours were not a waste! (The fact that my total ontology database consists of over 800,000 concepts, and 10,000 concepts took 20 hours, is something I choose not to take into consideration, however :p)
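The caching idea is straightforward to sketch: before asking the triple store for a concept’s neighbors, check a local SQLite table first, and only fall back to the (slow) SPARQL query on a miss. A minimal illustration of that pattern in Python (the table layout, function names, and the `query_endpoint` callable are my own placeholders, not the actual code):

```python
import json
import sqlite3
from collections import deque

def get_neighbors(conn, concept, query_endpoint):
    """Return neighbors of `concept`, hitting the SPARQL endpoint only on a cache miss."""
    row = conn.execute(
        "SELECT neighbors FROM neighbor_cache WHERE concept = ?", (concept,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    neighbors = query_endpoint(concept)  # the expensive SPARQL call
    conn.execute(
        "INSERT INTO neighbor_cache (concept, neighbors) VALUES (?, ?)",
        (concept, json.dumps(neighbors)),
    )
    conn.commit()
    return neighbors

def shortest_path(conn, source, target, query_endpoint):
    """Plain breadth-first search over the (cached) concept graph."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in get_neighbors(conn, path[-1], query_endpoint):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no path between the two concepts

conn = sqlite3.connect("neighbors.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS neighbor_cache (concept TEXT PRIMARY KEY, neighbors TEXT)"
)
```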
It is important to note that the meaning or interpretation of the resulting graph (and particularly of the relations between concepts) is not the primary concern here: the paths (their lengths, the directions of the edges, and the nodes’ depths) will primarily be used for the ontology-based semantic similarity measure I wrote about in this post.
The first step towards creating graph-based text visualizations is thinking about how I would like to visually represent the structure of the information I extract from texts. I have included a preliminary example of the output at the bottom of this page. I use three different methods of extracting concepts from a piece of text:
1. Literal matching: finding ontology concepts that occur literally in the text
2. Text similarity: comparing the source text to the concepts’ descriptions
3. Ontology exploration: following relations in the ontologies’ structure from concepts found by the first two methods
The primary distinction I want to make is that of relevance (aka ‘likeliness of topic’). In the case of 1 this would be the frequency of the word (more occurrences = more relevant concept); in the case of 2 it is the calculated similarity of the source text to the concept’s description (a number between 0 and 1). In the case of 3, this would be the number of ‘connections’: the more concepts a concept found by exploring an ontology’s structure links together, the more relevant it is. I want to model this distinction by node size: the more relevant a concept is, the bigger I want to draw it in the graph.
The second distinction is that of literal to non-literal (1 being literal, 2 and 3 being non-literal). I want to model this distinction by style: literal concepts will be drawn as filled circles, non-literal concepts as outlined circles.
The third distinction is that of the concept’s source: from which ontology does a concept originate? This distinction will be modelled by colour: each of the six ontologies I use will have its own distinct colour. Explored concepts (from step 3), such as parents and shared parents, will get distinct colours as well; since they are connected in the graph to the coloured nodes, their source will be implicitly clear.
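In Graphviz terms, the literal/non-literal and source distinctions come down to a few node attributes. A rough sketch using pydot (the colour table, attribute choices, and example concepts are illustrative only; the only colour the post fixes is orange for the Gene Ontology):

```python
import pydot

# One illustrative colour per ontology; the real mapping covers all six sources.
ONTOLOGY_COLOURS = {
    "GeneOntology": "orange",
    "OtherOntology": "steelblue",  # placeholder entry
}

def make_node(name, literal, ontology):
    """Literal concepts become filled circles, non-literal ones outlined circles."""
    colour = ONTOLOGY_COLOURS.get(ontology, "grey")
    return pydot.Node(
        name,
        shape="circle",
        style="filled" if literal else "solid",
        color=colour,
        fillcolor=colour if literal else "white",
    )

graph = pydot.Dot(graph_type="digraph")
graph.add_node(make_node("apoptosis", literal=True, ontology="GeneOntology"))
graph.add_node(make_node("cell death", literal=False, ontology="GeneOntology"))
graph.add_edge(pydot.Edge("apoptosis", "cell death"))
graph.write_png("tag_graph.png")  # rendered by Graphviz under the hood
```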
Since Gephi doesn’t fully support Graphviz’s DOT language, and the graph library I use in Python conveniently parses graphs in DOT, I use Graphviz to render the results directly.
An issue I came across with the scaling (to represent relevance) is that I’m working with multiple measures: frequency of literal words (1), percentage of text similarity (2), and degree count (3). In an effort to roughly equalize the scaling factors, I decided to use fixed increments. Each node gets an initial size of 0.5 (0.4 for shared parents, and 0.3 for parents). For each occurrence of a literal word, I add 0.05. For the text similarity, I add that percentage of 0.5 (26% similarity = 0.5 + (0.5*0.26)). For the degree, I add 0.1 for each in-link the node receives. This is an initial attempt at unifying the results; these are just settings I’m playing around with.
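Written out as a helper function, the sizing rules above look roughly like this (the argument and role names are my own):

```python
def node_size(kind, frequency=0, similarity=0.0, in_links=0, role=None):
    """Compute a node's drawing size from the rules above.

    kind: 'literal' (method 1), 'similarity' (method 2) or 'explored' (method 3)
    role: for explored nodes, 'shared_parent' or 'parent' get smaller base sizes
    """
    base = {"shared_parent": 0.4, "parent": 0.3}.get(role, 0.5)
    if kind == "literal":
        return base + 0.05 * frequency      # +0.05 per occurrence of the word
    if kind == "similarity":
        return base + base * similarity     # e.g. 26% similarity -> 0.5 + 0.5*0.26
    if kind == "explored":
        return base + 0.1 * in_links        # +0.1 per incoming link
    return base

# sanity check: 26% similarity gives 0.63
assert abs(node_size("similarity", similarity=0.26) - 0.63) < 1e-9
```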
All this time I have been working on my ontology-powered topic identification system: combining semantic web technologies with natural language processing and text-mining techniques to extract (a) topic(s) from a text. But I never really decided what my program should output: a topic? A list of potential topics with their ‘likeliness’? That would mean the relationships between topics that my SPARQL-powered ‘pathfinder’ finds would disappear into an algorithm that uses those paths to calculate semantic similarity, which would be a pity. This weekend it finally all came together:
I already played around with drawing graphs with Gephi, as a method to check the results in a way other than processing huge lists in my Python interpreter. But now I realize it could be the perfect ‘final product’. What I want to create is a ‘semantically-augmented tag-cloud-graph’. As opposed to a standard tag cloud, my augmented tag graph will be a visualization of the text, composed of both interlinked and solitary concepts, both concepts that can be found literally in the text and concepts that aren’t mentioned anywhere in it. The tag graph will benefit from the linked-data nature of ontologies to:
1. Show relationships between concepts
2. Show extra concepts which do not occur literally in the text: nodes that occur in a path between two nodes, and maybe ‘superClass’ and ‘subClass’ nodes.
In this way it will be similar to a tag cloud, in that it conveys the content of a text, but augmented in that it can also convey the meaning of the text and the relationships between tags. It will also be similar to the DBpedia RelFinder, but augmented in that it will not only show links but also contain a hierarchy of more to less important concepts. I hope that in the end it could be a viable alternative for graphical text representation.
As I see it, my semantic graph-cloud should communicate at least these three things:
1. The most likely topic of the text
2. The concepts which occur literally in the text versus concepts that do not occur in the text
3. Clusters of similar concepts
My first idea is to model these three properties by size, alpha channel and colour respectively: the bigger the node, the more likely the topic; transparent nodes are those that do not occur in the text; and coloured nodes group semantically similar concepts. But that’s just my initial plan; I might have to give it some more thought and experiment with it.
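For the alpha-channel part, Graphviz accepts RGBA hex colours, so transparency for concepts that do not occur in the text could be expressed directly in the node colour. A tiny sketch of that idea (the cluster colour and opacity value are placeholders):

```python
def node_colour(cluster_colour, occurs_in_text):
    """Return a Graphviz colour string: opaque for literal concepts,
    partially transparent (alpha channel) for concepts not in the text."""
    alpha = "ff" if occurs_in_text else "66"  # ~40% opacity for inferred concepts
    return cluster_colour + alpha

# e.g. an orange cluster: "#ffa500ff" for literal nodes, "#ffa50066" for inferred ones
print(node_colour("#ffa500", True), node_colour("#ffa500", False))
```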
Next, I should think about a method to ‘measure’ whether my semantic tag cloud conveys more information, or has any added value. In the end, I am totally unsure whether the resulting tag graphs will make sense, as that depends on multiple factors such as the efficiency of my keyword extraction, the efficiency of the vector-space-based string comparison, the quality of the ontologies, etc.
The good news is, most of the technical work is in a close-to-finished state. I have various methods of extracting keywords from large texts (which I will be validating soon), a functional method to find ‘new’ concepts by comparing a text to the descriptions of ontology concepts, and a functional shortest-path finder to explore how two concepts are related (and produce a graph of it). It’s just a matter of putting it all together and selecting a suitable tool to draw the graphs. I don’t think I’ll use Gephi, as I want to fully integrate the graph drawing into my script, so who knows, maybe it’s back to NetworkX, igraph, or maybe Protovis?
In the meantime I’m looking at validating the data I generated with the various keyword extraction algorithms. By using human experts and cross-validating with existing keywords from certain entries, I’ll be able to evaluate which method of keyword extraction is the most effective for my purposes. Once that’s done, and I’ve picked my graph-drawing method, I can start showing some preliminary results!
For a small dataviz experiment I wanted to create maps of books, by extracting locations (cities, countries, continents, whatever is mentioned in the text) and drawing these on a map. I used the Stanford Named Entity Recognizer to extract the locations from two books: the Bible and Herman Melville’s Moby Dick. I then wrote a small script in Python to retrieve the latitude and longitude of the locations using the Google Geocoding API, write it all to a CSV file, and draw it on a map using GeoCommons. I also added an ascending date to each location, to allow an animated visualization of the extracted locations in GeoCommons.
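The geocoding and CSV step is mostly glue code. A minimal sketch of how such a lookup could work (the NER output is stubbed with a few example place names, GOOGLE_API_KEY is a placeholder, and the original script’s details may differ):

```python
import csv
import requests

GOOGLE_API_KEY = "..."  # placeholder

def geocode(place):
    """Look up a place name with the Google Geocoding API; return (lat, lon) or None."""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": place, "key": GOOGLE_API_KEY},
    )
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# In the real pipeline these come from the Stanford NER output.
locations = ["Nantucket", "Jerusalem", "Nineveh"]

with open("locations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["location", "lat", "lon", "date"])
    for day, place in enumerate(locations, start=1):
        coords = geocode(place)
        if coords:
            # an ascending artificial date enables the animated view in GeoCommons
            writer.writerow([place, coords[0], coords[1], "2011-01-%02d" % day])
```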
The darker a circle, the more mentions it got (I set the circles’ opacity to 10%, so overlapping circles automatically darken). There were some issues with false positives (Stanford NER identifying persons as locations). And while I didn’t really know what to expect, I was glad to see that the major clusters in both maps did seem to make sense (Nantucket in Moby Dick, around Jerusalem in the Bible). The Bible geomap shows that a lot of places (particularly in the United States) seem to be named after Biblical locations and names. The cluster on the West Coast of the US seems as big as the Middle Eastern cluster, but once you zoom in it becomes clear that it is less tightly packed. Moby Dick’s geomap shows a lot of locations around coastal areas, which makes sense; it also mentions a lot of oceans and seas.
Project with René Coenen and Ronald Bezemer, for the elective ‘Seminar Affective Computing’, at the Delft University of Technology.
Title: Long Term Affect Expression in English Text to Speech Synthesis
Abstract: “Is it possible to improve the perceived naturalness of computer-generated speech by adding the expression of long-term affect? In order to answer this question we built a system which is able to analyse texts for affective content, interpret this affective content with an emotional model (using different initial emotional states), and express the interpreted affect using an affective Text-to-Speech engine. Our findings indicate that adding this extra form of expressiveness to Text-to-Speech engines does in fact improve perceived naturalness.”
For this project I created a script to extract affective content from texts, using the ANEW (Affective Norms for English Words) lexicon. The affective content was represented as PAD values (pleasure, arousal, dominance). We then implemented several ‘emotion models’, or personas, which each processed the raw affective content differently. An example of the raw PAD values from the Bible can be found at graus.nu/project/bible.
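A rough sketch of what such an extraction step could look like, assuming the ANEW lexicon is available as a CSV file with word, valence, arousal and dominance columns (the file name and column names are assumptions, not the actual format):

```python
import csv
import re

def load_anew(path):
    """Load an ANEW-style lexicon: word -> (pleasure, arousal, dominance)."""
    lexicon = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            lexicon[row["word"].lower()] = (
                float(row["valence"]),
                float(row["arousal"]),
                float(row["dominance"]),
            )
    return lexicon

def pad_values(text, lexicon):
    """Average the PAD values of all lexicon words found in the text."""
    hits = [lexicon[w] for w in re.findall(r"[a-z']+", text.lower()) if w in lexicon]
    if not hits:
        return None
    n = len(hits)
    return tuple(sum(dim) / n for dim in zip(*hits))

lexicon = load_anew("anew.csv")  # hypothetical path to the lexicon file
print(pad_values("And God saw that it was good.", lexicon))
```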