Force-Directed Graphs: Playing around with D3.js

Update: Newer example of Force-Directed d3.js Graph here: Measure and Visualize Semantic Similarity Between Subgraphs

I recently replaced python-graph in my code with NetworkX, a slightly more sophisticated graph library for Python. Besides some more advanced algorithms for graph analysis (comparison, unison etc.) which can prove useful when analyzing data (comparing human data with mine, for example), I can also easily export my graphs to all kinds of formats. For example, to JSON. As I was getting a bit tired of GraphViz’ stubborn methods, and it’s far from dynamic approach, I decided to start playing around with the excellent Data Driven Documents JavaScript library, better known as D3.js, the successor to Protovis. Actually I had planned this quite a while ago, simply because I was impressed with the Force-directed Graph example on their website. I figured for coolness sake, I should implement them, instead of using the crummy GraphViz graphs.

So after a night and day of tinkering with the D3 code (starting from the Graph example included in the release, modifying stuff as I went) I came to this:

Click to play!

The red nodes are the concepts taken from the texts (either literal: filled red circles, or resulting from text classification: red donuts). The orange nodes are LCS-nodes (Lowest Common Subsumers), aka ‘parent’ nodes, and all the grey ones are simply in-between nodes (either for shortest paths between nodes, or parent nodes).

I added the labels, and also implemented zoom and panning functionality (mousewheel to zoom, click and drag to pan), included some metadata (hover with mouse over nodes to see their URI, over edges to see the relation). I am really impressed with the flexibility of D3, it’s amazing that I can now load up any random graph produced from my script, and  instantly see results.

The bigger plan is to make a fully interactive Graph, by starting with the ‘semantic similarity’ graph (where only the red nodes are displayed), and where clicking on edges expands the graph, by showing the relationship between two connected nodes. Semantic expansion at the click of a mouse ;)!

In other news

I’ve got a date for my graduation! If everything goes right, March 23rd is the day I’ll present my finished project. I’ll let you know once it’s final.

Pathfinder finds paths!

First results are in! My computer spent about 20 hours to retrieve and store neighboring concepts of over 10.000 concepts which my Breadth-First-Search algorithm passed through to find the shortest paths between 6 nodes. But here is the result, first the ‘before’ graph, which I showed earlier: all retrieved concepts with their parent relations. Below that is the new graph, which relates all concepts by finding their shortest paths (so far only the orange concepts – from the Gene Ontology).




What were two separate clusters in the first to the left is now one big fat cluster… Which is cool!

Less cool is the time it took… But oh well, looks like I’m going to have to prepare some examples as proof of concepts. Nowhere near realistic realtime performance so far… (however I got a big speed increase by moving my Sesame triple store from my ancient EeePC900 to my desktop computer… Goodbye supercomputer). The good news is that all neighboring nodes I processed so far are cached in a local SQLite database, so those 20 hours were not a waste! (considering my total ontology database consists out of over 800.000 concepts, and 10.000 concepts took 20 hours, is something I choose not to take into consideration however :p).

It is important to note that the meaning or interpretation of the resulting graph (and particularly the relations between concepts) is not the primary concern here: the paths (their lengths, the directions of the edges and the node’s ‘depths’) will be primarily used for the ontology-based semantic similarity measure I wrote about in this post.

Ontology-based Text Visualization: Towards a Visual Language

The first steps towards creating graph-based text visualizations is thinking about how I would like to visually represent the structure of the information I extract from texts. I have included a preliminary example of the output at the bottom of this page. I use three different methods of extracting concepts from a piece of text:

  1. Counting literal occurrences of concepts (from ontologies)
  2. Finding related concepts by textual comparison of the text to the concepts’ descriptions
  3. Finding related concepts by exploring the ontological structure (aka relating concepts within one ontology by finding paths and parents, and possibly relevant neighbours)

The primary distinction I want to make is that of relevance (aka ‘likeliness of topic’). In the case of 1 this would be the frequency of the word (more occurrences = more relevant concept), in the case of 2 this is the calculated similarity of the source text to the concepts’ descriptions (which is a number between 0 and 1). In the case of 3, this would be by ‘connections’: the more concepts the concepts I find by exploring an ontologies’ structure would link, the more relevant this found concept would be. I want to model this distinction by node’s size: the more relevant a concept is, the bigger I want to draw the concept in the graph.

The second distinction is that of literal to non-literal (1 being literal, 2 and 3 being non-literal). I want to model this distinction by style: literal concepts will be drawn as a filled circle, non-literal concepts as outlined circles.

The third distinction is that of the concepts’ source: from which ontology does a concept originate? This distinction will be modeled by color: each ontology (of the six I use) will have its distinct color. Explored concepts (from step 3) such as parents and shared parents will be colored in distinct colours as well, since they are connected in the graph to the coloured nodes, their source will implicitly be clear.

Since Gephi doesn’t fully support Graphviz’ DOT-language, and the graph library I use in Python conveniently parses graphs in DOT, I use Graphviz to directly render the results.

An issue I came across with the scaling (to represent relevance), is that I’m working with multiple measures: frequency of literal words (1), percentage of text-similarity (2), and degree-count. In an effort to roughly equalize the scaling factor, I decided to use static counters. Each node gets an initial size of 0.5 (0.4 for shared parents, and 0.3 for parents). For each occurrence of a literal word, I add 0.05. For the text-similarity, I add the percentage of 0.5 (26% similarity = 0.5 + (0.5*0.26)). For the degree, I add 0.1 for each in-link the node receives. This is an initial attempt at unifying the results. Anyway, these are just settings I’m playing around with.


These are very rough examples, because:
  • I use the literal representation of the input text (as I have not yet determined the most suitable keyword extraction method)
  • I haven’t determined a proper ‘cut-off’ for the text similarity measure; currently it includes every concept it finds with similarity ≥ 25%.
  • It doesn’t yet fully incorporate step 3 (it includes parents, but not yet paths between nodes)
  • It doesn’t scale nodes according to in-links
  • There is no filtering applied yet (removing obsolete classes, for example).