Ontology-based Text Visualization: Towards a Visual Language

The first steps towards creating graph-based text visualizations is thinking about how I would like to visually represent the structure of the information I extract from texts. I have included a preliminary example of the output at the bottom of this page. I use three different methods of extracting concepts from a piece of text:

  1. Counting literal occurrences of concepts (from ontologies)
  2. Finding related concepts by textual comparison of the text to the concepts’ descriptions
  3. Finding related concepts by exploring the ontological structure (aka relating concepts within one ontology by finding paths and parents, and possibly relevant neighbours)

The primary distinction I want to make is that of relevance (aka ‘likeliness of topic’). In the case of 1 this would be the frequency of the word (more occurrences = more relevant concept), in the case of 2 this is the calculated similarity of the source text to the concepts’ descriptions (which is a number between 0 and 1). In the case of 3, this would be by ‘connections’: the more concepts the concepts I find by exploring an ontologies’ structure would link, the more relevant this found concept would be. I want to model this distinction by node’s size: the more relevant a concept is, the bigger I want to draw the concept in the graph.

The second distinction is that of literal to non-literal (1 being literal, 2 and 3 being non-literal). I want to model this distinction by style: literal concepts will be drawn as a filled circle, non-literal concepts as outlined circles.

The third distinction is that of the concepts’ source: from which ontology does a concept originate? This distinction will be modeled by color: each ontology (of the six I use) will have its distinct color. Explored concepts (from step 3) such as parents and shared parents will be colored in distinct colours as well, since they are connected in the graph to the coloured nodes, their source will implicitly be clear.

Since Gephi doesn’t fully support Graphviz’ DOT-language, and the graph library I use in Python conveniently parses graphs in DOT, I use Graphviz to directly render the results.

An issue I came across with the scaling (to represent relevance), is that I’m working with multiple measures: frequency of literal words (1), percentage of text-similarity (2), and degree-count. In an effort to roughly equalize the scaling factor, I decided to use static counters. Each node gets an initial size of 0.5 (0.4 for shared parents, and 0.3 for parents). For each occurrence of a literal word, I add 0.05. For the text-similarity, I add the percentage of 0.5 (26% similarity = 0.5 + (0.5*0.26)). For the degree, I add 0.1 for each in-link the node receives. This is an initial attempt at unifying the results. Anyway, these are just settings I’m playing around with.


These are very rough examples, because:
  • I use the literal representation of the input text (as I have not yet determined the most suitable keyword extraction method)
  • I haven’t determined a proper ‘cut-off’ for the text similarity measure; currently it includes every concept it finds with similarity ≥ 25%.
  • It doesn’t yet fully incorporate step 3 (it includes parents, but not yet paths between nodes)
  • It doesn’t scale nodes according to in-links
  • There is no filtering applied yet (removing obsolete classes, for example).

Python graphs and visualizations

To my right is a visualization of the output of my SPARQL-powered shortest path algorithm, finding a link between ‘intracellular and extracellular accumulation’ & ‘developmental and adult structural defect’, 2 concepts in the Mouse Pathology ontology. Click it! It shows the two ‘source’ concepts in white, and the shortest path (of 3 nodes: 4 hops) in grey. It looks like this in python:

[[u’http://purl.obolibrary.org/obo/MPATH_56′, u’http://www.w3.org/2000/01/rdf-schema#subClassOf’, u’http://purl.obolibrary.org/obo/MPATH_55′], [u’http://purl.obolibrary.org/obo/MPATH_55′, u’http://www.w3.org/2000/01/rdf-schema#subClassOf’, u’http://purl.obolibrary.org/obo/MPATH_603′], [u’http://purl.obolibrary.org/obo/MPATH_1′, u’http://www.w3.org/2000/01/rdf-schema#subClassOf’, u’http://purl.obolibrary.org/obo/MPATH_603′], [‘http://purl.obolibrary.org/obo/MPATH_33′, u’http://www.w3.org/2000/01/rdf-schema#subClassOf’, u’http://purl.obolibrary.org/obo/MPATH_1′]]

Which is clearly less awesome.

My algorithm generates a directed subgraph of all the nodes and edges it encounters during its search for a path between two ontology concepts. I figured generating this subgraph would make it easier to get some of the variables I need for the semantic similarity measurement calculation (such as amount changes in directions in the path, node’s depth from the root, etc.). Furthermore, I can use the subgraph to more easily assign weights to the textual data surrounding the nodes, when assembling my bag-of-words model of a node’s direct context, as I’ve explained in my previous post.

What to use

There are heaps of libraries for managing graphs in Python, and loads of programs to visualize and manipulate graphs. Here’s my stuff of choice.

After looking at several python-graph libraries ( NetworkX, igraph, graph-tool, etc.) I choose to use python-graph, which in my opinion is the most straightforward, compact, simple yet powerful graph-lib for python. I use python-graph to generate a file describing the subgraph in DOT language (“plain text graph description language”). This file can then be imported in a wide array of programs to visualize and manipulate the graph.

Visualizing the subgraph containing the shortest path between two nodes would allow me to get a better picture of what my algorithm fetches, and also to double-check the results (seeing is believing ;)). To visualize graphs, there are plenty of options again. After sifting through Tulip, Graphviz and some other obscure programs I stumbled upon Gephi, a very complete, pretty and simple open-source graph visualization & manipulation program. It has extensive documentation, and several advanced features to manipulate the graph and fetch some values. Ideally though, I will manage all those ‘advanced value-fetching tasks’ in my python script. Gephi still provides a nice and quick way to double-check some of the output and get a more concrete idea about what’s happening, as things can get pretty complex, pretty fast:

1136 nodes of the DOID-ontology, subgraph produced when finding a link for DOID_4 (disease) and DOID_10652 (Alzheimer's Disease) - red nodes in the graph. Linked by the three yellow nodes.