David's subgraph

PhD Candidate Semantic Search in eDiscovery

Python graphs and visualizations

Sunday, September 18, 2011
2,396 views
0 comments

To my right is a visualization of the output of my SPARQL-powered shortest path algorithm, finding a link between ‘intracellular and extracellular accumulation’ & ‘developmental and adult structural defect’, 2 concepts in the Mouse Pathology ontology. Click it! It shows the two ‘source’ concepts in white, and the shortest path (of 3 nodes: 4 hops) in grey. It looks like this in python:

[[u'http://purl.obolibrary.org/obo/MPATH_56', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_55'], [u'http://purl.obolibrary.org/obo/MPATH_55', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_603'], [u'http://purl.obolibrary.org/obo/MPATH_1', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_603'], ['http://purl.obolibrary.org/obo/MPATH_33', u'http://www.w3.org/2000/01/rdf-schema#subClassOf', u'http://purl.obolibrary.org/obo/MPATH_1']]

Which is clearly less awesome.

My algorithm generates a directed subgraph of all the nodes and edges it encounters during its search for a path between two ontology concepts. I figured generating this subgraph would make it easier to get some of the variables I need for the semantic similarity measurement calculation (such as amount changes in directions in the path, node’s depth from the root, etc.). Furthermore, I can use the subgraph to more easily assign weights to the textual data surrounding the nodes, when assembling my bag-of-words model of a node’s direct context, as I’ve explained in my previous post.

What to use

There are heaps of libraries for managing graphs in Python, and loads of programs to visualize and manipulate graphs. Here’s my stuff of choice.

After looking at several python-graph libraries ( NetworkX, igraph, graph-tool, etc.) I choose to use python-graph, which in my opinion is the most straightforward, compact, simple yet powerful graph-lib for python. I use python-graph to generate a file describing the subgraph in DOT language (“plain text graph description language”). This file can then be imported in a wide array of programs to visualize and manipulate the graph.

Visualizing the subgraph containing the shortest path between two nodes would allow me to get a better picture of what my algorithm fetches, and also to double-check the results (seeing is believing ;)). To visualize graphs, there are plenty of options again. After sifting through Tulip, Graphviz and some other obscure programs I stumbled upon Gephi, a very complete, pretty and simple open-source graph visualization & manipulation program. It has extensive documentation, and several advanced features to manipulate the graph and fetch some values. Ideally though, I will manage all those ‘advanced value-fetching tasks’ in my python script. Gephi still provides a nice and quick way to double-check some of the output and get a more concrete idea about what’s happening, as things can get pretty complex, pretty fast:

1136 nodes of the DOID-ontology, subgraph produced when finding a link for DOID_4 (disease) and DOID_10652 (Alzheimer's Disease) - red nodes in the graph. Linked by the three yellow nodes.