As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’. It ignores any semantic information and simply treats the concepts as words, and counts hits and misses between two sets of concepts.
To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will measure the algorithm’s performance by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm) are treated as subgraphs within the main graph: the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, as in the literature I found it consistently outperforms similar path-based metrics in the Biomedical domain. Speaking of domain; this measure has originally been designed for WordNet (as many of the other metrics), but has also been used and validated in the Biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-). (more…)
The first steps towards creating graph-based text visualizations is thinking about how I would like to visually represent the structure of the information I extract from texts. I have included a preliminary example of the output at the bottom of this page. I use three different methods of extracting concepts from a piece of text:
Counting literal occurrences of concepts (from ontologies)
Finding related concepts by textual comparison of the text to the concepts’ descriptions
Finding related concepts by exploring the ontological structure (aka relating concepts within one ontology by finding paths and parents, and possibly relevant neighbours)
The primary distinction I want to make is that of relevance (aka ‘likeliness of topic’). In the case of 1 this would be the frequency of the word (more occurrences = more relevant concept), in the case of 2 this is the calculated similarity of the source text to the concepts’ descriptions (which is a number between 0 and 1). In the case of 3, this would be by ‘connections’: the more concepts the concepts I find by exploring an ontologies’ structure would link, the more relevant this found concept would be. I want to model this distinction by node’s size: the more relevant a concept is, the bigger I want to draw the concept in the graph.
The second distinction is that of literal to non-literal (1 being literal, 2 and 3 being non-literal). I want to model this distinction by style: literal concepts will be drawn as a filled circle, non-literal concepts as outlined circles.
The third distinction is that of the concepts’ source: from which ontology does a concept originate? This distinction will be modeled by color: each ontology (of the six I use) will have its distinct color. Explored concepts (from step 3) such as parents and shared parents will be colored in distinct colours as well, since they are connected in the graph to the coloured nodes, their source will implicitly be clear.
Since Gephi doesn’t fully support Graphviz’ DOT-language, and the graph library I use in Python conveniently parses graphs in DOT, I use Graphviz to directly render the results.
An issue I came across with the scaling (to represent relevance), is that I’m working with multiple measures: frequency of literal words (1), percentage of text-similarity (2), and degree-count. In an effort to roughly equalize the scaling factor, I decided to use static counters. Each node gets an initial size of 0.5 (0.4 for shared parents, and 0.3 for parents). For each occurrence of a literal word, I add 0.05. For the text-similarity, I add the percentage of 0.5 (26% similarity = 0.5 + (0.5*0.26)). For the degree, I add 0.1 for each in-link the node receives. This is an initial attempt at unifying the results. Anyway, these are just settings I’m playing around with.
These are very rough examples, because:
I use the literal representation of the input text (as I have not yet determined the most suitable keyword extraction method)
I haven’t determined a proper ‘cut-off’ for the text similarity measure; currently it includes every concept it finds with similarity ≥ 25%.
It doesn’t yet fully incorporate step 3 (it includes parents, but not yet paths between nodes)
It doesn’t scale nodes according to in-links
There is no filtering applied yet (removing obsolete classes, for example).
All this time I have been working on my ontology-powered topic identification system: combining semantic web technologies with natural language processing and text-mining techniques to extract (a) topic(s) from a text. But I never really decided what my program should output: a topic? A list of potential topics with their ‘likeliness’? That would mean the relationships between topics my SPARQL-powered ‘pathfinder’ finds would disappear in an algorithm that used those paths to calculate semantic similarity, which would be a pitty. This weekend it finally all came together:
I already played around with drawing graphs with Gephi, as a method to check the results in a way other than processing huge lists in my python interpreter. But now I realize it could be the perfect ‘final product’. What I want to create is a ‘semantically-augmented tag-cloud-graph‘. As opposed to a standard tagcloud, my augmented tag graph will be a visualization of the text, composed of both interlinked concepts and solitary concepts, concepts that can be found literally in the text and concepts that aren’t mentioned anywhere in the text. The tag graph will benefit from the linked data nature of ontologies to:
1. Show relationships between concepts
2. Show extra concepts which do not occur literally in the text: nodes that occur in a path between two nodes, and maybe ‘superClass’ and ‘subClass’ nodes.
In this way, it will be similar to a tag cloud as it conveys the content of a text, but augmented as it could convey the meaning of a text and relationships between tags. It will also be similar to the DBPedia RelFinder, but augmented as it will show links but also contain a hierarchy of more to less important concepts. I hope that in the end it could be a viable alternative of graphic text-representation.
As I see it, my semantic graph-cloud should communicate at least these three things:
1. The most likely topic of the text
2. The concepts which occur literally in the text versus concepts that do not occur in the text
3. Clusters of similar concepts
My first idea is to model these three properties by size, alpha-channel and colour respectively, the bigger node, the more likely the topic, transparant nodes are the ones that do not occur in the text, and coloured nodes to group semantically similar nodes. But that’s just my initial plan, I might have to give it some more thoughts and experiment with it.
Next, I should think about a method to ‘measure’ whether my semantic tagcloud conveys more information, or has ajy added value. In the end, I am totally unsure whether the resulting tag graphs will make sense as it depends on multiple factors such as the efficiency of my keyword extraction, the efficiency of the vector-space based string comparison, the quality of the ontologies, etc.
The good news is, most of the technical work is in a close-to-finished state. I have various methods of extracting keywords from large texts (which I will be validating soon), I have a functional method to find ‘new’ concepts by comparing a text to all the descriptions of ontology concepts, I have a functional shortest-path finder to explore how two concepts are related (and produce a graph of it). It’s just a matter of putting it all together and selecting a suitable tool to draw the graphs. I don’t think I’ll use Gephi, as I want to fully integrate the graph drawing in my script, so who knows, maybe it’s back to NetworkX, igraph, or maybe Protovis?
In the mean time I’m looking at validating the data I generated with the various keyword extraction algorithms. By using human experts and cross-validating with existing keywords from certain entries, I’ll be able to evaluate which method of keyword extraction is the most efficient for my purposes. Once that’s done, and I’ve picked my graph-drawing method, I can start showing some preliminary results!
To my right is a visualization of the output of my SPARQL-powered shortest path algorithm, finding a link between ‘intracellular and extracellular accumulation’ & ‘developmental and adult structural defect’, 2 concepts in the Mouse Pathology ontology. Click it! It shows the two ‘source’ concepts in white, and the shortest path (of 3 nodes: 4 hops) in grey. It looks like this in python:
My algorithm generates a directed subgraph of all the nodes and edges it encounters during its search for a path between two ontology concepts. I figured generating this subgraph would make it easier to get some of the variables I need for the semantic similarity measurement calculation (such as amount changes in directions in the path, node’s depth from the root, etc.). Furthermore, I can use the subgraph to more easily assign weights to the textual data surrounding the nodes, when assembling my bag-of-words model of a node’s direct context, as I’ve explained in my previous post.
What to use
There are heaps of libraries for managing graphs in Python, and loads of programs to visualize and manipulate graphs. Here’s my stuff of choice.
After looking at several python-graph libraries ( NetworkX, igraph, graph-tool, etc.) I choose to use python-graph, which in my opinion is the most straightforward, compact, simple yet powerful graph-lib for python. I use python-graph to generate a file describing the subgraph in DOT language (“plain text graph description language”). This file can then be imported in a wide array of programs to visualize and manipulate the graph.
Visualizing the subgraph containing the shortest path between two nodes would allow me to get a better picture of what my algorithm fetches, and also to double-check the results (seeing is believing ;)). To visualize graphs, there are plenty of options again. After sifting through Tulip, Graphviz and some other obscure programs I stumbled upon Gephi, a very complete, pretty and simple open-source graph visualization & manipulation program. It has extensive documentation, and several advanced features to manipulate the graph and fetch some values. Ideally though, I will manage all those ‘advanced value-fetching tasks’ in my python script. Gephi still provides a nice and quick way to double-check some of the output and get a more concrete idea about what’s happening, as things can get pretty complex, pretty fast:
1136 nodes of the DOID-ontology, subgraph produced when finding a link for DOID_4 (disease) and DOID_10652 (Alzheimer's Disease) - red nodes in the graph. Linked by the three yellow nodes.