David Graus

PhD Candidate, Semantic Search in eDiscovery

Measure and Visualize Semantic Similarity Between Subgraphs

Monday, March 19, 2012

As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’: it ignores any semantic information, simply treats the concepts as words, and counts hits and misses between two sets of concepts.
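To make that baseline concrete, this is roughly what the set-overlap counting amounts to (a minimal sketch; the function name is mine):

```python
def exact_match_counts(expert, algorithm):
    """The 'harsh' baseline: a concept only counts as a hit when the
    exact same concept appears in both sets."""
    expert, algorithm = set(expert), set(algorithm)
    hits = len(expert & algorithm)
    misses_algo = len(algorithm - expert)    # extracted, but not in the expert list
    misses_expert = len(expert - algorithm)  # expert concepts the algorithm missed
    return hits, misses_algo, misses_expert
```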

To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will measure the algorithm’s performance by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm output) are treated as subgraphs within the main graph: the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, because in the literature I found it consistently outperforms similar path-based metrics in the biomedical domain. Speaking of domains: this measure was originally designed for WordNet (as were many of the other metrics), but it has also been used and validated in the biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-).
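For concreteness, here is a minimal sketch of the Leacock & Chodorow measure over a toy is-a hierarchy (using networkx; the toy concepts and the max_depth value are made up for illustration, not the actual NCI Thesaurus):

```python
import math
import networkx as nx

def lch_similarity(graph, c1, c2, max_depth):
    """Leacock & Chodorow: -log(path_length / (2 * max_depth)).

    `graph` is an undirected view of the is-a hierarchy; the path length
    is counted in nodes (edges + 1), so identical concepts still get a
    finite, maximal score.
    """
    path_len = nx.shortest_path_length(graph, c1, c2) + 1
    return -math.log(path_len / (2.0 * max_depth))

# Toy hierarchy (hypothetical; the real graph is the NCI Thesaurus):
toy = nx.Graph([("Anatomy", "Plane"), ("Plane", "Sagittal Plane"),
                ("Anatomy", "Imaging"), ("Imaging", "MRI")])
print(lch_similarity(toy, "Sagittal Plane", "MRI", max_depth=4))  # ~0.47
```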

I evaluate an algorithm’s output by measuring the similarity of each node across the two subgraphs: the expert’s and my algorithm’s (as shown in the figure on the right). By determining the average semantic similarity of the two graphs, I can pick the best-performing algorithm: the one that produces the semantically ‘closest’ subgraph.
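The post doesn’t spell out the exact aggregation, so the following is just one plausible reading (a hedged sketch, reusing the `lch_similarity` helper above): score each algorithm concept by its best match in the expert set, and average.

```python
def subgraph_similarity(graph, expert_nodes, algo_nodes, max_depth):
    """Average, over the algorithm's concepts, of the best LCh score
    against any expert concept; higher means a semantically 'closer'
    subgraph. A symmetric variant would average both directions."""
    best = [max(lch_similarity(graph, a, e, max_depth) for e in expert_nodes)
            for a in algo_nodes]
    return sum(best) / len(best)
```

Running this for each candidate algorithm and keeping the highest score would then implement the “pick the closest subgraph” selection.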

For visualization purposes, I also include the semantic similarity between each pair of nodes within a subgraph. This allows me to draw graphs where concepts are grouped by similarity: by mapping the similarity between two nodes to the pull force of the edge connecting them, a force-directed layout groups similar concepts together; higher similarity means a stronger pull. An example (created with the excellent d3.js library) below; a sketch of the graph data it consumes follows the legend:

Blue nodes are from an expert.
Red nodes are from one of my algorithms.
The similarity between two nodes is mapped to:
- edge length (short is most similar, long least)
- edge color (white is most similar, black least)
The concept’s specificity is mapped to the size of the node (bigger is more specific).
Hovering over an edge displays the similarity between the two connected nodes.
Zoom in and out with the mousewheel; pan by clicking and dragging.
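The export code isn’t part of the post, but the legend above implies a node/link structure along these lines (a sketch; `source_of`, `specificity`, and `similarity` are hypothetical helpers, and d3’s force layout would map `value` to link distance and stroke color on the JavaScript side):

```python
import json
from itertools import combinations

def export_d3_graph(concepts, source_of, specificity, similarity,
                    path="graph.json"):
    """Serialize both subgraphs into the nodes/links JSON that a d3
    force-directed layout consumes."""
    index = {c: i for i, c in enumerate(concepts)}
    nodes = [{"name": c,
              "group": source_of[c],       # "expert" (blue) or "algorithm" (red)
              "size": specificity[c]}      # more specific concepts drawn bigger
             for c in concepts]
    links = [{"source": index[a], "target": index[b],
              "value": similarity(a, b)}   # higher -> shorter, whiter edge
             for a, b in combinations(concepts, 2)]
    with open(path, "w") as f:
        json.dump({"nodes": nodes, "links": links}, f)
```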

This graph by itself already offers some quick insights: for example, the expert nodes are more specific, and they are semantically more similar to each other (only Sagittal Plane and MRI fall outside the main cluster). But only if these kinds of patterns persist across different experts and texts can they provide valuable insight into how to fine-tune the keyword extraction algorithms, and which specific information to take into account. I found that visualizing helps a lot in identifying these kinds of patterns. On to the analysis!


  • Clement Levallois (http://www.clementlevallois.net/)

    Hey David! I have to learn from you on this similarity work. One question for now: why don’t you filter out stopwords? In the graph here, there are too many of them, making a lot of noise in your data.

    • dvdgrs

      The particular algorithm that created the blue nodes used the literal text, without any keyword extraction. I just took it as an example; it’s performing quite badly indeed…

  • Clement Levallois

    Two other comments:
    - Cool blogging platform! What is it?
    - Writing comments in black on a dark grey background is hard! Something to change, maybe?

    • dvdgrs

      It’s WordPress!
      You’re right. The stupid Disqus comment system doesn’t allow changing the text color in the text area :/. I made the comment box lighter instead :-).
      BTW, I’m planning to write a small WordNet synset similarity tutorial sometime. It uses the same metric for similarity measurement as my ontology-based approach (but mine takes hours to find a path between concepts; NLTK has all the WordNet synset similarities stored somewhere, I think. It’s very cool!).
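      (For a taste: with NLTK’s bundled WordNet corpus the lookup is a one-liner; dog/cat is just a stock example here.)

      ```python
      from nltk.corpus import wordnet as wn  # needs the WordNet corpus downloaded

      # The same Leacock & Chodorow metric, but over WordNet's precomputed
      # hypernym taxonomy, so it returns instantly.
      print(wn.synset('dog.n.01').lch_similarity(wn.synset('cat.n.01')))  # ~2.03
      ```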

  • jonathanstray (http://twitter.com/jonathanstray)

    Hi David. Really nice to run into your work! Seems we are playing with a lot of similar concepts in different domains. In my document-set analysis problem I don’t really have any ontology available; in fact, part of the point of the system is to build an ontology/categorization system for the specific document sets.

    Anyway I thought I would point you to a few references that have really influenced my thinking. I’m deliberately choosing references that may be challenging or contradictory to the work you’ve done, not because your work isn’t good, but because maybe you’re not aware of these references. 

    Why do we necessarily believe that formal, hierarchical categorization systems are the best?
    http://www.shirky.com/writings/ontology_overrated.html
    http://www.well.com/~doctorow/metacrap.htm

    How well does cosine similarity match up with human notions of document similarity?
    http://jonathanstray.com/papers/EmpiricalEvalTextSimilarity.pdf

    Does unsupervised clustering really capture “meaning” and “topics” uniquely when there is a combinatorially huge number of possible categorizations?
    http://gking.harvard.edu/files/abs/discov-abs.shtml

    A very old paper that, perhaps, conceptually ties your edge- and node-based similarity together. This is basic search engine research… from the 1950s!
    http://www.phil-fak.uni-duesseldorf.de/fileadmin/Redaktion/Institute/Informationswissenschaft/downloadcenter/infocenter/Informationretrieval/Luhn_1957_statistical_approach.pdf

    I’ve learned about a bunch of things I didn’t know before from reading your blog, keep it up!

    • dvdgrs

      That’s great, thanks for those refs! 

      As far as domain is concerned, I would really like to explore how my current methods can be ported to a more general domain once I’m finished. (This is why I’ve kept corpus-dependent methods to a minimum; actually, only my TF-IDF topic classification approach relies on a corpus.) I like the idea of a corpus-free, domain-independent system for annotation.

      But general/broad ontologies are indeed hard to come by. I was experimenting with the DBpedia ontology earlier; it’s huge, but it could be an interesting ontology to experiment with (or Wikipedia: http://edgar.meij.pro/adding-semantics-microblogs/ ).

      Another potentially interesting way to add semantics to simple text processing could be to use the WordNet OWL/RDF representation ( http://www.w3.org/2006/03/wn/wn20/ ) to explore relations between identified words.
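      (Roughly what I have in mind, as a sketch with rdflib; the dump filename here is hypothetical.)

      ```python
      from rdflib import Graph, Namespace

      WN20 = Namespace("http://www.w3.org/2006/03/wn/wn20/schema/")

      g = Graph()
      g.parse("wordnet-hyponym.rdf")  # a local dump of the W3C WordNet 2.0 RDF

      # Walk hyponym links, e.g. to check whether one identified word
      # denotes a narrower concept than another.
      for narrow, broad in g.subject_objects(WN20.hyponymOf):
          print(narrow, "is a kind of", broad)
      ```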

      Anyway, those are just future ideas. First, the NCI Thesaurus and I have a job to finish ;-).