Measure and Visualize Semantic Similarity Between Subgraphs

As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion-matrix approach I have implemented is quite ‘harsh’: it ignores any semantic information, simply treats the concepts as words, and counts hits and misses between the two sets of concepts.

To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will measure an algorithm’s performance by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm output) are treated as subgraphs within the main graph, the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, as in the literature I found it consistently outperforms similar path-based metrics in the biomedical domain. Speaking of domain: this measure was originally designed for WordNet (as were many of the other metrics), but it has also been used and validated in the biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-).
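To make the measure concrete, here is a minimal sketch of Leacock & Chodorow’s formula, -log(path_length / (2 · D)), on a tiny made-up is-a taxonomy standing in for the NCI Thesaurus. Path length is counted in nodes and D is the taxonomy’s maximum depth; conventions vary slightly between implementations, so treat the constants as illustrative:

```python
import math

# Toy is-a taxonomy (child -> parent); the concept names are made up
# for illustration and are not taken from the NCI Thesaurus.
PARENT = {
    "Disease": "Root",
    "Neoplasm": "Disease",
    "Carcinoma": "Neoplasm",
    "Melanoma": "Neoplasm",
}

def path_to_root(concept):
    """List the concepts from `concept` up to the taxonomy root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def shortest_path_len(c1, c2):
    """Number of nodes on the shortest path via the lowest common ancestor."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    steps_up = {c: i for i, c in enumerate(p1)}
    for j, c in enumerate(p2):
        if c in steps_up:               # first shared ancestor
            return steps_up[c] + j + 1  # count nodes, not edges
    raise ValueError("no common ancestor")

def lch_similarity(c1, c2, max_depth):
    """Leacock & Chodorow: -log(path_length / (2 * D))."""
    return -math.log(shortest_path_len(c1, c2) / (2.0 * max_depth))
```

Identical concepts get the maximum score, -log(1 / 2D), and the score falls off logarithmically as the shortest path through the taxonomy grows.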

I evaluate an algorithm’s output by measuring the similarity of each node across the two subgraphs, the expert’s and my algorithm’s (as shown in the figure on the right). By determining the average semantic similarity of the two graphs I can pick the best-performing algorithm: the one that produces the semantically ‘closest’ subgraph.
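The averaging step can be aggregated in several ways; one plausible scheme (my own sketch, not necessarily the exact one used here) matches each node with its best counterpart in the other subgraph and averages symmetrically:

```python
def subgraph_similarity(nodes_a, nodes_b, sim):
    """Average best-match similarity between two sets of concepts.

    `sim(a, b)` is any pairwise semantic similarity (e.g. Leacock &
    Chodorow).  Each node is paired with its closest counterpart in the
    other subgraph, and the two directional averages are themselves
    averaged to keep the score symmetric.
    """
    def directed(xs, ys):
        return sum(max(sim(x, y) for y in ys) for x in xs) / len(xs)
    return (directed(nodes_a, nodes_b) + directed(nodes_b, nodes_a)) / 2.0
```

With a score like this per algorithm, picking the winner is simply taking the argmax over the candidate subgraphs.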

For visualization purposes, I also measure the semantic similarity between each pair of nodes within a subgraph. This allows me to draw graphs where concepts are grouped by similarity: by mapping the similarity between two nodes to the edge’s (pull) force, a force-directed layout will group concepts by similarity; higher similarity means a stronger pull. An example (created with the excellent d3.js library) is shown below:

Blue nodes are from an expert.
Red nodes are from one of my algorithms.
The similarity between two nodes is mapped to:
– edge-length (short is most similar, long least)
– edge-color (white is most similar, black least)
The concept’s specificity is mapped to the size of the node (bigger is more specific).
Hovering over an edge displays the similarity between the two connected nodes.
Zoom in and out with mousewheel, pan by clicking and dragging.
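The edge mappings above can be sketched as a preprocessing step that turns pairwise similarities into a d3-style link list. The scaling constants and field names below are my own illustration, not the code behind the actual visualization:

```python
def build_links(similarities):
    """Turn pairwise similarities into force-layout link descriptors.

    `similarities` maps (source, target) pairs to a similarity score.
    Higher similarity -> shorter target edge length and lighter color,
    mirroring the mapping described above.
    """
    lo, hi = min(similarities.values()), max(similarities.values())
    span = (hi - lo) or 1.0
    links = []
    for (src, tgt), sim in similarities.items():
        t = (sim - lo) / span                 # 0 = least similar, 1 = most
        grey = int(round(255 * t))            # white = most similar
        links.append({
            "source": src,
            "target": tgt,
            "similarity": sim,
            "distance": 30 + 170 * (1 - t),   # short edge = high similarity
            "color": "#{0:02x}{0:02x}{0:02x}".format(grey),
        })
    return links
```

Serialized to JSON, a list like this can feed a d3 force layout directly, with the `distance` field driving the link force and `color` styling the edge.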

This graph by itself already offers some quick insights: for example, in this graph the expert nodes are more specific, and they are semantically more similar (only Sagittal Plane and MRI fall outside the main cluster). But only if these kinds of patterns persist across different experts and texts could they provide valuable insights into how to fine-tune the keyword extraction algorithms, and what specific information to take into account. I found that visualizing helps a lot in identifying these kinds of patterns. On to analysis!

9 thoughts on “Measure and Visualize Semantic Similarity Between Subgraphs”

  1. Hey David! I have to learn from you on this similarity work. One question for now: why don’t you use a stopword list? In the graph here there are too many stopwords, adding a lot of noise to your data.

    1. This particular algorithm, which created the blue nodes, used the literal text without any keyword extraction. I just took it as an example; it’s performing quite badly indeed… 

  2. two other comments:
    – cool blogging platform! What is it?
    – writing comments in black on a dark grey background is hard! Something to change maybe?

    1. It’s WordPress!
      You’re right. The stupid Disqus comment system doesn’t allow changing the text color in the comment area :/. I made the comment box lighter instead :-).
      BTW, I’m planning to write a small WordNet synset similarity tutorial some time. It uses the same metric for similarity measurement as my ontology-based approach (but mine takes hours to find a path between concepts; NLTK has all the WordNet synset similarities stored somewhere, I think. It’s very cool!).

  3. Hi David. Really nice to run into your work!  Seems we are playing with a lot of similar concepts in different domains. In my document set analysis problem I don’t really have any ontology available — in fact part of the point of the system is to build an ontology/categorization system for the specific document sets.

    Anyway I thought I would point you to a few references that have really influenced my thinking. I’m deliberately choosing references that may be challenging or contradictory to the work you’ve done, not because your work isn’t good, but because maybe you’re not aware of these references. 

    Why do we necessarily believe that formal, hierarchical categorization systems are the best?
    http://www.shirky.com/writings/ontology_overrated.html
    http://www.well.com/~doctorow/metacrap.htm

    How well does cosine similarity match up with human notions of document similarity?
    http://jonathanstray.com/papers/EmpiricalEvalTextSimilarity.pdf

    Does unsupervised clustering really capture “meaning” and “topics” uniquely when there are a combinatorically huge number of possible categorizations?
    http://gking.harvard.edu/files/abs/discov-abs.shtml

    A very old paper that, perhaps, conceptually ties your edge- and node-based similarity together. This is basic search engine research… from the 1950s!
    http://www.phil-fak.uni-duesseldorf.de/fileadmin/Redaktion/Institute/Informationswissenschaft/downloadcenter/infocenter/Informationretrieval/Luhn_1957_statistical_approach.pdf

    I’ve learned about a bunch of things I didn’t know before from reading your blog, keep it up!

    1. That’s great, thanks for those refs! 

      As far as domain is concerned, I would really like to explore how my current methods can be ported to a more general domain when I’m finished. (This is why I’ve kept corpus-dependent methods to a minimum; in fact, only my TF-IDF topic classification approach relies on a corpus.) I like the idea of a corpus-free, domain-independent system for annotation.

      But general/broad ontologies are indeed hard to come by. I was experimenting with the DBPedia ontology earlier, it’s huge, but it could be an interesting ontology to experiment with (or wikipedia: http://edgar.meij.pro/adding-semantics-microblogs/ ). 

      Another potentially interesting way to add semantics to simple text processing could be the WordNet OWL/RDF representation ( http://www.w3.org/2006/03/wn/wn20/ ), to explore relations between identified words. 

      Anyway, those are just future ideas. First, the NCI Thesaurus and I have a job to finish ;-).

  4. Hi!
    I really appreciate your work. I’m a research master’s student working in the ontology field, and part of my work requires visualizing ontology elements according to the semantic similarity between them. I didn’t find the source code of your project on your GitHub, so I wanted to ask whether you could send me your source code; it might help me in my work and give me a good idea.

    Thank you so much and good luck
