It’s a wrap!

13 Apr 12 in Blog thesis, views: 151, permalink

Photo by Thijs Niks

This category can soon be archived ;)! Earlier this week I handed in my final paper, and yesterday was the day of my final presentation. It was a great day and I’m really excited about embarking on my next adventure.

I will soon start as a PhD candidate at the University of Amsterdam, on a very exciting project in ‘Semantic Search in e-Discovery’ at the Information and Language Processing Systems group. Naturally, this blog will keep the world informed of my work and projects ;). Exciting times!

Paper

Download my paper: Automatic Annotation of Cyttron Entries using the NCIthesaurus [PDF - 328 KB]
Download the supplementary data (graphs, tables and viz): Supplementary Data [PDF - 2.27 MB]

Demo

Similarity graph demo

Check out the D3.js-powered demo of a similarity graph (comparing expert & computer-generated annotations)

» Read this post

text graphs

12 Apr 12 in thesis, views: 21, permalink

Measure and Visualize Semantic Similarity Between Subgraphs

19 Mar 12 in Blog thesis, views: 974, permalink

As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’. It ignores any semantic information and simply treats the concepts as words, and counts hits and misses between two sets of concepts.

To benefit from the semantic  information described in the NCI Thesaurus, and thus produce more detailed results, I will measure the algorithm’s performance by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm) are treated as subgraphs within the main graph: the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, as in the literature I found it consistently outperforms similar path-based metrics in the Biomedical domain. Speaking of domain; this measure has originally been designed for WordNet (as many of the other metrics), but has also been used and validated in the Biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-). » Read this post

Abstract

10 Mar 12 in thesis, views: 43, permalink

Below the first draft of the abstract of my paper. It doesn’t yet include the results/conclusion. Word count: 127

Semantic annotation uses human knowledge formalized in ontologies to enrich texts, by providing structured and machine-understandable information of its content. This paper proposes an approach for automatically annotating texts of the Cyttron Scientific Image Database, using the NCI Thesaurus ontology. Several frequency-based keyword extraction algorithms, aiming to extract core concepts and exclude less relevant concepts, were implemented and evaluated. Furthermore, text classification algorithms were applied to identify important concepts which do not occur in the text. The algorithms were evaluated by comparing them to annotations provided by experts. Semantic networks were generated from these annotations and an ontology-based similarity metric was used to cross-compare them. Finally the networks were visualized to provide further insights into the differences of the semantic structure generated by humans, and the algorithms.

Tags: Semantic annotation, ontology-based semantic similarity, semantic networks, keyword extraction, text classification, network visualization, text mining

Algorithm performance measurement: Confusion Matrix

3 Mar 12 in thesis, views: 175, permalink

As I am starting to gather testing data, I figured it’d be a good time to determine how to measure the performance of the different results of my 24 text representation algorithms. What I want to measure per algorithm: how many keywords are predicted the same as the experts, and how many aren’t. After some research and valuable advice, I came across confusion matrices, which seemed appropriate. For each algorithm I measure the amount of:

Predicted
Negative Positive
Actual Negative A B
Positive
C
D

A. True Negatives (excluded by algorithm & excluded by experts)
B. False Positives (included by algorithm & excluded by experts)
C. False Negatives (excluded by algorithm & included by experts)
D. True Positives (included by algorithm & included by experts)

I found this page from the University of Regina explaining confusion matrices, and decided to implement it. My implementation in pseudocode:

For each text:

  fill expertList with expertResults #list of URIs
  fill algoList with algorithmResults #list of URIs

  algoPOS = algoList1 #URIs included by algorithm
  algoNEG = [item for item in NCIthesaurus if item not in algoPOS] #URIs excluded by algorithm (ontology-algoPOS)
  expertPOS = expertList1 #all URIs included by experts
  expertNEG = [item for item in NCIthesaurus if item not in expertPOS] #all URIs excluded by experts (ontology-expertPOS)

  A += len(set(algoNEG).intersection(expertNEG)) #True Negatives (number of URIs that overlap in algoNEG and expertNEG)
  B += len(set(algoPOS).intersection(expertNEG)) #False Positives (number of URIs that overlap of algoPOS and expertNEG)
  C += len(set(algoNEG).intersection(expertPOS)) #False Negatives (number of URIs that overlap in algoNEG and expertPOS)
  D += len(set(algoPOS).intersection(expertPOS)) #True Positives (number of URIs that overlap in algoPOS and expertPOS)

  matrix.append([[A,B],[C,D]]) #Put numbers in a matrix

With this information I can calculate a set of standard terms:
Accuracy (AC), Recall or True Positive Rate (TP), False Positive Rate (FP), True Negative Rate (TN), False Negative Rate (FN), Precision (P).

  AC = ((A+D) / (A+B+C+D))
  TP = ((D) / (C+D))
  FP = ((B) / (A+B))
  TN = ((A) / (A+B))
  FN = ((C) / (C+D))
  P = ((D) / (B+D))

The only thing with the rates are that proportions are heavily skewed because of the size of the ontology (90.000 URIs), which means the negative cases will always be much more frequent than the positives. This means that accuracy is always around 99.9%, and so is the TN. I still have to figure out exactly what information I want to use and how (visualize?).

Force-Directed Graphs: Playing around with D3.js

27 Feb 12 in Blog thesis, views: 5245, permalink

*Update: Newer example of Force-Directed d3.js Graph here: Measure and Visualize Semantic Similarity Between Subgraphs*

I recently replaced python-graph in my code with NetworkX, a slightly more sophisticated graph library for Python. Besides some more advanced algorithms for graph analysis (comparison, unison etc.) which can prove useful when analyzing data (comparing human data with mine, for example), I can also easily export my graphs to all kinds of formats. For example, to JSON. As I was getting a bit tired of GraphViz’ stubborn methods, and it’s far from dynamic approach, I decided to start playing around with the excellent Data Driven Documents JavaScript library, better known as D3.js, the successor to Protovis. Actually I had planned this quite a while ago, simply because I was impressed with the Force-directed Graph example on their website. I figured for coolness sake, I should implement them, instead of using the crummy GraphViz graphs.

So after a night and day of tinkering with the D3 code (starting from the Graph example included in the release, modifying stuff as I went) I came to this:

Click to play!

The red nodes are the concepts taken from the texts (either literal: filled red circles, or resulting from text classification: red donuts). The orange nodes are LCS-nodes (Lowest Common Subsumers), aka ‘parent’ nodes, and all the grey ones are simply in-between nodes (either for shortest paths between nodes, or parent nodes).

I added the labels, and also implemented zoom and panning functionality (mousewheel to zoom, click and drag to pan), included some metadata (hover with mouse over nodes to see their URI, over edges to see the relation). I am really impressed with the flexibility of D3, it’s amazing that I can now load up any random graph produced from my script, and  instantly see results.

The bigger plan is to make a fully interactive Graph, by starting with the ‘semantic similarity’ graph (where only the red nodes are displayed), and where clicking on edges expands the graph, by showing the relationship between two connected nodes. Semantic expansion at the click of a mouse ;)!

In other news

I’ve got a date for my graduation! If everything goes right, March 23rd is the day I’ll present my finished project. I’ll let you know once it’s final.

Not dead (yet)

10 Feb 12 in thesis, views: 27, permalink

While I haven’t been as active and hard working on my graduation project as I would have liked to be, I am not dead (nor the project). Earlier this week I presented my project to the Bio-imaging group of Leiden University, which helped me a lot. I was able to present my project pretty much as-is, since I’m mostly done with the technical parts. I received valuable feedback and got good insights into what I should explain more thoroughly in the presentation. » Read this post