As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’. It ignores any semantic information and simply treats the concepts as words, and counts hits and misses between two sets of concepts.
To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will measure the algorithm’s performance by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm) are treated as subgraphs within the main graph: the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, as in the literature I found it consistently outperforms similar path-based metrics in the Biomedical domain. Speaking of domain; this measure has originally been designed for WordNet (as many of the other metrics), but has also been used and validated in the Biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-). (more…)