spaceship EYE
dvdgrs [graus.nu] posted a photo:
I have started a Dutch blog on data mining, as I haven’t really come across one and figured the time is right, with the growing interest in data journalism, dataviz, big data, etc. The idea is to provide a general, human-understandable overview of the (academic) field of data mining and its innovations.
I will still use this blog to keep the world informed of my personal endeavours ;-).

This category can soon be archived ;)! Earlier this week I handed in my final paper, and yesterday was the day of my final presentation. It was a great day and I’m really excited about embarking on my next adventure. I will soon start as a PhD candidate at the University of Amsterdam, on a very exciting project in ‘Semantic Search in e-Discovery’ at the Information and Language Processing Systems group. Naturally, this blog will keep the world informed of my work and projects ;). Exciting times!
Download my paper: Automatic Annotation of Cyttron Entries using the NCIthesaurus [PDF – 328 KB]
Download the supplementary data (graphs, tables and viz): Supplementary Data [PDF – 2.27 MB]
Check out the D3.js-powered demo of a similarity graph (comparing expert & computer-generated annotations)
“(a) A sagittal reconstruction of a coronally acquired magnetic resonance imaging (MRI) scan, at the level on which the cingulate gyrus was measured. The area outlined represents the portion of the scan used to orient the operator to the landmarks of the cingulate. A box has been placed over the region of interest in one hemisphere. (b) A diagram of the cingulate gyrus divided into the rostral portion of the anterior cingulate (RAC), the caudal portion of the interior cingulate (CAC), and the posterior cingulate (PC). Adjoining landmarks include the corpus callosum (CC), the lateral ventricle (Lat. Vent.), and the thalamus (Thal.). (c) The region of the cingulate gyrus measured in the present study, as delineated on the MRI scan of a control subject. […] ” (snippet)
Two sets of annotations (Expert 1 & Expert 3)
Result in the following similarity graph:
dvdgrs [graus.nu] posted a photo:
Moon through Nikkor 70-300mm@300mm… Bit noisy
dvdgrs [graus.nu] posted a photo:
in ‘t Twiske

That’s how it goes here in Tuindorp
As I blogged previously, I am working on measuring the performance of my keyword extraction algorithms. The confusion matrix approach I have implemented is quite ‘harsh’. It ignores any semantic information and simply treats the concepts as words, and counts hits and misses between two sets of concepts.
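To make the ‘harsh’ part concrete, here is a minimal sketch of such a set-based count. The function names, the tiny vocabulary and the example annotations are made up for illustration; this is not the code used in my project.

```python
def confusion_counts(expert, algorithm, vocabulary):
    """Count hits and misses between two sets of concept labels.

    Concepts are compared as plain strings, so a near miss such as
    'Brain' versus 'Cingulate Gyrus' counts as a full miss.
    """
    expert, algorithm, vocabulary = set(expert), set(algorithm), set(vocabulary)
    return {
        "tp": len(expert & algorithm),               # found by both
        "fp": len(algorithm - expert),               # algorithm only
        "fn": len(expert - algorithm),               # expert only
        "tn": len(vocabulary - expert - algorithm),  # chosen by neither
    }


def precision_recall(counts):
    """Derive precision and recall from the counts above."""
    predicted = counts["tp"] + counts["fp"]
    relevant = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted if predicted else 0.0
    recall = counts["tp"] / relevant if relevant else 0.0
    return precision, recall


# Hypothetical example data, not taken from the Cyttron database.
vocabulary = {"Brain", "Cingulate Gyrus", "Thalamus",
              "Magnetic Resonance Imaging", "Lateral Ventricle"}
expert = {"Cingulate Gyrus", "Thalamus", "Magnetic Resonance Imaging"}
algorithm = {"Brain", "Thalamus", "Magnetic Resonance Imaging"}

counts = confusion_counts(expert, algorithm, vocabulary)
precision, recall = precision_recall(counts)
```

In this example ‘Brain’ is scored as a plain false positive, even though it is closely related to what the expert picked. That is exactly the information this approach throws away.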
To benefit from the semantic information described in the NCI Thesaurus, and thus produce more detailed results, I will evaluate the algorithm by measuring the semantic similarity between the lists of concepts. The two lists (expert data & algorithm) are treated as subgraphs within the main graph: the NCI Thesaurus. Their similarity is measured with a path-based semantic similarity metric, of which there are several. I have implemented Leacock & Chodorow’s measure, as in the literature I found that it consistently outperforms similar path-based metrics in the biomedical domain. Speaking of domains: like many of the other metrics, this measure was originally designed for WordNet, but it has also been used and validated in the biomedical domain. Hooray for domain-independent, unsupervised and corpus-free approaches to similarity measurement ;-).
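For the curious, a rough sketch of the idea. Leacock & Chodorow’s measure is usually written as sim(c1, c2) = -log(len(c1, c2) / (2 * D)), where len is the shortest path between the two concepts counted in nodes and D is the maximum depth of the taxonomy. The toy hierarchy below is invented for illustration and is not the NCI Thesaurus. The `list_similarity` helper (averaging each expert concept’s best match) is just one possible way to compare two lists, an assumption on my part rather than the method from my paper.

```python
import math
from collections import deque

# Hypothetical toy is-a hierarchy (child -> parent); NOT the NCI Thesaurus.
PARENT = {
    "Anatomy": "Thing",
    "Procedure": "Thing",
    "Brain": "Anatomy",
    "Cingulate Gyrus": "Brain",
    "Thalamus": "Brain",
    "Imaging": "Procedure",
    "MRI": "Imaging",
}


def build_neighbours(parent):
    """Turn the child -> parent map into an undirected adjacency map."""
    neighbours = {}
    for child, par in parent.items():
        neighbours.setdefault(child, set()).add(par)
        neighbours.setdefault(par, set()).add(child)
    return neighbours


def max_depth(parent):
    """Depth of the deepest concept, counting the root as depth 1."""
    def depth(node):
        d = 1
        while node in parent:
            node = parent[node]
            d += 1
        return d
    return max(depth(n) for n in set(parent) | set(parent.values()))


def path_length_in_nodes(neighbours, a, b):
    """Shortest path between a and b via breadth-first search, in nodes."""
    if a == b:
        return 1
    seen = {a}
    queue = deque([(a, 1)])
    while queue:
        node, length = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt == b:
                return length + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, length + 1))
    raise ValueError(f"no path between {a!r} and {b!r}")


def lch_similarity(a, b, parent=PARENT):
    """Leacock & Chodorow: -log(path length / (2 * taxonomy depth))."""
    length = path_length_in_nodes(build_neighbours(parent), a, b)
    return -math.log(length / (2.0 * max_depth(parent)))


def list_similarity(expert, algorithm, parent=PARENT):
    """Assumed aggregation: mean best-match score per expert concept."""
    best = [max(lch_similarity(e, a, parent) for a in algorithm) for e in expert]
    return sum(best) / len(best)
```

With this toy hierarchy, ‘Cingulate Gyrus’ and ‘Thalamus’ (siblings under ‘Brain’) score higher than ‘Cingulate Gyrus’ and ‘MRI’, which only meet at the root. A near miss therefore earns partial credit instead of counting as a plain miss.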
Below is the first draft of the abstract of my paper. It doesn’t yet include the results/conclusion. Word count: 127
Semantic annotation uses human knowledge formalized in ontologies to enrich texts, by providing structured and machine-understandable information about their content. This paper proposes an approach for automatically annotating texts of the Cyttron Scientific Image Database, using the NCI Thesaurus ontology. Several frequency-based keyword extraction algorithms, aiming to extract core concepts and exclude less relevant concepts, were implemented and evaluated. Furthermore, text classification algorithms were applied to identify important concepts which do not occur in the text. The algorithms were evaluated by comparing them to annotations provided by experts. Semantic networks were generated from these annotations and an ontology-based similarity metric was used to cross-compare them. Finally, the networks were visualized to provide further insights into the differences between the semantic structures generated by humans and the algorithms.
Tags: Semantic annotation, ontology-based semantic similarity, semantic networks, keyword extraction, text classification, network visualization, text mining