Below is the first draft of the abstract of my paper. It doesn’t yet include the results/conclusion. Word count: 127

Semantic annotation uses human knowledge formalized in ontologies to enrich texts, by providing structured and machine-understandable information about their content. This paper proposes an approach for automatically annotating texts of the Cyttron Scientific Image Database, using the NCI Thesaurus ontology. Several frequency-based keyword extraction algorithms, aiming to extract core concepts and exclude less relevant concepts, were implemented and evaluated. Furthermore, text classification algorithms were applied to identify important concepts which do not occur in the text. The algorithms were evaluated by comparing them to annotations provided by experts. Semantic networks were generated from these annotations and an ontology-based similarity metric was used to cross-compare them. Finally, the networks were visualized to provide further insights into the differences in the semantic structure generated by humans and the algorithms.

Tags: Semantic annotation, ontology-based semantic similarity, semantic networks, keyword extraction, text classification, network visualization, text mining

Geomapping the Bible and Herman Melville’s Moby Dick

For a small dataviz experiment I wanted to create maps of books, by extracting locations (cities, countries, continents, whatever is mentioned in the text) and drawing these on a map. I used the Stanford Named Entity Recognizer to extract the locations from two books: the Bible and Herman Melville’s Moby Dick. I then wrote a small script in Python to retrieve the latitude and longitude of the locations using the Google Geocoding API, throw it all in a CSV file and draw it on a map using GeoCommons. I also attached an ascending date to each location, to allow an animated visualization of the extracted locations in GeoCommons.
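A minimal sketch of what such a script could look like. This is not the original script: the function names, the CSV columns and the start date are made up for illustration, and the geocoder is passed in as a plain function so the sketch runs without a network call or an API key. It writes one row per mention, so that overlapping circles can darken on the map.

```python
import csv
from datetime import date, timedelta


def build_rows(locations, geocode, start=date(2000, 1, 1)):
    """Turn extracted place names into (name, lat, lon, date) rows.

    `locations` is the list of place names in the order the NER found them.
    `geocode` is any callable mapping a name to (lat, lon) or None; a real
    run would wrap a geocoding service here (assumption, not shown).
    """
    cache = {}  # look every distinct name up only once
    rows = []
    for name in locations:
        if name not in cache:
            cache[name] = geocode(name)
        coords = cache[name]
        if coords is None:  # unresolvable, e.g. a person tagged as a place
            continue
        # Ascending dummy date: one day later for every row that is kept.
        day = start + timedelta(days=len(rows))
        rows.append((name, coords[0], coords[1], day.isoformat()))
    return rows


def write_csv(rows, path):
    """Write the rows with a header line, ready for upload to a mapping tool."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "lat", "lon", "date"])
        writer.writerows(rows)
```

With a dictionary standing in for the geocoder, `build_rows(["Nantucket", "Ahab", "Nantucket"], lookup.get)` keeps both Nantucket mentions and silently drops "Ahab" when the lookup has no entry for it.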

The darker a circle, the more mentions it got (I set the circles’ opacity to 10%, so overlapping circles automatically darken). There were some issues with false positives (Stanford NER identifying persons as locations). And while I didn’t really know what to expect, I was glad to see that the major clusters in both maps did seem to make sense (Nantucket in Moby Dick, around Jerusalem in the Bible). The Bible geomap shows that a lot of places (particularly in the United States) seem to be named after Biblical locations and names. The cluster on the West Coast of the US seems as big as the Middle Eastern cluster, but once you zoom in it becomes clear that it is less tightly packed. Moby Dick’s geomap shows a lot of locations around coastal areas, which seems to make sense. The book also mentions a lot of oceans and seas.

#OccupyAmsterdam wordle

Wordle of the 200 most frequent words in tweets with the hashtag #OccupyAmsterdam. Made from 5,239 tweets posted between Saturday 8 October, 09:55, and 16 October, 15:50.
Manually filtered for nicknames and meaningless words. Here is the list of the 1000 most frequent words: OccupyAmsterdam-woorden.
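A rough sketch of how such a word list could be produced. The filtering in the post was done by hand, so the automatic rules here (skip @-mentions and links, drop a small stopword set) are assumptions for illustration only, and `top_words` is a made-up name.

```python
import string
from collections import Counter


def top_words(tweets, stopwords=frozenset(), n=200):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        for token in tweet.lower().split():
            # Assumption: nicknames and links are never interesting words.
            if token.startswith("@") or token.startswith("http"):
                continue
            word = token.strip(string.punctuation)
            if word and word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)
```

The resulting pairs can be pasted into Wordle, or written to a file to get the longer 1000-word list by calling the function with `n=1000`.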

More text-mining. Popularity contest: Drosophila Melanogaster vs. C. elegans


While waiting on several word-counting scripts to finish counting, I picked up my cancerCounter script to count something else. This time, I wanted to see which organism was more popular and more frequently mentioned in biomedical studies: the ever-present Drosophila melanogaster, aka the common fruit fly, or the aptly named Caenorhabditis elegans (one cannot deny that the 1mm-long worm has quite the elegant wiggle). Two model organisms in biomedical research.

Both have a lot going for them:
– Elegans was the first multicellular organism to ever have its entire genome sequenced (go worm!)
– The worm reproduces and mutates quickly and easily

The fruit fly on the other hand is quite the suitable lab-rat as well:
– Drosophila breeds easily
– Does not need much space nor care
– Has to pay for invading my kitchen each year during summer

I started counting the occurrences of ‘drosophila melanogaster’ or ‘d. melanogaster’ AND ‘caenorhabditis elegans’ or ‘c. elegans’ in the lowercased article body of my 99,000-and-something BioMedCentral article corpus, and took a looksy. First comes the total number of articles published per year, with the number of articles mentioning the fruit fly/worm:
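For the curious, the counting could be sketched roughly like this. It is not the cancerCounter script itself: the function names are invented, and the articles are assumed to arrive as (year, body) pairs that were already pulled out of the corpus.

```python
from collections import defaultdict

# Search terms from the post, matched against the lowercased article body.
FLY = ("drosophila melanogaster", "d. melanogaster")
WORM = ("caenorhabditis elegans", "c. elegans")


def mentions(body, names):
    """True if the body contains at least one of the names."""
    text = body.lower()
    return any(name in text for name in names)


def count_per_year(articles):
    """Count total, fly-mentioning and worm-mentioning articles per year.

    `articles` is an iterable of (year, body) pairs (assumed input format).
    """
    counts = defaultdict(lambda: {"total": 0, "fly": 0, "worm": 0})
    for year, body in articles:
        row = counts[year]
        row["total"] += 1
        row["fly"] += mentions(body, FLY)    # True counts as 1
        row["worm"] += mentions(body, WORM)
    return dict(counts)
```

An article that mentions both organisms is counted once for each, which matches the idea of "articles mentioning the fruit fly/worm" rather than counting every single occurrence.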

As we can see, worryingly, scientists hardly spend enough time performing research with worms and fruit flies. Since 2003, they do consistently play more with the worms than with fruit flies, though. But that’s hard to see, so let’s ditch the total articles:

When we subtract the drosophila articles from the elegans articles, we can see how much the worm has on the fruit fly. The red bars represent by how many articles Elegans wins over Drosophila, and the blue bars indicate by how many articles Drosophila wins over Elegans.

But absolute numbers are not what we’re looking for. As we have seen in the first graph, the number of articles is far from evenly distributed over the years. So let’s see what the difference between both organisms looks like as a ratio:

This evens out some of the bigger differences in the previous graph; Drosophila had ‘only’ a +5 win over Elegans in 2001, but relatively this is a bigger victory than Elegans’ +34 win in 2006, and even its +79 victory in 2009.
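In code, the relative comparison could look like the sketch below. The post does not spell out the denominator, so dividing by the total number of articles published that year is an assumption, and the numbers in the usage note are toy values, not the real counts.

```python
def relative_difference(worm, fly, total):
    """Worm-minus-fly article count as a share of all articles that year.

    Positive means Elegans wins, negative means Drosophila wins.
    Assumption: `total` is the number of articles published in that year.
    """
    if total == 0:
        return 0.0  # no articles, nothing to compare
    return (worm - fly) / total
```

With toy numbers, a 5-article lead out of 100 articles (0.05) is relatively bigger than a 30-article lead out of 3000 (0.01), which is the effect the graph evens out.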

Conclusion: Elegans wins.

Textmining BioMedCentral: Cancer – a trending topic?

I added a graph which shows the ratio of articles containing the word ‘Cancer’ to total articles per year. It sadly still suffers from the incomplete data of earlier years:

*Original post*

This is my first attempt to get some data out of the BioMedCentral dataset, the freely available, Open Access archive of over 40 years of biomedical research articles. I’ll use this set as a training corpus for my thesis, to extract domain-specific features to use when comparing the similarity between two documents. The dataset consists of 103,782 articles from 1969 to today.

My text-mining experiment was a very simple one: count the occurrence of the word ‘cancer’ in every article of the journal. My expectation was that the term would occur more frequently as time progresses: as a science journalist I frequently came across (obscure) biomedical research which concluded its findings by in some way linking to (promising a potential way to discover a potential cure for:) cancer. I always figured it had to do with funding. But I’m no expert.

Anyway, to test this I threw together a simple Python script to parse each (XML-formatted) article, extract its date and the frequency of the word cancer, and output this data to a CSV file. I then averaged the counts per article for each year, resulting in the following graph:
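A minimal sketch of that kind of script, not the original one. The element name `pubdate` is a placeholder, since the real BioMedCentral XML layout is not shown here, and matching only the exact word "cancer" (so not "cancers" or "cancerous") is an assumption as well.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Whole-word, case-insensitive match (assumption: plurals do not count).
WORD = re.compile(r"\bcancer\b", re.IGNORECASE)


def year_and_count(xml_text):
    """Return (publication year, number of times 'cancer' occurs)."""
    root = ET.fromstring(xml_text)
    # 'pubdate' is a hypothetical element holding a date like 2005-03-01.
    year = int(root.findtext("pubdate")[:4])
    text = " ".join(root.itertext())  # all text in the article, tags dropped
    return year, len(WORD.findall(text))


def average_per_year(records):
    """Average count per article for each year, from (year, count) pairs."""
    totals = defaultdict(lambda: [0, 0])  # year -> [sum of counts, articles]
    for year, count in records:
        totals[year][0] += count
        totals[year][1] += 1
    return {year: total / n for year, (total, n) in totals.items()}
```

Feeding the output of `year_and_count` for every article into `average_per_year` gives the per-year averages that go into the graph; writing the intermediate pairs to a CSV file first is a one-liner with the `csv` module.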

I hoped to be able to provide an overview of the frequency of the word in ~40 years of BMC. I wasn’t. The first couple of years seem very incomplete: there aren’t many articles (in the hundreds instead of in the thousands as in later years), and lots of “(To access the full article, please see PDF)”-references (yay to Open Access). Anyway, I figured the last 10 years WERE okay, so I graphed the average occurrence of the word cancer of those last couple of years.

Some initial thoughts:

  • The average (word count per article) might be the wrong metric here. Articles dedicated to cancer-related topics skew the average too much. I am actually looking for the papers which do not contain the word frequently.
  • A better metric could be the ratio of articles that DO contain the word (at least once). I’ll give that a shot later and update this post.
  • There does seem to be some increase in occurrence, but I wouldn’t say it’s enough to support my observation.
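The alternative metric from the second bullet could be sketched as follows. The input format, (year, count) pairs with one pair per article, and the function name are assumptions for illustration.

```python
from collections import defaultdict


def share_with_word(records):
    """Share of articles per year that contain the word at least once.

    `records` is an iterable of (year, count) pairs, one per article.
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [articles with word, articles]
    for year, count in records:
        totals[year][0] += count > 0  # True counts as 1
        totals[year][1] += 1
    return {year: hits / n for year, (hits, n) in totals.items()}
```

Unlike the average, this number cannot be inflated by a handful of articles that are entirely about cancer: an article with 300 mentions weighs exactly as much as an article with one.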