More text-mining. Popularity contest: Drosophila melanogaster vs. C. elegans


While waiting for several word-counting scripts to finish, I picked up my cancerCounter script to count something else. This time, I wanted to see which organism is more popular and more frequently mentioned in biomedical studies: the ever-present Drosophila melanogaster, aka the common fruit fly, or the aptly named Caenorhabditis elegans (one cannot deny that the 1 mm-long worm has quite the elegant wiggle). Both are widely used model organisms in biomedical research.

Both have a lot going for them:
– Elegans was the first multicellular organism to have its entire genome sequenced (go worm!)
– The worm reproduces and mutates quickly and easily

The fruit fly, on the other hand, makes quite a suitable lab rat as well:
– Drosophila breeds easily
– Does not need much space or care
– Has to pay for invading my kitchen each year during summer

I counted the articles mentioning ‘drosophila melanogaster’ or ‘d. melanogaster’, and separately those mentioning ‘caenorhabditis elegans’ or ‘c. elegans’, in the lowercased article body of my corpus of 99,000-and-something BioMedCentral articles, and took a looksy. First comes the total number of articles published per year, together with the number of articles mentioning the fruit fly or the worm:
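For the curious, here is a minimal sketch of what such a count could look like. It is not the actual cancerCounter script: the `(year, body)` input format, the function name `count_mentions` and the sample texts are all made up for illustration, and it uses plain substring matching on the lowercased body.

```python
from collections import defaultdict

# Search terms per organism; matched against the lowercased article body.
TERMS = {
    "drosophila": ("drosophila melanogaster", "d. melanogaster"),
    "elegans": ("caenorhabditis elegans", "c. elegans"),
}


def count_mentions(articles):
    """Count, per year, all articles and the articles mentioning each organism.

    `articles` is an iterable of (year, body) pairs (a hypothetical input format).
    """
    counts = defaultdict(lambda: {"total": 0, "drosophila": 0, "elegans": 0})
    for year, body in articles:
        text = body.lower()
        counts[year]["total"] += 1
        for organism, terms in TERMS.items():
            # An article counts once per organism, however often it is mentioned.
            if any(term in text for term in terms):
                counts[year][organism] += 1
    return {year: dict(row) for year, row in counts.items()}


if __name__ == "__main__":
    # Invented sample texts, not taken from the corpus.
    sample = [
        (2001, "We studied Drosophila melanogaster wing development."),
        (2001, "A survey of hospital admissions."),
        (2006, "RNAi screens in C. elegans and D. melanogaster."),
    ]
    for year, row in sorted(count_mentions(sample).items()):
        print(year, row)
```

Counting each article once per organism (rather than counting every occurrence) keeps a single fly-obsessed paper from dominating a year.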

As we can see, worryingly, scientists hardly spend enough time performing research with worms and fruit flies. Since 2003, they do consistently play more with the worms than with the fruit flies, though. But that’s hard to see in this graph, so let’s ditch the total articles:

When we subtract the Drosophila articles from the Elegans articles, we can see how much of a lead the worm has on the fruit fly. The red bars represent by how many articles Elegans wins over Drosophila, and the blue bars indicate by how many articles Drosophila wins over Elegans.

But absolute numbers are not what we’re looking for. As we have seen in the first graph, the number of articles is far from evenly distributed over the years. So let’s look at the ratio instead, i.e. the difference between both organisms relative to the number of articles that year:

This evens out some of the bigger differences in the previous graph; Drosophila had ‘only’ a +5 win over Elegans in 2001, but relatively this is a bigger victory than Elegans’ +34 win in 2006, and even its +79 victory in 2009.
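As a small sketch of the arithmetic: I am assuming here that "ratio" means the Elegans-minus-Drosophila difference divided by the total number of articles published that year. The function name `elegans_lead` and the counts in the example are invented; only the +5 and +34 differences echo the text above.

```python
def elegans_lead(counts):
    """Return {year: (difference, ratio)} for per-year article counts.

    difference = elegans - drosophila (positive: the worm wins)
    ratio      = difference / total articles that year (assumed normalisation)
    """
    result = {}
    for year, row in counts.items():
        diff = row["elegans"] - row["drosophila"]
        # Guard against years without any articles.
        ratio = diff / row["total"] if row["total"] else 0.0
        result[year] = (diff, ratio)
    return result


if __name__ == "__main__":
    # Invented totals and per-organism counts, for illustration only.
    counts = {
        2001: {"total": 100, "drosophila": 10, "elegans": 5},
        2006: {"total": 2000, "drosophila": 40, "elegans": 74},
    }
    for year, (diff, ratio) in sorted(elegans_lead(counts).items()):
        print(f"{year}: difference {diff:+d}, ratio {ratio:+.3f}")
```

With these made-up totals, a lead of 5 articles in a thin year weighs heavier than a lead of 34 in a year with twenty times as many articles, which is the effect described above.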

Conclusion: Elegans wins.

Textmining BioMedCentral: Cancer – a trending topic?

I added a graph which shows the ratio of articles containing the word ‘Cancer’ to total articles per year. It sadly still suffers from the incomplete data of earlier years:

*Original post*

This is my first attempt to get some data out of the BioMedCentral dataset, the freely available, Open Access archive of over 40 years of biomedical research articles. I’ll use this set as a training corpus for my thesis, to extract domain-specific features to use when comparing the similarity between two documents. The dataset consists of 103,782 articles from 1969 to today.

My text-mining experiment was a very simple one: count the occurrences of the word ‘cancer’ in every article of the journal. My expectation was that the term would occur more frequently as time progresses: as a science journalist I frequently came across (obscure) biomedical research that concluded by linking its findings in some way to cancer (usually by promising a potential way to discover a potential cure). I always figured it had to do with funding. But I’m no expert.

Anyway, to test this I threw together a simple Python script to parse each (XML-formatted) article, extract its date and the frequency of the word cancer, and output this data to a CSV file. I then averaged the counts per article for each year, resulting in the following graph:
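A rough sketch of such a script is below. It is a reconstruction, not my original code: the tag names (`year`, `body`), the function names and the sample article are assumptions, and it counts whole-word, case-insensitive matches (so ‘cancers’ is not counted).

```python
import csv
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def extract_year_and_count(xml_text, word="cancer"):
    """Return (year, occurrences of `word` in the body), or None if data is missing.

    Assumes a <year> and a <body> element somewhere in the article (hypothetical tags).
    """
    root = ET.fromstring(xml_text)
    year_el = root.find(".//year")
    body_el = root.find(".//body")
    if year_el is None or body_el is None or not (year_el.text or "").strip():
        return None
    # Join all text nodes in the body and lowercase them before matching.
    body = "".join(body_el.itertext()).lower()
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return int(year_el.text.strip()), len(re.findall(pattern, body))


def average_per_year(rows):
    """Average the per-article counts for each year; rows are (year, count) pairs."""
    totals = defaultdict(lambda: [0, 0])  # year -> [sum of counts, number of articles]
    for year, count in rows:
        totals[year][0] += count
        totals[year][1] += 1
    return {year: s / n for year, (s, n) in sorted(totals.items())}


def write_csv(averages, out):
    """Write one row per year to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(["year", "avg_count_per_article"])
    for year, avg in averages.items():
        writer.writerow([year, f"{avg:.2f}"])


# Invented sample article, not taken from the corpus.
SAMPLE = (
    "<article><year>2009</year>"
    "<body><p>Cancer cells divide. Lung cancer risk rises.</p></body></article>"
)

if __name__ == "__main__":
    rows = [r for r in (extract_year_and_count(x) for x in [SAMPLE]) if r]
    out = io.StringIO()
    write_csv(average_per_year(rows), out)
    print(out.getvalue())
```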

I hoped to be able to provide an overview of the frequency of the word in ~40 years of BMC. I wasn’t. The first couple of years seem very incomplete: there aren’t many articles (in the hundreds instead of in the thousands as in later years), and lots of “(To access the full article, please see PDF)” references (yay to Open Access). Anyway, I figured the last 10 years WERE okay, so I graphed the average occurrence of the word cancer over those years only.

Some initial thoughts:

  • The average (word count per article) might be the wrong metric here. Articles dedicated to cancer-related topics skew the average too much. I am actually looking for the papers which do not contain the word frequently.
  • A better metric could be the ratio of articles that DO contain the word (at least once). I’ll give that a shot later and update this post.
  • There does seem to be some increase in occurrence, but I wouldn’t say it’s enough to support my observation.
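The ratio metric from the second bullet is a small change to the averaging step. A minimal sketch, assuming the per-article results are available as `(year, count)` pairs; the function name `ratio_with_word` and the example counts are made up.

```python
from collections import defaultdict


def ratio_with_word(rows):
    """Per year, the share of articles containing the word at least once.

    `rows` is an iterable of (year, count) pairs, one per article.
    """
    seen = defaultdict(lambda: [0, 0])  # year -> [articles with the word, all articles]
    for year, count in rows:
        seen[year][1] += 1
        if count > 0:
            seen[year][0] += 1
    return {year: hits / total for year, (hits, total) in sorted(seen.items())}


if __name__ == "__main__":
    # Invented counts: one cancer-heavy paper no longer dominates the year.
    rows = [(2009, 120), (2009, 0), (2009, 1), (2010, 0)]
    for year, ratio in ratio_with_word(rows).items():
        print(f"{year}: {ratio:.2f}")
```

Because an article with 120 mentions and one with a single mention both count as one hit, papers dedicated to cancer no longer skew the yearly figure.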