Wordle of the 200 most frequently occurring words in tweets with the hashtag #OccupyAmsterdam. Made from 5,239 tweets posted between Saturday October 8, 09:55 and October 16, 15:50.
Manually filtered to remove nicknames and empty filler words. Here is the list of the 1,000 most frequently occurring words: OccupyAmsterdam-woorden.
More text-mining. Popularity contest: Drosophila melanogaster vs. C. elegans
While waiting on several word-counting scripts to finish counting, I picked up my cancerCounter script to count something else. This time, I wanted to see which organism is more popular and more frequently mentioned in biomedical studies: the ever-present Drosophila melanogaster, aka the common fruit fly, or the aptly named Caenorhabditis elegans (one cannot deny that the 1mm-long worm has quite the elegant wiggle). Two model organisms in biomedical research.
Both have a lot going for them:
– Elegans was the first multicellular organism to ever have its entire genome sequenced (go worm!)
– The worm reproduces and mutates quickly and easily
The fruit fly on the other hand is quite the suitable lab-rat as well:
– Drosophila breeds easily
– Does not need much space nor care
– Has to pay for invading my kitchen each year during summer
I started counting the occurrences of ‘drosophila melanogaster’ or ‘d. melanogaster’ AND ‘caenorhabditis elegans’ or ‘c. elegans’ in the lowercased article bodies of my 99,000-and-something-article BioMedCentral corpus, and took a looksy. First comes the total number of articles published per year, together with the number of articles mentioning the fruit fly/worm:
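(A sketch of roughly what this counting boils down to – not the actual script; the one-article-per-line file with a year column is an assumption for illustration:)

import csv
from collections import defaultdict

fly_terms = ('drosophila melanogaster', 'd. melanogaster')
worm_terms = ('caenorhabditis elegans', 'c. elegans')

# year -> {'total': ..., 'fly': ..., 'worm': ...}
counts = defaultdict(lambda: {'total': 0, 'fly': 0, 'worm': 0})
with open('corpus_with_years.txt') as f:  # hypothetical format: year<TAB>article body
    for line in f:
        year, body = line.split('\t', 1)
        body = body.lower()
        counts[year]['total'] += 1
        if any(term in body for term in fly_terms):
            counts[year]['fly'] += 1
        if any(term in body for term in worm_terms):
            counts[year]['worm'] += 1

with open('flyvsworm.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['year', 'total', 'drosophila', 'elegans'])
    for year in sorted(counts):
        c = counts[year]
        writer.writerow([year, c['total'], c['fly'], c['worm']])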
As we can see, worryingly, scientists hardly spend enough time performing research with worms and fruit flies. Since 2003, though, they have consistently played more with the worms than with the fruit flies. But it’s hard to see; let’s ditch the total articles:
When we subtract the Drosophila articles from the Elegans articles, we can see how much of a lead the worm has on the fruit fly. Red bars represent by how many articles Elegans wins over Drosophila, and blue bars indicate by how many articles Drosophila wins over Elegans.
But absolute numbers are not what we’re looking for. As we saw in the first graph, the number of articles is far from evenly distributed over the years. So let’s look at the difference between both organisms as a ratio of the yearly total:
This evens out some of the bigger differences in the previous graph: Drosophila had ‘only’ a +5 win over Elegans in 2001, but relatively speaking this is a bigger victory than Elegans’ +34 win in 2006, or even its +79 victory in 2009.
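(In code, the relative difference could be computed along these lines – the exact normalization is my assumption, reusing the hypothetical counts from the sketch above:)

# positive values mean Elegans wins that year, negative values mean Drosophila does
relative_diff = {year: (c['worm'] - c['fly']) / float(c['total'])
                 for year, c in counts.items()}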
Conclusion: Elegans wins.
Meta Media project
Metamedia project from dvdgrs on Vimeo.
‘Reclame’ our Public Space (a pun on the Dutch ‘reclame’, advertising). A collective artwork in Leiden. By Peter Curet, Nisaar Jagroep, Veneta ‘Andersen’ Vassileva and David Graus.
Mentioned @ golancourses.net
And in the local newspaper Leidsch Dagblad:
Complaints in the Cloud: Online Complaint Behavior on Twitter
Complaints in the Cloud was the final project for ‘Creative Research’, a course by Maarten Lamers and Bas Haring, as part of the Media Technology MSc. programme’s curriculum, in 2009. Together with Barry Borsboom and René Coenen, I tried to find a correlation between complaining behaviour on Twitter and a ‘real-world’ situation.
Abstract:
Does Twitter represent the state of affairs in the real world? To research this, we created a dataset consisting of user-generated delay reports gathered from the social network system Twitter[2], and information on delays acquired through the Nederlandse Spoorwegen's RSS feed on delays[3], covering the first two weeks of November. Our approach is motivated by the key observation that when people get bored, they tend to grab their mobile phone to kill time. Certain Twitter search queries show there are a lot of people using Twitter in or around a train or station, usually a place where people are either waiting or travelling. The analysis based on our dataset reveals that, in general, the number of Twitter complaints coincides with the duration and number of delays: where the value of one is high, the other generally is as well. Based on the data at hand, we can thus conclude that there is in fact a correlation between the reported delays and online complaints on Twitter. Unfortunately, we did not succeed in pointing out a specific relation between individual trajectories and the number of complaints, but this might well be due to the limited scope of our research.
Download full paper here: DavidGraus-BarryBorsboom-ReneCoenen_CreativeResearch.pdf [142kb]
Academic PDF Reader
The Academic PDF Reader is a project I did together with Bertram Bourdrez for the Human Computer Interaction course, part of the Media Technology curriculum, in 2009. It is an exploration of a new way of displaying and interacting with PDF documents, specifically intended for scientific papers. For the project, I researched the specific process of reading a scientific paper, and based on this research I conceptualized some interaction and design principles.
The main finding is that academic papers are read primarily in a non-serial, scanning fashion, and that readers generally have an accurate mental model of the document’s structure. To better support this non-serial reading behaviour, we designed a PDF reader with a dual way of presenting documents: on the one hand a horizontal, ‘zoomed-out’ view, providing a structured overview of an article to support the reader’s mental model of the document; on the other hand a more classic ‘reading’ mode for serial reading.
The project consisted of conceptualizing, researching, designing and developing a novel HCI application, and performing user tests to evaluate our prototype.
The abstract of our paper:
University students read a large number of scientific articles during their studies. Choosing to read digital texts directly from the computer screen as opposed to printing them first can be a time- and money-saving decision. Experienced readers of academic articles use a similar approach to reading academic documents: in a non-serial fashion and by exploiting the similar structure these articles generally share. We find current PDF readers insufficiently capable of supporting this method of reading. Our PDF reader proposes to better support the reading of academic papers by translating beneficial properties of reading from physical paper to the display of digital texts. As the reading method applied by experienced readers is non-serial, it is important to be able to quickly navigate through pages and get a clear overview of the text's structure. Our PDF reader offers an alternative method of displaying digital texts, to optimally support the reading of academic articles.
Download the full paper here: PDF [181kb]
The Real Internet Globalizer
The Real Internet Globalizer is a concept for an internet browser widget, designed together with Barry Borsboom. The widget aims to actively contribute to a more globalized internet. It is designed with three principles in mind:
- Creating awareness of the geographical size of the user’s ‘personal internet’
- Actively contributing to expanding this geographically confined internet
- Providing the user with feedback on the progress made in expanding the geographical size of their internet
For its primary feature, The Real Internet Globalizer gathers geographical data on all the news websites the user visits. It displays the distribution of the countries visited in an infographic on a world map, providing instant insight into where the user surfs most frequently. For the second feature, TRIG finds and ‘suggests’ similar content to the user from parts of the world where the user normally doesn’t surf. This allows the user to judge whether other countries provide other points of view. By following the suggestions, the user’s surfing behaviour becomes more globalized.
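(The paper doesn’t prescribe an implementation, but the primary feature could hypothetically be built on IP geolocation, e.g. with MaxMind’s GeoIP databases – a sketch, not part of the original design:)

import socket
import geoip2.database  # pip install geoip2; requires a GeoLite2 database file

reader = geoip2.database.Reader('GeoLite2-Country.mmdb')

def site_country(hostname):
    # resolve the news site's hostname to an IP and look up the hosting country
    # (hosting location is only a rough proxy for a site's country of origin)
    ip = socket.gethostbyname(hostname)
    return reader.country(ip).country.iso_code

# e.g. tally site_country(host) over the browsing history to fill the world map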
The Real Internet Globalizer was inspired by a talk by Ethan Zuckerman during the Cloud Intelligence Symposium at the Ars Electronica Festival in 2009.
Download the paper describing the Real Internet Globalizer here: BarryBorsboom_DavidGraus-TheRealInternetGlobalizer.pdf [266kb]
Computing string similarity with TF-IDF and Python
“The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.”[wikipedia]
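For reference, the standard formulation of the weight (my reconstruction of the formula that was pictured here; Gensim’s default is the same up to log base and normalization):

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is the frequency of term t in document d, N is the total number of documents in the corpus, and df(t) is the number of documents containing t.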

It is also the weight I use to measure similarity between texts, for these two tasks of my thesis project (click for pic!):
– Step 3: measure the similarity of a cyttron-db entry to a concept-description from an ontology. This will allow me to find concepts in the text that do not appear literally.
– Step 5: relate concepts which come from different ontologies, by measuring how similar the text surrounding one found concept is to the text surrounding another.
As mentioned before, I am using the excellent Gensim “vector space modelling for humans” package, which takes all the complicated mathematics off my hands (like the scary and intimidating formula up top!). Perfect for me, as I’m not a mathematician, nor a computational linguist, nor a statistician, but I AM a human, who wants to work with a solid and proven method of similarity measures and feature extraction for texts. Since I am what I am, I won’t attempt to explain any of the inner workings of bag-of-words models, vector space, and TF-IDF measures – sorry, there are much better places for that. I’ll simply show how I made Gensim work for me (assuming it does).
The first step is to create a training corpus. The training corpus defines the features of the text – the words that will be considered ‘important’ when looking at a text. The training corpus needs to be from the same domain as the target application: in my case the biomedical domain.
At first I was looking at extracting a bunch of relevant Wikipedia articles (all articles from Wikipedia’s Biology category) to use as a training corpus. But then I came across something better: the Open Access BioMed Central full-text corpus, consisting of over 100,000 articles and weighing in at 8GB of XML documents.
I wrote a simple script using lxml to parse the individual files: extracting all plaintext from the article body, cleaning it, and storing it in a new text file (one article per line) for later processing. The cleaning process consists of three steps: tokenizing articles (aka breaking an article up into words), filtering out common stopwords, and finally stemming the remaining words. I chose to include stemming in order to unify words such as ‘hippocampal’ and ‘hippocampus’ (stemming returns the ‘root’ of a word). As I stem both the training corpus and the strings that need to be compared, it is not a disaster if words get stemmed incorrectly: in the end I don’t need to make sense of the stemmed words, I only need them for counting. The plaintext file my script created is 650MB (vs 8.8GB for the uncompressed XML files)!
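(A sketch of what that parsing script might look like – the element names are an assumption on my part, as I’m not reproducing the actual BMC XML schema here:)

import glob
from lxml import etree

with open('corpus.txt', 'w') as out:
    for path in glob.glob('bmc_xml/*.xml'):  # hypothetical location of the article files
        tree = etree.parse(path)
        body = tree.find('.//body')  # assumption: article body lives in a <body> element
        if body is None:
            continue
        text = ' '.join(body.itertext())  # all plaintext inside the body
        tokens = cleanDoc(text)  # the cleaning function shown below
        out.write(' '.join(tokens) + '\n')  # one cleaned article per line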
The cleaning of the articles is pretty straightforward, using pre-cooked NLTK modules: the WordPunct tokenizer, the set of English stopwords, and NLTK’s implementation of the Porter stemmer. For the quality of the similarity measurement it is important to follow the exact same cleaning procedure for the strings I want to compare – I use the same function for the corpus preparation as for the comparison strings:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens
             if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final
Creating a training corpus in Gensim
Gensim‘s documentation is very extensive, and I can recommend going through the tutorials if you want to get an idea of the possibilities. But I couldn’t find much documentation on how to do simple string-to-string comparisons, so I wrote down what I did (and errrm yes, it’s pretty much exactly the same as string-to-index querying you can find in the Gensim tutorials :p):
1. Create a ‘dictionary’ of the training corpus’ raw text:
The dictionary maps each unique word to an ID and keeps word-frequency statistics; it will be used to convert texts to vector space at a later stage:
>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
>>> print dictionary
Dictionary(1049403 unique tokens)
2. Convert the training corpus to vector space:
class MyCorpus(object):
    def __iter__(self):
        for line in open('corpus.txt'):
            yield dictionary.doc2bow(line.lower().split())

>>> corpus = MyCorpus()
>>> corpora.MmCorpus.serialize('corpus.mm', corpus) # Save corpus to disk
>>> corpus = corpora.MmCorpus('corpus.mm') # Load corpus
>>> print corpus
MmCorpus(99432 documents, 1049403 features, 39172124 non-zero entries)
3. Initialize the TF-IDF model:
>>> tfidf = models.TfidfModel(corpus)
>>> print tfidf
TfidfModel(num_docs=99432, num_nnz=39172124)
Thankfully it’s possible to store the generated corpus, dictionary and tfidf to disk: parsing all these documents takes quite a while on my computer. That’s it for the preparation of the training corpus!
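(The corpus was already serialized above; the dictionary and TF-IDF model can be stored and loaded in much the same way – the filenames are arbitrary:)

>>> dictionary.save('corpus.dict')
>>> tfidf.save('corpus.tfidf')
>>> dictionary = corpora.Dictionary.load('corpus.dict')
>>> tfidf = models.TfidfModel.load('corpus.tfidf')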
Comparing two strings
Now whenever I want to compare two strings, using features gathered from the training corpus, I need to:
- Clean both strings the same way I cleaned the articles in the corpus (tokenizing, stopword-filtering and stemming) » cleanDoc(string)
- Convert both cleaned strings to vector space using the dictionary generated from the training corpus » dictionary.doc2bow(tokens)
- Convert both vector-space representations of the strings to TF-IDF space, using the TF-IDF model initialized earlier » tfidf[bow]
When both strings are prepared, all that is left is to compare them, by creating an ‘index’ (from the reference string) and a ‘query’ (from the other string). Order doesn’t matter.
index = similarities.MatrixSimilarity([tfidf1], num_features=len(dictionary))
sim = index[tfidf2]
print str(round(sim*100,2))+'% similar'
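Wrapped up in a function, this gives something like the compareDoc used in the examples below (a sketch – the actual implementation isn’t shown in this post):

def compareDoc(doc1, doc2):
    # clean both strings exactly like the corpus (tokenize, filter stopwords, stem)
    bow1 = dictionary.doc2bow(cleanDoc(doc1))
    bow2 = dictionary.doc2bow(cleanDoc(doc2))
    # convert the bag-of-words vectors to TF-IDF space
    tfidf1, tfidf2 = tfidf[bow1], tfidf[bow2]
    # index one string and query it with the other
    index = similarities.MatrixSimilarity([tfidf1], num_features=len(dictionary))
    sim = index[tfidf2][0]
    print str(round(sim * 100, 2)) + '% similar'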
Resulting in, for example, the comparison of the description of “Alzheimer’s disease” and “Cognitive disease” in the Human Disease (DOID) ontology:
>>> compareDoc("""A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability.""","""A disease of mental health that affects cognitive functions including memory processing, perception and problem solving.""")
23.29% similar
Or another example: the Wikipedia article of “Alzheimer’s disease” compared to the ontology description of “Alzheimer’s disease”:
>>> wikiGet('alzheimer')
alzheimer in wikiTxt
>>> compareDoc(wikiTxt,"""A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability.""")
31.95% similar
Final example: the top 5 most similar ontology concepts to the Wikipedia page of “Alzheimer’s disease”:
>>> descMatch(wikiAlz)
Label: Alzheimer's disease    Similarity: 31.9990843534
Label: vascular dementia      Similarity: 28.0893445015
Label: amyloid deposition     Similarity: 25.6860613823
Label: cognitive disease      Similarity: 18.7662974
Label: dementia               Similarity: 18.0801317096
Now the second task (matching a string to all the descriptions from my ontologies) is much the same process, with the only difference that I need to use the similarities.Similarity object when creating the index of the descriptions: the MatrixSimilarity object resides fully in RAM, while the Similarity object lives on disk.
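(A sketch of how that disk-based index could be built – descriptions as a list of (label, text) pairs and query are hypothetical names for illustration:)

# convert every ontology description to TF-IDF space
desc_vecs = [tfidf[dictionary.doc2bow(cleanDoc(text))] for label, text in descriptions]
# build a disk-based index; 'desc_index' is the prefix for the index files on disk
index = similarities.Similarity('desc_index', desc_vecs, num_features=len(dictionary))
# querying returns one similarity score per description
sims = index[tfidf[dictionary.doc2bow(cleanDoc(query))]]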
I am pretty confident about these preliminary results. It all seems to work as it should, and should be much more robust than my earlier attempts at similarity measurement using difflib and some crummy homegrown keyword-extraction and comparison (which I will still use for generating synonyms, crumminess works for that).
Direction flip counter

I think I just created a functional direction-flips counter for the directed graph that my SPARQL-powered ontology-pathFinder produces :)).
>>> path = [['drie','>','vijf'],['vijf','>','negen'],['zeven','>','negen'],['zeven','>','acht'],['acht','>','twaalf'],['negentien','>','twaalf']]
>>> findFlips(path,'drie','negentien')
drie vijf
vijf negen up
zeven negen down
zeven acht
acht twaalf up
negentien twaalf down
3 flips
It seems to work correctly on this incredibly tricky test-path I gave it ;).
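For reference, a minimal sketch of what such a flip counter could look like (an assumption on my part, not the actual implementation; it does reproduce the 3 flips on the test path above):

def findFlips(path, start, end):
    # walk the path's triples in order, tracking whether each edge is traversed
    # 'up' (with its direction) or 'down' (against it); every change is a flip.
    # Assumes consecutive edges share a node, as in the test path above.
    flips = 0
    prev_dir = None
    node = start
    for subj, _, obj in path:
        if node == subj:
            direction, node = 'up', obj      # edge followed with its direction
        else:
            direction, node = 'down', subj   # edge followed against its direction
        if prev_dir is not None and direction != prev_dir:
            flips += 1
        prev_dir = direction
    return flips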
Textmining BioMedCentral: Cancer – a trending topic?
*Update*
I added a graph which shows the ratio of articles containing the word ‘Cancer’ to total articles per year. It sadly still suffers from the incomplete data of earlier years:
*Original post*
This is my first attempt to get some data out of the BioMedCentral dataset, the freely available Open Access archive of over 40 years of biomedical research articles. I’ll use this set as a training corpus for my thesis, to extract domain-specific features to use when comparing the similarity of two documents. The dataset consists of 103,782 articles, from 1969 to today.
My text-mining experiment was a very simple one: count the occurrences of the word ‘cancer’ in every article of the journal. My expectation was that the term would occur more frequently as time progresses: as a science journalist I frequently came across (obscure) biomedical research that concluded its findings by somehow linking them to cancer (promising a potential way to discover a potential cure, for instance). I always figured it had to do with funding. But I’m no expert.
Anyway, to test this I threw together a simple Python script to parse each (XML-formatted) article, extract its date and the frequency of the word ‘cancer’, and output this data to a CSV file. I averaged the counts per article per year, resulting in the following graph:
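(A sketch of roughly what such a script could look like – reusing the lxml approach from the corpus-parsing post above; the element names for the body and publication year are assumptions about the BMC XML:)

import csv
import glob
from collections import defaultdict
from lxml import etree

totals = defaultdict(lambda: [0, 0])  # year -> [number of articles, total 'cancer' count]
for path in glob.glob('bmc_xml/*.xml'):  # hypothetical location of the article files
    tree = etree.parse(path)
    year = tree.findtext('.//pub-date/year')  # assumption: year element in the metadata
    body = tree.find('.//body')
    if year is None or body is None:
        continue
    text = ' '.join(body.itertext()).lower()
    totals[year][0] += 1
    totals[year][1] += text.count('cancer')  # note: also matches e.g. 'cancerous'

with open('cancer_per_year.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['year', 'articles', 'avg cancer mentions per article'])
    for year in sorted(totals):
        n, c = totals[year]
        writer.writerow([year, n, c / float(n)])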
I hoped to be able to provide an overview of the frequency of the word in ~40 years of BMC. I wasn’t able to. The first couple of years seem very incomplete: there aren’t many articles (in the hundreds instead of the thousands, as in later years), and there are lots of “(To access the full article, please see PDF)” references (yay to Open Access). Anyway, I figured the last 10 years WERE okay, so I graphed the average occurrence of the word ‘cancer’ over those last couple of years.
Some initial thoughts:
- The average (word count per article) might be the wrong metric here. Articles dedicated to cancer-related topics skew the average too much; I am actually interested in the papers that do not mention the word frequently.
- A better metric could be the ratio of articles that DO contain the word (at least once). I’ll give that a shot later and update this post.
- There does seem to be some increase in occurrence, however I wouldn’t say it’s enough to support my observation.
wadlopen 2
wadphoto
wadboat
Flying home… Seagull leads the way.
wadscape 5
wadscape 4
wadlopen 1