More text-mining. Popularity contest: Drosophila melanogaster vs. C. elegans


While waiting on several word-counting scripts to finish counting, I picked up my cancerCounter script to count something else. This time, I wanted to see which organism was more popular and more frequently mentioned in biomedical studies: the ever-present Drosophila melanogaster, aka the common fruit fly, or the aptly named Caenorhabditis elegans (one cannot deny that the 1 mm-long worm has quite the elegant wiggle). Both are model organisms in biomedical research.

Both have a lot going for them:
– C. elegans was the first multicellular organism to ever have its entire genome sequenced (go worm!)
– The worm reproduces and mutates quickly and easily

The fruit fly, on the other hand, is quite the suitable lab rat as well:
– Drosophila breeds easily
– Does not need much space or care
– Has to pay for invading my kitchen each year during summer

I started counting the occurrence of 'drosophila melanogaster' or 'd. melanogaster' and of 'caenorhabditis elegans' or 'c. elegans' in the lowercased article bodies of my 99,000-and-something BioMedCentral articles corpus, and took a looksy. First comes the total number of articles published per year, together with the number of articles mentioning the fruit fly/worm:

As we can see, worryingly, scientists hardly spend enough time performing research with worms and fruit flies. Since 2003, though, they do consistently play more with the worms than with the fruit flies. But it's hard to see, so let's ditch the total articles:

When we subtract the Drosophila articles from the C. elegans articles, we can see how much of a lead the worm has on the fruit fly. The red bars represent by how many articles C. elegans wins over Drosophila, and the blue bars by how many articles Drosophila wins over C. elegans.

But absolute numbers are not what we're looking for. As we have seen in the first graph, the number of articles is far from evenly distributed over the years. So let's look at the ratio of the difference between the two organisms instead:

This evens out some of the bigger differences in the previous graph; Drosophila had 'only' a +5 win over C. elegans in 2001, but relatively this is a bigger victory than C. elegans' +34 win in 2006, and even its +79 victory in 2009.

Conclusion: Elegans wins.
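For the curious, the counting boils down to something like the sketch below. The folder name and the XML element for the publication date are assumptions; the real script and corpus layout differ.

import os
from collections import Counter
from xml.etree import ElementTree

FLY = ('drosophila melanogaster', 'd. melanogaster')
WORM = ('caenorhabditis elegans', 'c. elegans')

fly, worm = Counter(), Counter()                       # year -> article count
for fname in os.listdir('bmc-corpus'):                 # hypothetical folder
    tree = ElementTree.parse(os.path.join('bmc-corpus', fname))
    year = tree.findtext('.//pubdate')                 # hypothetical element name
    body = ' '.join(tree.getroot().itertext()).lower()
    if any(term in body for term in FLY):
        fly[year] += 1
    if any(term in body for term in WORM):
        worm[year] += 1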

Complaints in the Cloud: Online Complaint Behavior on Twitter

Complaints in the Cloud was the final project for 'Creative Research', a course by Maarten Lamers and Bas Haring, part of the Media Technology MSc. programme's curriculum, in 2009. Together with Barry Borsboom and René Coenen I tried to find a correlation between complaining behavior on Twitter and a 'real-world situation'.

Abstract:

Does Twitter represent the state of affairs in the real world? To research this, we created a dataset consisting of user-generated delay reports gathered from the social network system Twitter [2], and information on delays acquired through the Nederlandse Spoorwegen's RSS feed on delays [3], over the first two weeks of November. Our approach is motivated by the key observation that when people get bored, they tend to grab their mobile phone to kill time. Certain Twitter search queries show there are a lot of people using Twitter in or around a train (station), usually a place where people are either waiting or traveling. The analysis based on our dataset reveals that, in general, the number of Twitter complaints coincides with the duration and number of delays: where the value of one is high, the other generally is as well. Thus, based on the data at hand, we can conclude that there is in fact a correlation between the reported delays and online complaints on Twitter. Unfortunately, we didn't succeed in pointing out a specific relation between individual trajectories and the number of complaints, but this might well be because of the scope of our research.

Download full paper here: DavidGraus-BarryBorsboom-ReneCoenen_CreativeResearch.pdf [142kb]

Academic PDF Reader

The Academic PDF Reader is a project I did together with Bertram Bourdrez for the Human Computer Interaction course, part of the Media Technology curriculum, in 2009. It is an exploration of a new way of displaying and interacting with PDF documents, specifically intended for scientific papers. For the project, I researched the specific process of reading a scientific paper, and based on this research I conceptualized some interaction and design principles.

The main finding is that academic papers are read primarily in a non-serial, scanning fashion, and that readers generally have an accurate mental model of the document's structure. To better support this non-serial reading behaviour, we designed a PDF reader with two ways of presenting documents: on the one hand, a horizontal, 'zoomed-out' view that provides a structured overview of an article's structure and supports the reader's mental model of the document; on the other hand, a more classic serial 'reading' mode.

The project consisted of conceptualizing, researching, designing and developing a novel HCI application, and performing user tests to further evaluate our prototype.

The abstract of our paper:

University students read a large number of scientific articles during their studies. Choosing to read digital texts directly from the computer screen as opposed to printing them first can be a time- and money-saving decision. Experienced readers of academic articles use a similar approach to reading academic documents: in a non-serial fashion, and by knowing the similar structure these articles generally share. We find current PDF readers insufficiently capable of supporting this method of reading. Our PDF reader proposes to support reading academic papers better, by translating beneficial properties of reading from physical paper to the display of digital texts. As the reading method applied by experienced readers is non-serial, it is important to be able to quickly navigate through pages and get a clear overview of the text's structure. Our PDF reader offers an alternative method of displaying digital texts, to optimally support the reading of academic articles.

Download the full paper here: PDF [181kb]

The Real Internet Globalizer

The Real Internet Globalizer is a concept for an internet browser widget, designed together with Barry Borsboom. The widget aims to actively contribute to a more globalized internet. It is designed with three principles in mind:

  1. Creating awareness of the geographical size of the user’s ‘personal internet’
  2. Actively contributing in expanding this geographically confined internet
  3. Providing the user with feedback on the progress made in the geographical size of their internet

For the primary feature, The Real Internet Globalizer gathers geographical data on all the news websites the user visits. It displays the distribution of the different countries visited in an infographic on a map of the world, to provide instant insight into where the user surfs most frequently. For the second feature, TRIG will find and 'suggest' similar content to the user, from parts of the world where the user normally doesn't surf. This enables the user to judge whether other countries provide different points of view. By following the suggestions, the user will develop more globalized surfing behaviour.

The Real Internet Globalizer was inspired by a talk by Ethan Zuckerman during the Cloud Intelligence Symposium at the Ars Electronica Festival in 2009.

Download the paper describing the Real Internet Globalizer here: BarryBorsboom_DavidGraus-TheRealInternetGlobalizer.pdf [266kb]

Direction flip counter


I think I just created a functional direction-flips counter for the directed graph that my SPARQL-powered ontology-pathFinder produces :)).


>>> path = [['drie','>','vijf'],['vijf','>','negen'],['zeven','>','negen'],['zeven','>','acht'],['acht','>','twaalf'],['negentien','>','twaalf']]
>>> findFlips(path,'drie','negentien')
drie vijf
vijf negen
up
zeven negen
down
zeven acht
acht twaalf
up
negentien twaalf
down
3 flips

It seems to work correctly on this incredibly tricky test-path I gave it ;).
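For reference, such a flip counter can be as small as the sketch below (written from scratch here, so names and exact output differ from the original script; it assumes the path is a valid chain of edges):

def find_flips(path, start, end):
    # Walk an ordered list of directed edges [tail, '>', head] from start
    # to end, counting how often the traversal switches between following
    # an edge with the arrow ('down') and against it ('up').
    current, direction, flips = start, None, 0
    for tail, _, head in path:
        step = 'down' if current == tail else 'up'
        if direction is not None and step != direction:
            flips += 1
            print(step)
        direction = step
        print(tail, head)
        current = head if current == tail else tail
    assert current == end, 'path does not end at the given end node'
    print(flips, 'flips')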

Textmining BioMedCentral: Cancer – a trending topic?

*Update*
I added a graph which shows the ratio of articles containing the word ‘Cancer’ to total articles per year. It sadly still suffers from the incomplete data of earlier years:

*Original post*

This is my first attempt to get some data out of the BioMedCentral dataset, the freely available, Open Access archive of over 40 years of biomedical research articles. I'll use this set as a training corpus for my thesis, to extract domain-specific features to use when comparing the similarity between two documents. The dataset consists of 103,782 articles, from 1969 to today.

My text-mining experiment was a very simple one: count the occurrence of the word 'cancer' in every article of the journal. My expectation was that the term would occur more frequently as time progresses: as a science journalist I frequently came across (obscure) biomedical research that concluded its findings by in some way linking them to cancer (promising a potential way to discover a cure, for example). I always figured it had to do with funding. But I'm no expert.

Anyway, to test this I threw together a simple Python script to parse each (XML-formatted) article, extract its date and the frequency of the word 'cancer', and output this data to a CSV file. I averaged the counts per article per year, resulting in the following graph:

I hoped to be able to provide an overview of the frequency of the word in ~40 years of BMC. I wasn't. The first couple of years seem very incomplete: there aren't many articles (hundreds instead of thousands, as in later years), and there are lots of "(To access the full article, please see PDF)" references (yay to Open Access). Anyway, I figured the last 10 years WERE okay, so I graphed the average occurrence of the word 'cancer' over those years.

Some initial thoughts:

  • The average (word count per article) might be the wrong metric here. Articles dedicated to cancer-related topics skew the average too much. I am actually looking for the papers that do not contain the word frequently.
  • A better metric could be the ratio of articles that DO contain the word (at least once). I'll give that a shot later and update this post.
  • There does seem to be some increase in occurrence; however, I wouldn't say it's enough to support my observation.
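For reference, the script was roughly of the shape sketched below; the folder name and the date element are assumptions, and the ratio column is the metric the update above is based on:

import csv
import os
from collections import defaultdict
from xml.etree import ElementTree

counts = defaultdict(list)                          # year -> counts per article
for fname in os.listdir('bmc-corpus'):              # hypothetical folder
    tree = ElementTree.parse(os.path.join('bmc-corpus', fname))
    year = tree.findtext('.//pubdate')              # hypothetical element name
    body = ' '.join(tree.getroot().itertext()).lower()
    counts[year].append(body.count('cancer'))

with open('cancer.csv', 'w') as f:
    writer = csv.writer(f)
    for year in sorted(counts):
        avg = sum(counts[year]) / float(len(counts[year]))
        ratio = sum(1 for c in counts[year] if c) / float(len(counts[year]))
        writer.writerow([year, avg, ratio])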

Simple keyword extraction in Python: choices, choices.

As explained in an earlier post, I am working on a simple method of extracting 'important words' from a text entry. The methods I am using at the moment are frequency distributions and word collocations. I've bumped into some issues regarding fine-tuning my methods. Read on for a short explanation of my approaches, and some issues regarding them.

Frequency Distribution: POS-tagging y/n?

Extracting keywords by frequency distribution is nothing more than counting words and sorting the list of words by occurrence. Before doing this, I filter stopwords from the text entry. The short explanation of how I'm doing this (source code available on GitHub):

» Tokenize the text (using NLTK’s WordPunctTokenizer)
» Lowercase all the words
» ‘Clean’ the list by removing common stopwords from the list (using NLTK’s English stopwords-list)
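In NLTK this boils down to just a few lines; a sketch (the actual script on GitHub may differ in details):

from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

def freq_words(text, n=15):
    tokens = WordPunctTokenizer().tokenize(text)           # 1. tokenize
    words = [t.lower() for t in tokens]                    # 2. lowercase
    stop = set(stopwords.words('english'))
    words = [w for w in words if w not in stop]            # 3. remove stopwords
    return [w for w, _ in FreqDist(words).most_common(n)]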

This is straightforward enough; an example of the results (from the Wikipedia page on 'Apoptosis'):

>>> cyttron.wikiGet('Apoptosis')
Apoptosis in wikiTxt
>>> freqWords(cyttron.wikiTxt,15)
['apoptosis', 'cell', '160', 'apoptotic', 'cells', 'caspase', 'death',
'.&#', 'proteins', 'tnf', 'bcl', 'protein', 'also', 'caspases']

Earlier I was thinking about using POS-tagging (Part-Of-Speech tagging, to identify word types) in order to extract only frequently occurring nouns. I figured losing relevant adjectives (such as 'red' in red blood cell) could be compensated for by the word collocation extraction. POS-tagging the tokenized text and retrieving only the most frequent nouns results in:

>>> freqNouns(cyttron.wikiTxt,15)
['apoptosis', 'cell', 'caspase', 'death', 'protein', 'tnf', 'pathway',
'activation', 'membrane', 'p53', 'response', 'family', 'gene', 'greek']
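A sketch of what such a noun filter can look like, assuming NLTK's default pos_tag and Penn Treebank tags (where noun tags start with 'NN'):

from nltk import FreqDist, pos_tag
from nltk.tokenize import WordPunctTokenizer

def freq_nouns(text, n=15):
    tokens = WordPunctTokenizer().tokenize(text)
    tagged = pos_tag(tokens)                   # (word, tag) pairs
    nouns = [w.lower() for w, tag in tagged if tag.startswith('NN')]
    return [w for w, _ in FreqDist(nouns).most_common(n)]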

My problem here is that I'm not sure which is 'better' (if either of the two), or whether I should maybe use a combination of both. Also, I haven't decided yet how to handle non-alphabetic words. Initially I planned on using regular expressions to filter non-alphabetic strings, but I later figured that wouldn't make sense in my case: in the above example, it would omit 'p53', a tumor suppressor protein, which is very relevant.

While playing around with POS-tagging earlier, I noticed the precision was not quite high enough to perform chunk extraction (by looking for specific phrases or grammatical constructions). Extracting only nouns does seem to do quite a good job, even if I still miss some and get some false positives.

Word Collocations: Stopword filtering y/n?

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. I generate bigram and trigram word collocations, meaning '2-word strings' and '3-word strings'. My issue here is whether or not to use stopword filtering. Here are the results of the word collocation function on the same Wikipedia page, the first list being the bigram collocations, the second the trigrams. Example without stopword filtering:

>>> wordCollo(cyttron.wikiTxt,10,clean=False)
['such as', 'cell death', 'of the', 'due to', 'leads to', 'programmed cell',
'has been', 'bone marrow', 'have been', 'an increase']
['adp ribose polymerase', 'amino acid composition', 'anatomist walther flemming',
'boston biologist robert', 'break itself down', 'combining forms preceded',
'count falls below', 'german scientist carl', 'homologous antagonist killer',
'mdm2 complexes displaces']

Example with stopword filtering:

>>> wordCollo(cyttron.wikiTxt,10,clean=True)
['cell death', 'programmed cell', 'bone marrow', 'university aberdeen',
'calcium concentration', 'adenovirus e1b', 'british journal', 'citation needed',
'highly conserved', 'nitric oxide']
['adp ribose polymerase', 'agar gel electrophoresis', 'amino acid composition',
'anatomist walther flemming', 'appearance agar gel', 'awarded sydney brenner',
'boston biologist robert', 'carl vogt first', 'ceases respire aerobically',
'closely enough warrant']

As you can see, there's lots of garbage in the first example, but also some collocations that do not appear in the cleaned version. Similar to the noun-extraction issue with the previous approach, I wonder if I should choose one of the two, or combine them.
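For reference, such a wordCollo function can be built with NLTK's collocation finders; a sketch (the likelihood-ratio scoring is my assumption, the original may rank differently):

from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.corpus import stopwords
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
from nltk.tokenize import WordPunctTokenizer

def word_collo(text, n=10, clean=False):
    tokens = [t.lower() for t in WordPunctTokenizer().tokenize(text)]
    if clean:
        stop = set(stopwords.words('english'))
        tokens = [t for t in tokens if t not in stop]
    bigrams = BigramCollocationFinder.from_words(tokens)
    trigrams = TrigramCollocationFinder.from_words(tokens)
    return ([' '.join(b) for b in bigrams.nbest(BigramAssocMeasures.likelihood_ratio, n)],
            [' '.join(t) for t in trigrams.nbest(TrigramAssocMeasures.likelihood_ratio, n)])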

In other news

Finding Gensim has been a life-saver! Instead of using difflib to compare two strings, I now use a proper text-similarity metric, namely cosine similarity. I do so by creating a TF-IDF weighted corpus out of the (stopword-cleaned) descriptions of the ontology terms I use, and calculating the cosine similarity between an input string and each entry in the corpus. Gensim makes this all a breeze to do. An example of the output:

>>> wikiGet('alzheimer')
alzheimer in wikiTxt
>>> descMatch(wikiTxt,5)
Label: Alzheimer's disease
Similarity: 0.236387
Description: A dementia that results in progressive memory loss, impaired thinking, disorientation, and changes in personality and mood starting in late middle age and leads in advanced cases to a profound decline in cognitive and physical functioning and is marked histologically by the degeneration of brain neurons especially in the cerebral cortex and by the presence of neurofibrillary tangles and plaques containing beta-amyloid. It is characterized by memory lapses, confusion, emotional instability and progressive loss of mental ability.

Label: vascular dementia
Similarity: 0.192565
Description: A dementia that involves impairments in cognitive function caused by problems in blood vessels that feed the brain.

Label: dementia
Similarity: 0.157553
Description: A cognitive disease resulting from a loss of brain function affecting memory, thinking, language, judgement and behavior.

Label: cognitive disease
Similarity: 0.13909
Description: A disease of mental health that affects cognitive functions including memory processing, perception and problem solving.

Label: encephalitis
Similarity: 0.138719
Description: Encephalitis is a nervous system infectious disease characterized as an acute inflammation of the brain. The usual cause is a viral infection, but bacteria can also cause it. Cases can range from mild to severe. For mild cases, you could have flu-like symptoms. Serious cases can cause severe headache, sudden fever, drowsiness, vomiting, confusion and seizures.

I'm not sure if the similarity numbers it produces indicate I'm doing something wrong (none of the similarities are high), but intuitively I would say the results do make sense.
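For the curious, the Gensim recipe is roughly the sketch below. The labels and descriptions here are shortened stand-ins for my ontology-term data, and the real code differs in details:

from gensim import corpora, models, similarities

labels = ["Alzheimer's disease", 'vascular dementia', 'dementia']
descriptions = ['a dementia that results in progressive memory loss ...',
                'a dementia that involves impairments in cognitive function ...',
                'a cognitive disease resulting from a loss of brain function ...']

texts = [d.lower().split() for d in descriptions]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(bows)                       # TF-IDF weighting
index = similarities.MatrixSimilarity(tfidf[bows],
                                      num_features=len(dictionary))

def desc_match(text, n=5):
    # Rank all descriptions by cosine similarity to the input text.
    query = tfidf[dictionary.doc2bow(text.lower().split())]
    ranked = sorted(enumerate(index[query]), key=lambda p: -p[1])
    return [(labels[i], score) for i, score in ranked[:n]]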

Ontology-based semantic similarity measurements: an overview

My thesis is about keyword extraction from biological notes, using semantic 'dictionaries' called ontologies. These ontologies are large networks, where each node stands for a concept, and each connection between nodes for a relation. See the picture on the right for a visual representation of an ontology.

To identify the subject of a text, I need to see which terms described in an ontology appear in the text. This leaves me with multiple concepts, of which I need to find the 'common denominator'. To do this, I have to measure the similarity (or, inversely, the distance) between two concepts: if I find a bunch of very similar concepts in one text, I can be more confident of the subject.

Luckily, a lot of people have dealt with this ‘ontology-based semantic similarity measurement’. I gathered and studied a couple of papers, and provide a quick overview of my findings. See my literature list for a more complete overview.

DISCLAIMER: This is by no means intended to be an exhaustive overview. It’s short. I’m sure I’ve not read every relevant paper. Due to time constraints my priority is finding a suitable method to carry on and checking to see if the direction I’m heading is OK (it seems that way). My overview deals with global approaches only, nothing too specific. If I’ve missed anything really obvious, I’d be grateful if you could leave a comment :).

There are two main approaches in ontology-based semantic similarity measurement: edge-based (also called structural or hierarchical approach) and node-based (also called information-content approach).

Edge-based

Edge-based approaches take the structure of the network as a base, focussing on the connections between nodes and their implications/meanings. In edge-based semantic similarity measurement, there are three main principles (which are fortunately pretty much globally agreed-upon – at least in the papers I found):

Shortest-path length between nodes
The most direct approach: the closer two nodes are in the network, the more similar they are. Important detail: path length is measured by counting (only!) the nodes connected by an 'is_a' or 'part_of' relation. The most primitive semantic similarity measures use only path lengths. However, this shortest-path measure can be extended with:

Node ‘depth’ (aka specifity)
The deeper a node is (the farther away from the root), the more specific it is. In most papers this does not revolve around an individual node's depth, but around the depth of the two nodes' Least Common Subsumer (LCS). The LCS is the deepest 'shared parent' of two nodes. The depth is defined as the number of nodes separating the LCS from the root concept. The deeper the LCS is, the more similar the concepts are. Also, the granularity of an individual concept has to be considered in calculating its specificity (more granular means more 'subdivisions', which means a more specific concept). This is usually modeled as an extra variable that influences the concept's specificity. This means that a highly granular node will be less similar to a less granular node.

Link’s direction
Ontologies are directed graphs: a connection between two nodes has a direction (chair is_a furniture does not work the other way around). The more changes in direction the path between two nodes has, the less similar the nodes are.
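To make the first two principles concrete: one classic measure that combines path depths with the depth of the LCS is Wu & Palmer's similarity. A minimal sketch, assuming a simplified single-parent 'is_a' hierarchy (real ontologies are DAGs, where a node can have several parents):

def depth(node, parent):
    # Number of 'is_a' hops from the node up to the root.
    d = 0
    while node in parent:
        node, d = parent[node], d + 1
    return d

def lcs(a, b, parent):
    # Least Common Subsumer: the deepest ancestor shared by a and b.
    ancestors = {a}
    while a in parent:
        a = parent[a]
        ancestors.add(a)
    while b not in ancestors:
        b = parent[b]
    return b

def wu_palmer(a, b, parent):
    # sim(a, b) = 2 * depth(LCS) / (depth(a) + depth(b))
    return 2.0 * depth(lcs(a, b, parent), parent) / (depth(a, parent) + depth(b, parent))

# Toy hierarchy: chair is_a furniture, table is_a furniture, furniture is_a artifact
parent = {'chair': 'furniture', 'table': 'furniture', 'furniture': 'artifact'}
print(wu_palmer('chair', 'table', parent))    # LCS is furniture: 2*1/(2+2) = 0.5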

Node-based

Node-based measures do not take the connections in the network as their main resource, but rather the information inside and surrounding the nodes. Text-mining and textual-analysis techniques can be applied here, for example by comparing both concepts' textual data, the concepts' contexts, or the similarity of the concepts' LCS to the individual concepts. In these cases, a node or a node's context is frequently represented as a 'bag of words', disregarding any form of grammar or semantics. Cosine similarity is a measure often used to compare two texts. A common approach to weighing the importance of words in a text is the TF-IDF (term frequency-inverse document frequency) measure.

Other techniques involve counting the number of surrounding nodes (in a way similar to checking a node's granularity), the depth of a node (counting the number of hops from the root node), etc. The way I see it, a node-based approach is a useful extension of an edge-based approach.

Both approaches are applied in a multitude of algorithms, some combining edge- and node-based measures, others dealing with either one of the two. For an overview of some common algorithms and their use of edge- and/or node-based approaches, I highly recommend [1].

Now what?

What’s left for me is to formulate an approach: picking an edge-based similarity measurement algorithm and implementing it, and finding a node-based approach to extend the edge-based approach with.

For the edge-based algorithm, the bare essentials are already in place:

  • A breadth-first search algorithm to determine paths between nodes (sketched below)
  • A method of finding the common parent (LCS) of two nodes
  • A method of counting the depth of a node
  • A method of exploring the context of a node
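The first of these, in its simplest form, looks like this sketch, over a plain adjacency dict and ignoring edge directions:

from collections import deque

def bfs_path(graph, start, goal):
    # graph: dict mapping a node to an iterable of its neighbours.
    # Returns the shortest path from start to goal as a list of nodes.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None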

Then, I am looking to extend the structure-based approach by including a similarity comparison of the linguistic data 'surrounding' each node: retrieving all surrounding nodes of the two nodes I'm comparing, throwing all textual data of each node's surroundings into a 'bag of words', and comparing the one node's bag of words to the other's. This is also called a node-based similarity measure (as opposed to the previously described 'edge-based' measure). I will also look into combining this text-comparison system with keyword extraction.

An extremely useful Python framework I came across for all the text-comparison tasks at hand (bag-of-words model, cosine similarity measures, TF-IDF) is Gensim. It has all the features I could want to use, plus excellent documentation.

“Gensim is a Python framework designed to automatically extract semantic topics from documents, as naturally and painlessly as possible.”

Literature

  1. Pesquita C, Faria D, Falcão AO, Lord P, Couto FM (2009). Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol 5(7): e1000443. doi:10.1371/journal.pcbi.1000443
  2. Al-Mubaid H, Nguyen HA (2006). A cluster-based approach for semantic similarity in the biomedical domain. Conf Proc IEEE Eng Med Biol Soc. 1:2713-2717.
  3. Bramantoro A, Krishnaswamy S, Indrawan M (2005). A Semantic Distance Measure for Matching Web Services. In Proceedings of the WISE Workshops, pp. 217-226.
  4. Gabrilovich E, Markovitch S (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. IJCAI.
  5. Rodríguez MA, Egenhofer MJ (2003). Determining Semantic Similarity among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering, pp. 442-456.
  6. Lee WN et al. (2008). Comparison of ontology-based semantic-similarity measures. AMIA Annu Symp Proc 2008: 384-388.
  7. Spasic I, Ananiadou S, McNaught J, Kumar A (2005). Text mining and ontologies in biomedicine: Making sense of raw text. Briefings in Bioinformatics, pp. 239-251.
  8. Resnik P (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of IJCAI'95, Volume 1, pp. 448-453.
  9. Thiagarajan R, Manjunath G, Stumptner M (2008). Computing Semantic Similarity Using Ontologies. HP Labs.


DBPedia Twitterbot: Introducing @grausPi!

12/12/12 update: since @sem_web moved to live in my Raspberry Pi, I’ve renamed him @grausPi

The last couple of days I've spent working on my graduation project by working on a side project: @sem_web, a Twitter bot that queries DBpedia (Wikipedia's 'linked data' equivalent) for knowledge.

@sem_web is able to recognize 249 concepts, defined by the DBpedia ontology, and sends SPARQL queries to the DBpedia endpoint to retrieve more specific information about them. Currently, this means that @sem_web can check an incoming tweet (mention) for known concepts, and then return an instance (example) of the concept, along with a property of this instance and the value of that property. An example of Sam's output:

[findConcept] findConcept('video game')
[findConcept] Looking for concept: video game
[tweet] [u'http://dbpedia.org/class/yago/ComputerGame100458890',
'video game']

[findInst] Seed: [u'http://dbpedia.org/class/yago/ComputerGame100458890',
'video game']
[findInst] Has 367 instances.
[findInst] Instance: Fight Night Round 3

[findProp] Has 11 properties.
[findProp] [u'http://dbpedia.org/property/platforms', u'platforms']

[findVal] Property: platforms (has 1 values)
[findVal] Value: Xbox 360, Xbox, PSP, PS2, PS3
[findVal] Domain: [u'Thing', u'work', u'software']
[findVal] We're talking about a thing...
Fight Night Round 3 is a video game. Its platforms is Xbox 360, Xbox,
PSP, PS2, PS3.

This is how it works:

  1. Look for words occurring in the tweet that match a given concept's label.
  2. If a concept is found: send a SPARQL query to retrieve an instance of the concept (an object with rdf:type concept), as in the sketch after this list.
  3. If not found: send a SPARQL query to retrieve a subClass of the concept. Go to step 1 with the subClass as concept.
  4. If an instance is found: send SPARQL queries to retrieve a property, value and domain of the instance. The domain is used to determine whether @sem_web is talking about a human or a thing.
  5. If no property with a value is found after several tries: go to step 2 to retrieve a new instance.
  6. Compose a sentence (currently @sem_web has 4 different sentences) with the information (concept, instance, property, value).
  7. Tweet!
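As an illustration of step 2, the instance query can be sent with the SPARQLWrapper library (my choice here for illustration; the bot's actual client code may differ):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql.setQuery('''
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?instance WHERE {
        ?instance rdf:type <http://dbpedia.org/class/yago/ComputerGame100458890> .
    }
''')
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
instances = [b['instance']['value']
             for b in results['results']['bindings']]
print(len(instances), 'instances')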

Besides that, @sem_web posts random tweets once an hour, by picking a random concept from the DBpedia ontology. Working on @sem_web allows me to get to grips with both the SPARQL query language and programming in Python (which, still, is something I haven't done before in a larger-than-20-lines-of-code way).

Comparing concepts

What I'm working on next is a method to compare multiple concepts, when @sem_web detects more than one in a tweet. Currently, this works by taking each concept and querying for all its superClasses. I then store the path from the seed to the topClass (Entity) in a list, repeat the process for the next concept, and then compare both paths to the top, to identify a common parent class.

This is relevant for my graduation project as well, because a large task in determining the right subject for a text will be determining the 'proximity' or similarity of the different concepts in the text. Still, that specific task is a much bigger thing; finding common superClasses is just a tiny step towards it. There are other interesting relationships to explore, for example partOf/sameAs relations. I'm curious to see what kind of information I will gather with this from larger texts.

An example of the concept comparison in action. From the following tweet:

>>> randomFriend()
Picked mendicot: @offbeattravel .. FYI, my Twitter bot 
@vagabot found you by parsing (and attempting to answer) 
travel questions off the Twitter firehose ..

I received the following concepts:

5 concepts found.
[u'http://dbpedia.org/class/yago/Bot102311879',
u'http://dbpedia.org/class/yago/ChangeOfLocation107311115',
u'http://dbpedia.org/class/yago/FYI(TVSeries)',
u'http://dbpedia.org/class/yago/Locomotion100283127',
u'http://dbpedia.org/class/yago/Travel100295701']

The findCommonParent function takes two URIs and processes them, appending each hop's superClasses to a new list per URI. This way I can track all the 'hops' made by counting the list index. As soon as the function has processed both URIs, it starts comparing the path lists to determine the first common parent.
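In essence, the comparison boils down to this sketch (each path being a list of hops from seed to root, and each hop a list of superclass URIs, since a class can have several superclasses):

def find_common_parents(path1, path2):
    # Scan both root-paths, hop by hop, for the first shared superclass.
    for i, hop1 in enumerate(path1):
        for j, hop2 in enumerate(path2):
            shared = set(hop1) & set(hop2)
            if shared:
                return shared.pop(), i, j    # common parent URI + hop counts
    return None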

>>> findCommonParents(found[1],found[3])

[findParents]	http://dbpedia.org/class/yago/ChangeOfLocation107311115
[findParents]	Hop | Path:
[findParents]	0   | [u'http://dbpedia.org/class/yago/ChangeOfLocation107311115']
[findParents]	1   | [u'http://dbpedia.org/class/yago/Movement107309781']
[findParents]	2   | [u'http://dbpedia.org/class/yago/Happening107283608']
[findParents]	3   | [u'http://dbpedia.org/class/yago/Event100029378']
[findParents]	4   | [u'http://dbpedia.org/class/yago/PsychologicalFeature100023100']
[findParents]	5   | [u'http://dbpedia.org/class/yago/Abstraction100002137']
[findParents]	6   | [u'http://dbpedia.org/class/yago/Entity100001740']
[findCommonP]	1st URI processed

[findParents]	http://dbpedia.org/class/yago/Locomotion100283127
[findParents]	Hop | Path:
[findParents]	0   | [u'http://dbpedia.org/class/yago/Locomotion100283127']
[findParents]	1   | [u'http://dbpedia.org/class/yago/Motion100279835']
[findParents]	2   | [u'http://dbpedia.org/class/yago/Change100191142']
[findParents]	3   | [u'http://dbpedia.org/class/yago/Action100037396']
[findParents]	4   | [u'http://dbpedia.org/class/yago/Act100030358']
[findParents]	5   | [u'http://dbpedia.org/class/yago/Event100029378']
[findParents]	6   | [u'http://dbpedia.org/class/yago/PsychologicalFeature100023100']
[findParents]	7   | [u'http://dbpedia.org/class/yago/Abstraction100002137']
[findParents]	8   | [u'http://dbpedia.org/class/yago/Entity100001740']
[findCommonP]	2nd URI processed

[findCommonP]	CommonParent found!
[findCommonP]	Result1[3][0] matches with result2[5][0]
[findCommonP]	http://dbpedia.org/class/yago/Event100029378
[findCommonP]	http://dbpedia.org/class/yago/Event100029378

Here you can see the first common parentClass is ‘Event’: 3 hops away from ‘ChangeOfLocation’, and 5 hops away from ‘Locomotion’. If it finds multiple superClasses, it will process multiple URIs at the same time (in one list). Anyway, this is just the basic stuff. There’s plenty more on my to-do list…

While the major part of the functionality I'm building for @sem_web will be directly usable for my thesis project, I haven't been sitting still with more directly thesis-related things either. I've set up a local RDF store (a Sesame store) on my laptop with all the needed bio-ontologies; RDFLib's in-memory stores were clearly not up to the large ontologies I had to load each time. This also means I have to structure my queries better, as not all information is available at any given time. I also – unfortunately – learned that one of my initial plans, finding the shortest path between two nodes in an RDF store to determine 'proximity', is actually quite a complicated task. Next I will focus more on improving the concept comparison, taking more properties into account than only rdfs:subClassOf, and I'll also work on extracting keywords (for which I still have to arrange testing data)… Till next time!

But mostly, these last few weeks I've been learning SPARQL, improving my Python skills, and getting a better and more concrete idea of the possible approaches for my thesis project by working on @sem_web.

[All thesis-related posts]

Embodied Vision Turtle

Project by Peter Curet & David Graus for the ‘Embodied Vision’ course by Joost Rekveld for the Media Technology MSc. Programme at Leiden University.

We track the movement in the webcam input (adding up all movement towards the left and right, and up and down). This results in two numbers which represent the total amount of movement since the start.

The turtle graphic system draws on the basis of character input:
– 'w' makes it move forward
– 'a' makes it turn left (but doesn't draw anything)
– 'd' makes it turn right (same)
– 's' changes the thickness of the line
– 'c' changes the color

The turtle receives a number of random strings from the genetic algorithm. It calculates the amount and direction of movement each string results in, and then compares these numbers to the numbers of the webcam movement. The more alike, the fitter we consider the string. We select the fittest string out of the number of strings received, and make the turtle draw it. This string is the basis for the 'next generation' of strings: it is fed to the genetic algorithm, which evolves it into multiple other strings. The process repeats to infinity. Since the webcam input is dynamic and ever-changing, the fitness of the strings will not gradually rise, but is an ever-changing value.
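A sketch of how such a fitness function can be set up (step sizes and names are mine; the actual project code differs):

import math

def net_movement(commands, step=10.0, turn=90.0):
    # Simulate the turtle and return its net (dx, dy) displacement.
    x = y = heading = 0.0
    for c in commands:
        if c == 'w':                     # move forward
            x += step * math.cos(math.radians(heading))
            y += step * math.sin(math.radians(heading))
        elif c == 'a':                   # turn left
            heading += turn
        elif c == 'd':                   # turn right
            heading -= turn
    return x, y

def fitness(commands, cam_dx, cam_dy):
    # The closer a string's net movement is to the webcam's accumulated
    # movement, the fitter we consider the string.
    dx, dy = net_movement(commands)
    return -((dx - cam_dx) ** 2 + (dy - cam_dy) ** 2)

# The turtle draws the fittest string of each generation, e.g.:
# best = max(generation, key=lambda s: fitness(s, cam_dx, cam_dy))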

CS Column 4: Numbers

Small scales, huge numbers

I've recently been reading a bit about nanotechnology, and I realized the contradiction that thinking about such insanely small scales brings: you always end up dealing with huge numbers.

In order to try to grasp something as tiny as a nanometer, we try to convert it to the next best thing: the smallest tangible, imaginable distance, the smallest scale on your average ruler, a millimeter. This conversion forces us to use hard-to-imagine scales: a million nanometers are supposed to fit between two of the ten tiny lines on your ruler which divide a centimeter. One million, in such a tiny space? To me, it's impossible to even imagine such a number, let alone to mentally chop up this millimeter of space on a ruler into a million bits. How do I know – other than 'very tiny' – how big a nanometer is?

One example I recently came across stuck with me. Supposedly, during the time it takes us to pronounce the word 'nanometer', our hair grows ten nanometers! This fact impressed me. But this hair example is no stranger when it comes to imagining small scales: there are a few often-used ways of making nanometers or other small scales imaginable, one of them being to use the width of a human hair to illustrate the nano-scale.

But how helpful is that? According to one source, the average width of a human hair can vary from around 17 to 181 µm (that's micrometers: millionths of a meter, huge compared to a nanometer). That means a human hair can vary from 17,000 to 181,000 nanometers in width. Let's take a look back at our hair growth example: while your hair grows ten nanometers in one direction (in the time it takes you to pronounce the word 'nanometer'), in the other direction it can be up to 181,000 nanometers across. That puts this impressive fact into perspective.

In the end, a nanometer is an abstract unit of measure that we cannot use in everyday life. And why would we? We can't use it for 'real-life' measurement. We either use it when scaling down from bigger scales, and consequently end up with huge numbers, or we use it when we deal with the totally abstract world of molecules and atoms, which is even harder to imagine. Any attempt to make the scale tangible deals with intangible smallness. We're always stuck with the contradiction of using huge numbers to imagine tiny scales.

CS Column 3: Uncertainty

The case of my disappearing socks

I keep losing stuff. Even though I live on a surface of seven square meters, I manage to misplace and lose all kinds of things. More than once, pairs of my socks have gotten separated, resulting in me having to wear two different socks. This leaves me wondering: did I lose these socks, or do they magically disappear by themselves? More often than not, the latter seems more likely to me.

Like a religious man clinging to old stories to explain the inexplicable, I arm myself with science. "It's not my fault," I tell my girlfriend, "it's because my socks are wavy." "… it has to do with quantum mechanics!" I bluff. This intimidating set of scientific principles can be my best friend when I'm blamed for losing stuff.

It works from inside the socks. Let's take a closer look at my socks. Zoom in all the way, until the separate fibers that make up the sock's fabric are exposed. Now keep on zooming, until eventually the structure of these fibers shows itself at the molecular scale. Keep on zooming until you reach the atom level, previously thought to be the smallest scale in our universe. Now we're close: keep on zooming, until finally these elements break down into their subatomic parts – electrons and atomic nuclei, made up of protons and neutrons. This is where the magic happens. This is what makes my sock disappear.

The problem lies in the behavior of the tiny particles that make up the atoms. Take electrons, for example: we imagine electrons as tiny balls that fly in never-ending circles around the atomic nuclei. But they're not simply miniscule balls flying around; they don't behave like particles in a fixed trajectory. At least, sometimes they do. But at other times, they behave like a wave.

Now this wavy behavior is interesting: since a wave is never in one location at any given time, but rather in multiple locations 'spread out through space', it is impossible to know or measure the exact position of an electron at a specific moment in time. This means an electron has a multitude of possible locations at any moment.

So if the things in atoms behave like wavy things – wavy things with multiple possible positions, of which we can’t pinpoint the exact one – doesn’t that mean this also goes for the atoms they constitute, and for the molecules the atoms add up to, and consequently for the fibers of the fabric that make the sock? Wouldn’t it mean that if all atoms ‘wave’ their way to some other place, my sock would ride along in this atomic wave, and change its position?

So the key question is: are my socks really wavy!? Unfortunately, the answer is no. It's not as simple as I'd like it to be: upscaling the weirdness of the microscopic world to the real world just doesn't work. The reason a subatomic particle can show wavy behavior is not its scale, but its isolation. Only if a subatomic particle is completely isolated does it behave like a weird wavy thing. More surprisingly, this also implies that even to this day, science has failed to demystify the underlying mechanism of my disappearing socks. I can still bluff my way through, though. Quantum mechanics is to blame!

Read my 2nd column for the Cool Science class:
» Mobb Deep’s Vision on Evolution Theory

Read my 1st column for the Cool Science class:
» Emerging Chaos – The Rules of Vietnamese Traffic