PyLucene 4.0 (in 60 seconds) tutorial
March 19, 2015
As I’ve recently had the joy of struggling with using PyLucene (after many years), I re-entered the void of documentation straight after actually managing to compile and install the thing. I ended up at the five year old blog post “PyLucene 3.0 in 60 seconds — tutorial sample code for the 3.0 API” by Joseph Turian (that conveniently lets one infer syntax and functionalities of PyLucene) many times while googling. However, […]
Algorithm performance measurement: Confusion Matrix
March 3, 2012
As I am starting to gather testing data, I figured it’d be a good time to determine how to measure the performance of the different results of my 24 text representation algorithms. What I want to measure per algorithm: how many keywords are predicted the same as the experts, and how many aren’t. After some research […]
Pathfinder finds paths!
November 20, 2011
First results are in! My computer spent about 20 hours to retrieve and store neighboring concepts of over 10.000 concepts which my Breadth-First-Search algorithm passed through to find the shortest paths between 6 nodes. But here is the result, first the ‘before’ graph, which I showed earlier: all retrieved concepts with their parent relations. Below that is […]
Ontology-based Text Visualization: Towards a Visual Language
November 12, 2011
The first steps towards creating graph-based text visualizations is thinking about how I would like to visually represent the structure of the information I extract from texts. I have included a preliminary example of the output at the bottom of this page. I use three different methods of extracting concepts from a piece of text: […]
Add some #dataviz in the mix
October 25, 2011
All this time I have been working on my ontology-powered topic identification system: combining semantic web technologies with natural language processing and text-mining techniques to extract (a) topic(s) from a text. But I never really decided what my program should output: a topic? A list of potential topics with their ‘likeliness’? That would mean the […]
#OccupyAmsterdam wordle
October 16, 2011
Wordle van de 200 meest voorkomende woorden in tweets met hashtag #OccupyAmsterdam. Gemaakt van 5.239 tweets van tussen zaterdag 8 oktober 09:55 uur en 16 oktober 15:50 uur. Handmatig gefilterd op nicknames en nietszeggende woorden. Hier is de lijst van de 1000 meest voorkomende woorden: OccupyAmsterdam-woorden.
Computing string similarity with TF-IDF and Python
October 3, 2011
“The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.”[wikipedia] It is also the weight I use to measure similarity between texts, for these two […]
Python graphs and visualizations
September 18, 2011
To my right is a visualization of the output of my SPARQL-powered shortest path algorithm, finding a link between ‘intracellular and extracellular accumulation’ & ‘developmental and adult structural defect’, 2 concepts in the Mouse Pathology ontology. Click it! It shows the two ‘source’ concepts in white, and the shortest path (of 3 nodes: 4 hops) in […]
Simple keyword extraction in Python: choices, choices.
September 12, 2011
As explained in an earlier post, I am working on a simple method of extracting ‘important words’ from a text-entry. The methods I am using at the moment are frequency distributions and word collocations. I’ve bumped into some issues regarding finetuning my methods. Read on for a short explanation of my approaches, and some issues […]
Ontology-based semantic similarity measurements: an overview
September 9, 2011
My thesis is about keyword extraction of biological notes, using semantic ‘dictionaries’ called ontologies. These ontologies are large networks, where each node stands for a concept, and each connection between nodes for relations. See the picture on the right for a visual representation of an ontology. To identify the subject of a text, I need […]
Results? Thesis #5
September 5, 2011
As promised, I have spent the last two weeks generating a lot (but not quite 120) results. So let’s take a quick look at what I’ve done and found. First of all, the Cyttron DB. Here I show 4 different methods of representing the Cyttron database, the 1st is as-is (literal), the 2nd by keyword […]
Graduation pt. 4: What’s next
August 19, 2011
Just a quick update to let myself know what’s going to happen next: It’s time to produce some results! While I was getting quite stuck in figuring out the best – or rather, most practical – way to extract keywords from a text (and not just any text, mind you, but notes of biologists), my supervisor […]
DBPedia Twitterbot: Introducing @grausPi!
August 16, 2011
12/12/12 update: since @sem_web moved to live in my Raspberry Pi, I’ve renamed him @grausPi The last couple of days I’ve spent working on my graduation project by working on a side-project: @sem_web; a Twitter-bot who queries DBPedia [wikipedia’s ‘linked data’ equivalent] for knowledge. @sem_web is able to recognize 249 concepts, defined by the DBPedia ontology, and […]
Graduation Project pt. 2
July 29, 2011
So, I am well underway finalizing the first part of my graduation project, the information extraction part. To re-iterate, I am currently working on matching textual content of a database to that of several ontology-files (big dictionaries containing loads of ‘things’ with relations defined). This is a flow-chart of the system I’m planning to build:
Graduation project
July 8, 2011
Currently I am working on my final project of the Media Technology MSc. Programme of Leiden University. With the goal of structuring my thoughts and process so far, and because I’ve promised on Twitter, I decided to write a small and simple summary of what my project is about, how I got here and what I’m expecting to do in the next 2-3months.