Summer internship at Microsoft Research Redmond

Even though I am still in the process of arranging all the paperwork, I am stoked to be spending my summer in Redmond, Washington at Microsoft Research! To be precise, I will join the CLUES group (Context, Learning and User Experience for Search), where I’ll be mentored by Paul Bennett, Ryen White, and Eric Horvitz.

Seattle, here I come 😄!

Microsoft Campus @ Redmond

Talk at BètaBreak — “Behind the Algorithm”

BètaBreak is a monthly panel discussion in the main hall of the University of Amsterdam’s Science Park, where guests speak about current topics in science. I have been invited to join a panel discussion titled “Behind the Algorithm” (subtitled: “What the Internet hides from you”) on algorithms, personalization, filtering, and the filter bubble, together with Joost Schellevis and Manon Oostveen. It takes place this Wednesday! See the video and flyer below.

Video

Flyer

NieuwsInzicht proposal for innovating the press

Together with 904Labs’ Wouter Weerkamp and Manos Tsagkias, I’ve submitted a project proposal for funding by the “Stimuleringsfonds voor de Journalistiek” (the Dutch fund for press innovation). The idea in a nutshell: automated knowledge base construction for news(paper) archives.

For more information, see the abstract below (translated from Dutch), our website nieuwsinzicht.nu, and our submitted proposal at Persinnovatie.nl!

nieuwsinzicht

NieuwsInzicht is an automatically constructed, structured knowledge base centered on topics from the news.

News is inherently about people, places, organizations, or products. NieuwsInzicht is an online knowledge base centered on these topics from regional and national news. Unlike Wikipedia, this knowledge base is populated not by users but by algorithms.

When politicians become embroiled in controversy, journalists need to dig through news and newspaper archives for background information: what is known about these people? Where have they worked? With whom have they worked? Today, journalists still rely on manually searching archives such as LexisNexis, Google News, or self-selected sources.

NieuwsInzicht scrapes content from regional and national newspapers and news sites, and uses automatic text analysis to identify the people, places, products, and organizations that are mentioned. NieuwsInzicht organizes these topics into individual pages, with links to the sources in which they are mentioned and analyses of the collected content. In this way, NieuwsInzicht offers an at-a-glance overview of which topics have appeared in the media, what has been published, from which sources, when, and how different topics relate to one another.
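
To make that pipeline concrete (purely as an illustration, not part of the proposal itself), here is a minimal Python sketch in which a naive capitalized-words heuristic stands in for real named-entity recognition, and a dictionary stands in for the knowledge base:

import re
from collections import defaultdict

# Toy stand-in for a scraped news archive: (source, text) pairs.
articles = [
    ("paper-a", "Alderman Jansen spoke in Amsterdam about the budget."),
    ("site-b", "Jansen previously worked for the municipality of Utrecht."),
]

def extract_entities(text):
    # Naive placeholder for named-entity recognition: runs of
    # capitalized words (sentence-initial words cause false hits).
    return set(re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text))

# The "knowledge base": each entity maps to the sources mentioning it.
knowledge_base = defaultdict(set)
for source, text in articles:
    for entity in extract_entities(text):
        knowledge_base[entity].add(source)

for entity, sources in sorted(knowledge_base.items()):
    print "%s: mentioned in %s" % (entity, ", ".join(sorted(sources)))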


Thesis cover “Time-Aware Online Reputation Analysis”

So Dr. Maria-Hendrike Peetz successfully defended her PhD thesis today (congrats, Dr. Peetz!). I made the cover (click for the PDF), and we added fancy UV-spot glossy ripples, which are awesome. On top of that, during the defense, Max Welling from Hendrike’s committee asked a “warmup” question about the cover :-). Such joy.

Time-Aware Online Reputation Analysis

PyLucene 4.0 (in 60 seconds) tutorial

PyLucene’s extensive documentation

As I’ve recently had the joy of struggling with PyLucene again (after many years), I re-entered the void of documentation straight after actually managing to compile and install the thing. While googling, I repeatedly ended up at the five-year-old blog post “PyLucene 3.0 in 60 seconds — tutorial sample code for the 3.0 API” by Joseph Turian (which conveniently lets one infer the syntax and functionality of PyLucene). However, the example code no longer works, as some things changed in PyLucene 4.0; in particular:

Starting with version 4.0, pylucene changed from a flat to nested namespace, mirroring the java hierarchy. ~ source
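
Concretely, what used to be a single flat import from the lucene module in 3.x becomes a set of nested package imports in 4.x (a before/after sketch, based on the imports from the old blog post; don’t mix the two styles in one file):

# PyLucene 3.x: everything lives in the flat "lucene" namespace
from lucene import StandardAnalyzer, IndexWriter, SimpleFSDirectory

# PyLucene 4.x: nested namespaces, mirroring the Java package hierarchy
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriter
from org.apache.lucene.store import SimpleFSDirectory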

I am running PyLucene 4.10.1, so I find whatever I need in the 4.10.1 Javadocs. Below is the “PyLucene 3.0 in 60 seconds” blog post example updated for PyLucene 4.0 (and beyond…?), which I figured may be of use to those who are starting to dabble in PyLucene. Many thanks to Joseph for the original post!

Indexer

import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
  lucene.initVM()  # start the JVM; required before touching any Lucene class
  indexDir = SimpleFSDirectory(File("index/"))
  writerConfig = IndexWriterConfig(Version.LUCENE_4_10_1, StandardAnalyzer())
  writer = IndexWriter(indexDir, writerConfig)

  print "%d docs in index" % writer.numDocs()
  print "Reading lines from sys.stdin..."
  # Index each line from stdin as a separate document with one "text" field,
  # stored (so we can retrieve it later) and analyzed (so it is searchable).
  for n, l in enumerate(sys.stdin, 1):
    doc = Document()
    doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
    writer.addDocument(doc)
  print "Indexed %d lines from stdin (%d docs in index)" % (n, writer.numDocs())
  print "Closing index of %d docs..." % writer.numDocs()
  writer.close()
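
To try the indexer, pipe in some text, one document per line (assuming you saved the script above as indexer.py):

echo "Find this sentence please" | python indexer.py

This writes the index to the index/ directory, which the retriever below reads from.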

Retriever

import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.index import IndexReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()  # start the JVM; required before touching any Lucene class
    analyzer = StandardAnalyzer(Version.LUCENE_4_10_1)
    # IndexReader.open(Directory) still works in 4.x but is deprecated;
    # DirectoryReader.open(Directory) is the suggested replacement.
    reader = IndexReader.open(SimpleFSDirectory(File("index/")))
    searcher = IndexSearcher(reader)

    # Parse the query against the "text" field we indexed above,
    # using the same analyzer as at indexing time.
    query = QueryParser(Version.LUCENE_4_10_1, "text", analyzer).parse("Find this sentence please")
    MAX = 1000  # maximum number of hits to retrieve
    hits = searcher.search(query, MAX)

    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)
    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        doc = searcher.doc(hit.doc)  # fetch the stored document by its id
        print doc.get("text").encode("utf-8")
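
Assuming you saved this as retriever.py and indexed the example line above, running python retriever.py should report one matching document and print its stored text. Note that doc.get("text") only returns the original line because the indexer stored the field (Field.Store.YES); an unstored field would be searchable but not retrievable.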