Even though I am still in the process of arranging all the paperwork, I am stoked to be spending my summer in Redmond, Washington at Microsoft Research! To be precise, I will join the CLUES group (Context, Learning and User Experience for Search), where I’ll be mentored by Paul Bennett, Ryen White, and Eric Horvitz.
BètaBreak is a monthly panel discussion in the main hall of the University of Amsterdam’s Science Park, where guests speak about current topics in science. I have been invited to join a panel discussion titled “Behind the Algorithm” (subtitled: “What the Internet hides from you”) on algorithms, personalization, filtering, and the filter bubble, together with Joost Schellevis and Manon Oostveen. Wednesday! See the video + flyer below.
Together with 904Labs’ Wouter Weerkamp and Manos Tsagkias, I’ve submitted a project proposal for funding by the “Stimuleringsfonds voor de Journalistiek” (the Dutch Journalism Fund), which supports innovation in the press. In a nutshell, the idea is automated knowledge base construction for news(paper) archives.
For more information, see the abstract below (originally in Dutch), our website nieuwsinzicht.nu, and our submitted proposal at Persinnovatie.nl!
NieuwsInzicht is an automatically constructed, structured knowledge base centered on topics from the news.
News is inherently about people, places, organizations, or products. NieuwsInzicht is an online knowledge base centered on these topics from regional and national news. Unlike Wikipedia, this knowledge base is populated not by users but by algorithms.
When politicians come under fire, journalists need to dig through news and newspaper archives for background information: what is known about these people? Where have they worked? With whom have they worked? Today, journalists still rely on manually searching archives such as LexisNexis, Google News, or hand-picked sources.
NieuwsInzicht scrapes content from regional and national newspapers and news sites, and uses automatic text analysis to identify the people, places, products, and organizations that are mentioned. NieuwsInzicht organizes these topics into individual pages, with links to the sources in which they are mentioned, and analyses of the collected content. NieuwsInzicht thus offers an at-a-glance overview of which topics have appeared in the media, what has been published, from which sources, when, and how different topics relate to one another.
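To give a flavor of the kind of text analysis involved, here is a deliberately simple toy sketch: spot candidate named entities as runs of capitalized words, and link each one back to the articles that mention it. This heuristic is my own illustration, not the actual NieuwsInzicht pipeline; a production system would use a trained named-entity recognizer.

```python
import re
from collections import defaultdict

# Toy heuristic: a candidate "entity" is a run of two or more
# capitalized words. Real systems use trained NER models instead.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b")

def extract_candidates(text):
    """Return capitalized multi-word spans as candidate entities."""
    return CANDIDATE.findall(text)

def build_index(articles):
    """Map each candidate entity to the ids of articles mentioning it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for entity in extract_candidates(text):
            index[entity].add(article_id)
    return index

if __name__ == "__main__":
    articles = {
        "a1": "Mark Rutte visited Den Haag to meet local journalists.",
        "a2": "The council in Den Haag debated the new housing plan.",
    }
    for entity, ids in sorted(build_index(articles).items()):
        print(entity, "->", sorted(ids))
```

Each entity page in NieuwsInzicht would then aggregate exactly this kind of mapping, entity to mentioning sources, at archive scale.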
So Dr. Maria-Hendrike Peetz successfully defended her PhD thesis today (congrats, Dr. Peetz!). I made the cover (click for the PDF), and we added fancy UV-spot glossy ripples, which are awesome. On top of that, during the defense, Max Welling from Hendrike’s committee asked a “warm-up” question about the cover :-). Such joy.
As I’ve recently had the joy of struggling with PyLucene again (after many years), I re-entered the void of documentation straight after actually managing to compile and install the thing. While googling, I repeatedly ended up at the five-year-old blog post “PyLucene 3.0 in 60 seconds - tutorial sample code for the 3.0 API” by Joseph Turian, which conveniently lets one infer the syntax and functionality of PyLucene. However, the example code no longer works, as some things changed in PyLucene 4.0. In particular:
Starting with version 4.0, pylucene changed from a flat to nested namespace, mirroring the java hierarchy. ~ source
I am running PyLucene 4.10.1, so I find whatever I need in the 4.10.1 Javadocs. Below is the “PyLucene 3.0 in 60 seconds” example updated for PyLucene 4.0 (and beyond…?), which I figured may be of use to those who are starting to dabble in PyLucene. Many thanks to Joseph for the original post!
Indexer
import sys
import lucene
from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version
if __name__ == "__main__":
    lucene.initVM()
    indexDir = SimpleFSDirectory(File("index/"))
    writerConfig = IndexWriterConfig(Version.LUCENE_4_10_1, StandardAnalyzer())
    writer = IndexWriter(indexDir, writerConfig)

    print "%d docs in index" % writer.numDocs()
    print "Reading lines from sys.stdin..."
    for n, l in enumerate(sys.stdin):
        doc = Document()
        doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)
    print "Indexed %d lines from stdin (%d docs in index)" % (n + 1, writer.numDocs())
    print "Closing index of %d docs..." % writer.numDocs()
    writer.close()
Retriever
import sys
import lucene
from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.index import IndexReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version
if __name__ == "__main__":
    lucene.initVM()
    analyzer = StandardAnalyzer(Version.LUCENE_4_10_1)
    reader = IndexReader.open(SimpleFSDirectory(File("index/")))
    searcher = IndexSearcher(reader)

    query = QueryParser(Version.LUCENE_4_10_1, "text", analyzer).parse("Find this sentence please")
    MAX = 1000
    hits = searcher.search(query, MAX)

    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)
    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        doc = searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")
I created a logo for SEA: Search Engines Amsterdam, a monthly meetup where industry and academia talk about search engines and information retrieval. It looks like this:
“The award is given to an ILPS member that has made significant contributions to the group as a whole. The winner is selected based on votes by group members.” [source]