Having recently had the joy of struggling with PyLucene again (after many years), I re-entered the void of documentation straight after managing to compile and install the thing. While googling, I repeatedly ended up at the five-year-old blog post “PyLucene 3.0 in 60 seconds: tutorial sample code for the 3.0 API” by Joseph Turian, which conveniently lets one infer the syntax and functionality of PyLucene. However, its example code no longer works, because some things changed in PyLucene 4.0. In particular:
“Starting with version 4.0, pylucene changed from a flat to nested namespace, mirroring the java hierarchy.” ~ source
I am running PyLucene 4.10.1, so I find whatever I need in the 4.10.1 Javadocs. Below is the example from the “PyLucene 3.0 in 60 seconds” blog post, updated for PyLucene 4.0 (and beyond…?), which I figured may be of use to those who are starting to dabble in PyLucene. Many thanks to Joseph for the original post!
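To make the namespace change concrete, here is a small sketch that prints the old flat-style import next to the new nested-style import for each class used in this post. It is plain Python and does not need PyLucene installed. The `JAVA_PACKAGES` table and the two helper functions are my own illustration (not part of PyLucene); the package names are the ones the Javadocs list for each class.

```python
# Java packages (as listed in the Lucene 4.10.1 Javadocs) for the classes
# imported in this post. Under PyLucene 3.x these names were all imported
# from the flat `lucene` module instead.
# NOTE: this table and the helpers below are illustrative, not a PyLucene API.
JAVA_PACKAGES = {
    "File": "java.io",
    "StandardAnalyzer": "org.apache.lucene.analysis.standard",
    "Document": "org.apache.lucene.document",
    "Field": "org.apache.lucene.document",
    "IndexWriter": "org.apache.lucene.index",
    "IndexWriterConfig": "org.apache.lucene.index",
    "IndexSearcher": "org.apache.lucene.search",
    "QueryParser": "org.apache.lucene.queryparser.classic",
    "SimpleFSDirectory": "org.apache.lucene.store",
    "Version": "org.apache.lucene.util",
}


def flat_import(class_name):
    """Import line in the PyLucene 3.x (flat namespace) style."""
    return "from lucene import %s" % class_name


def nested_import(class_name):
    """Import line in the PyLucene 4.x (nested namespace) style."""
    return "from %s import %s" % (JAVA_PACKAGES[class_name], class_name)


if __name__ == "__main__":
    # Print old and new import lines side by side.
    for name in sorted(JAVA_PACKAGES):
        print("%-40s ->  %s" % (flat_import(name), nested_import(name)))
```

In other words: to port a 3.x script, look up each class in the Javadocs and import it from its Java package rather than from `lucene`. Only `import lucene` itself (for `lucene.initVM()`) stays as it was.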
Indexer
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    # Start the JVM; this has to happen before any Lucene class is used.
    lucene.initVM()
    indexDir = SimpleFSDirectory(File("index/"))
    writerConfig = IndexWriterConfig(Version.LUCENE_4_10_1, StandardAnalyzer())
    writer = IndexWriter(indexDir, writerConfig)

    print "%d docs in index" % writer.numDocs()
    print "Reading lines from sys.stdin..."
    n = 0  # stays 0 if stdin is empty
    for n, l in enumerate(sys.stdin, 1):  # count lines starting from 1
        # Each input line becomes one document with a stored, analyzed "text" field.
        doc = Document()
        doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)
    print "Indexed %d lines from stdin (%d docs in index)" % (n, writer.numDocs())
    print "Closing index of %d docs..." % writer.numDocs()
    writer.close()
Retriever
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    # Start the JVM; this has to happen before any Lucene class is used.
    lucene.initVM()
    analyzer = StandardAnalyzer(Version.LUCENE_4_10_1)
    # DirectoryReader.open replaces IndexReader.open, which is deprecated in 4.x.
    reader = DirectoryReader.open(SimpleFSDirectory(File("index/")))
    searcher = IndexSearcher(reader)

    # Parse the query against the "text" field written by the indexer.
    query = QueryParser(Version.LUCENE_4_10_1, "text", analyzer).parse("Find this sentence please")
    MAX = 1000  # maximum number of hits to return
    hits = searcher.search(query, MAX)

    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)
    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        # Fetch the stored document and print its original text.
        doc = searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")