Getting started with Lucene: Creating an Index

This is based on an article we originally published on the Architexa Blog.

One of the first steps in using Lucene is creating an index. I wanted to write down the parts of this process that I found interesting while learning Lucene. Hopefully this will help others who are trying to find detailed information on one of the most popular search infrastructures available.

The Index is the heart of any component that uses Lucene. Much like the index of a book, it organizes your data so that it is quickly accessible. An index consists of Documents, each containing one or more Fields. Documents and Fields can represent whatever you choose, but a common metaphor is that Documents represent rows in a database table and Fields are akin to the columns of that table.
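To make the book-index analogy concrete, here is a minimal toy sketch in plain Java. It is not Lucene's API: ToyIndex, add and lookup are hypothetical names, and the "analysis" is just lowercasing and splitting on non-alphanumeric characters. It only illustrates the idea that each document is a bag of fields, and that the index maps each term back to the documents that contain it.

```java
import java.util.*;

/** Toy model of an index. NOT Lucene's API: every name here is hypothetical. */
final class ToyIndex {
    // Each document is a map of field name -> field value.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Adds a document and returns its id (its position in the list). */
    int add(Map<String, String> fields) {
        int id = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Naive analysis: lowercase, then split on anything that is not a letter or digit.
            String[] terms = field.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String term : terms) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return id;
    }

    /** Looks up a term in one field without scanning every document. */
    SortedSet<Integer> lookup(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // Documents in the same index do not need to share the same fields.
        index.add(Map.of("title", "Getting started", "body", "Creating an index with Lucene"));
        index.add(Map.of("name", "Jane Doe", "phone", "555 0100"));
        System.out.println(index.lookup("body", "lucene")); // prints [0]
    }
}
```

A real Lucene index stores far more than this (positions, frequencies, stored values), but the lookup-by-term shape is the same.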

Creating the IndexWriter

Lucene indexes must be created in a specific Directory and with a specific Analyzer. The Directory can be in memory, but it is typically an FSDirectory since most indexes are too large to store in memory. The Analyzer parses data that is added to your index into tokens so that it will be searchable. More details can be found here.

// FSDirectory.open() takes a java.io.File rather than a String path.
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/index")), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

As seen above, there are three decisions to make: which directory the index should live in, which type of Analyzer works best for your data, and (the third parameter) whether or not to create a new index. The fourth argument, MaxFieldLength.LIMITED, caps indexing at the first 10,000 terms of each field. NOTE: If the third parameter is true you will overwrite any index currently residing in the chosen directory.
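Because passing true wipes whatever index is already there, it can be safer to compute the flag than to hard-code it. The sketch below is one possible way to do that using only the JDK. CreateFlag and shouldCreate are hypothetical names, not part of Lucene, and the check rests on an assumption: that an existing Lucene index directory contains files whose names start with "segments".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of Lucene: decides the value of the 'create' flag. */
final class CreateFlag {
    /**
     * Returns true when no index appears to exist yet in indexDir.
     * Assumption: an existing index directory holds files named "segments...".
     */
    static boolean shouldCreate(Path indexDir) {
        if (!Files.isDirectory(indexDir)) {
            return true; // nothing there yet, so a new index is needed
        }
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(indexDir, "segments*")) {
            return !segments.iterator().hasNext(); // no segments file found: safe to create
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean create = shouldCreate(Paths.get("/index"));
        System.out.println("pass create=" + create + " as the third IndexWriter argument");
    }
}
```

The result would then be passed in place of the literal true, so an existing index is opened for appending instead of being overwritten.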

Add a Document

The next step is to add each Document to your index. To do this, call addDocument() as shown below.

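Document document = new Document();
// "name" is stored so it can be displayed, but it is not indexed and so cannot be searched.
document.add(new Field("name", "Page 1", Field.Store.YES, Field.Index.NO));
// "text" is stored and tokenized (Field.Index.TOKENIZED was renamed ANALYZED in Lucene 2.4).
document.add(new Field("text", "Lorum Ipsum...", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(document);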

Each document can contain as many fields as necessary, and documents within the same index are not required to contain the same fields. For instance, an index containing webpages with fields of title, keywords, and body may also contain employees with fields of name, phone number, birthday, and address. Fields can be tagged with flags to let Lucene know whether they should be indexed, stored, or tokenized. There is a good resource for understanding this here.
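The three flags are easier to keep apart with a small model. The following is a toy sketch in plain Java, not Lucene's Field class: ToyFieldFlags and its methods are hypothetical, and tokenizing here simply means splitting on whitespace. It shows that "stored" controls what you get back with a hit, "indexed" controls what you can search on, and "tokenized" controls whether the value becomes many terms or one.

```java
import java.util.*;

/** Toy model of per-field flags. NOT Lucene's API: all names are hypothetical. */
final class ToyFieldFlags {
    private final Map<String, String> stored = new HashMap<>();      // values returned with a hit
    private final Map<String, Set<String>> terms = new HashMap<>();  // searchable terms per field

    /** Adds a field to this toy document, honouring the three flags. */
    void add(String name, String value, boolean store, boolean index, boolean tokenize) {
        if (store) {
            stored.put(name, value);
        }
        if (index) {
            // Tokenized: one term per whitespace-separated word.
            // Untokenized: the whole value is a single term.
            List<String> fieldTerms = tokenize
                    ? Arrays.asList(value.trim().split("\\s+"))
                    : List.of(value);
            terms.computeIfAbsent(name, k -> new HashSet<>()).addAll(fieldTerms);
        }
    }

    /** True if the field was indexed and contains exactly this term. */
    boolean matches(String field, String term) {
        return terms.getOrDefault(field, Set.of()).contains(term);
    }

    /** Returns the stored value, or null if the field was not stored. */
    String storedValue(String field) {
        return stored.get(field);
    }

    public static void main(String[] args) {
        ToyFieldFlags doc = new ToyFieldFlags();
        doc.add("name", "Page 1", true, false, false);     // stored only
        doc.add("text", "Lorum Ipsum", true, true, true);  // stored, indexed, tokenized
        System.out.println(doc.matches("text", "Ipsum"));  // true
        System.out.println(doc.matches("name", "Page"));   // false: "name" was never indexed
        System.out.println(doc.storedValue("name"));       // Page 1
    }
}
```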

Deleting a Document

To remove documents from an index, you first describe them with a query; every document that matches the query is deleted. Queries will be covered in more detail in subsequent posts.

writer.deleteDocuments(query); 

Lucene allows duplicate documents to be added to an index. This can make updating documents more complex than simply overwriting them. In previous versions of Lucene you had to manually remove the document and then add the updated one back. Since Lucene version 2.2, you can call updateDocument(term, document), where 'term' identifies the document(s) to be deleted (typically a unique id field) and 'document' is the updated document to be added in their place.
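Conceptually, an update is a delete followed by an add. The toy sketch below models that in plain Java. ToyWriter is a hypothetical class whose method names echo the ones discussed here, but the signatures are simplified (a field name and value stand in for a Lucene Term), so treat it as an illustration of the behaviour rather than of the API.

```java
import java.util.*;

/** Toy model of add vs. update. NOT Lucene's API: signatures are simplified. */
final class ToyWriter {
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Adding never checks for an existing copy, so duplicates accumulate. */
    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Removes every document whose field holds exactly this value. */
    void deleteDocuments(String field, String value) {
        docs.removeIf(d -> value.equals(d.get(field)));
    }

    /** Update = delete everything matching the term, then add the new version. */
    void updateDocument(String field, String value, Map<String, String> doc) {
        deleteDocuments(field, value);
        addDocument(doc);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.addDocument(Map.of("id", "1", "text", "first draft"));
        writer.addDocument(Map.of("id", "1", "text", "second draft")); // duplicate id is accepted
        System.out.println(writer.size()); // 2
        writer.updateDocument("id", "1", Map.of("id", "1", "text", "final"));
        System.out.println(writer.size()); // 1
    }
}
```

Note that the update removes both earlier copies, which is why a unique id field is the usual choice for the term.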

A problem that is often encountered is a document not being deleted because its id field is tokenized. deleteDocuments and updateDocument only remove documents containing the exact Term they have been given, and a document with a tokenized id won't contain that Term. The solution is to make sure your document has an id field that is untokenized. See the above section on fields for more information.
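The pitfall is easy to reproduce on paper. In the sketch below, analyze is a deliberately simple stand-in for an Analyzer (an assumption: it lowercases and splits on non-alphanumeric characters, which is not exactly what any particular Lucene Analyzer does), and deleteWouldMatch is a hypothetical helper that mimics exact-term matching.

```java
import java.util.*;

/** Illustrates why an exact-term delete misses a tokenized id. Names are hypothetical. */
final class TokenizedIdPitfall {
    /** Toy analyzer: lowercases and splits on non-alphanumeric characters. */
    static List<String> analyze(String value) {
        List<String> terms = new ArrayList<>();
        for (String term : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** A delete-by-term only hits documents whose indexed terms contain the term exactly. */
    static boolean deleteWouldMatch(List<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    public static void main(String[] args) {
        String id = "Doc-42";
        List<String> tokenized = analyze(id);     // [doc, 42]
        List<String> untokenized = List.of(id);   // [Doc-42]
        System.out.println(deleteWouldMatch(tokenized, id));   // false: "Doc-42" was never indexed as a term
        System.out.println(deleteWouldMatch(untokenized, id)); // true
    }
}
```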

Closing the Writer

Finally, once the changes have been submitted to the IndexWriter, call the following methods:

writer.optimize();
writer.close();

This will optimize the index for searching and make sure all the changes are recorded properly.

Tips and Tricks

  • When adding documents, leave your writer open and only optimize or close it when absolutely necessary. These two methods carry a lot of overhead, so call them sparingly and only when updates must become visible immediately.
  • When first creating an index, make sure that you pass 'true' as the create parameter only if you want to create a brand new index.
  • Adding Documents correctly can be tricky in some cases. Make sure to read up on the different Analyzers and Field types.

Let me know if you have any questions!





About This Diagram
Created By: Vineet Sinha
Time Added: 3 years ago