Friday, 14 October 2016

Lucene: A Search Library


Lucene is a Java library for text search. It lets you run queries over text and returns the results most relevant to the search keywords. Lucene can search content drawn from many sources, such as SQL/NoSQL databases, a filesystem, or even websites.

Search And Index Functionality:
Lucene searches an index rather than the raw text, which is why it is efficient and fast compared with scanning the text directly. The principle is the same as the index page of a book. Suppose you want to find a keyword in a book: instead of reading through the text page by page, you look the word up in the index and go straight to the pages that mention it, which saves a lot of time. This structure, which maps each word to the places where it occurs, is called an inverted index.
Lucene uses the Document as its unit of search and indexing.
An index contains one or more Documents. For example, when the data comes from a database table, each entry (row) of the table is treated as a Lucene Document.
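
To make the book-index idea concrete, here is a minimal sketch of an inverted index in plain Java. It is only an illustration: the class InvertedIndexSketch and its methods are made up for this post and are not part of Lucene, and Lucene's real index format is far more sophisticated.

```java
import java.util.*;

// Toy inverted index: maps each word to the ids of the documents that contain it.
// Hypothetical illustration only; not Lucene's API or its on-disk format.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Split the text on non-alphanumeric characters and record docId under each word.
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look the word up directly instead of scanning every document.
    SortedSet<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a search library");
        index.add(2, "A library built in Java");
        index.add(3, "Search the index, not the text");
        System.out.println(index.search("library")); // prints [1, 2]
        System.out.println(index.search("search"));  // prints [1, 3]
    }
}
```

The lookup cost depends on the number of distinct words, not on the total amount of text, which is the reason an index beats a linear scan.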

A Document contains one or more Fields. A Field commonly found in applications is title: for a title Field, the field name is title and the value is the title of that content item.
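
The Document and Field relationship can be pictured as a small set of name/value pairs. The sketch below uses a hypothetical DocumentSketch class written for this post, not Lucene's own Document class, and it assumes one value per field name to keep things short.

```java
import java.util.*;

// Hypothetical stand-in for a document: an ordered collection of named fields.
// Simplification: each field name holds a single string value.
class DocumentSketch {
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Add a field and return this document so calls can be chained.
    DocumentSketch add(String name, String value) {
        fields.put(name, value);
        return this;
    }

    // Value stored under the field name, or null if the field is absent.
    String get(String name) {
        return fields.get(name);
    }

    // Field names in the order they were added.
    Set<String> fieldNames() {
        return fields.keySet();
    }

    public static void main(String[] args) {
        DocumentSketch doc = new DocumentSketch()
                .add("title", "Lucene: A Search Library")
                .add("body", "Lucene is a library built in Java.");
        System.out.println(doc.fieldNames());  // prints [title, body]
        System.out.println(doc.get("title"));  // prints Lucene: A Search Library
    }
}
```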

Searching requires an index to have already been built. It involves creating a Query (usually via a QueryParser) and handing this Query to an IndexSearcher, which returns a list of Hits.
Lucene has its own mini-language for performing searches. Read more about the Lucene Query Syntax.
The Lucene query language lets the user specify which field(s) to search on, give more weight to certain fields (boosting), perform boolean queries (AND, OR, NOT), and use other functionality.
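
As a rough picture of what field-specific and boolean queries mean, the sketch below evaluates AND, OR and NOT as set operations over a per-field inverted index. Everything in it (the BooleanQuerySketch class and its methods) is hypothetical and written in plain Java; it does not use QueryParser or IndexSearcher, and it leaves out scoring and boosting entirely.

```java
import java.util.*;

// Toy per-field index with boolean operators expressed as set operations.
// Hypothetical illustration; not Lucene's query engine.
class BooleanQuerySketch {
    // field name -> term -> ids of documents whose field contains the term
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
    private final Set<Integer> allDocs = new TreeSet<>();

    void add(int docId, String field, String text) {
        allDocs.add(docId);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(field, f -> new HashMap<>())
                     .computeIfAbsent(term, t -> new TreeSet<>())
                     .add(docId);
            }
        }
    }

    // Documents whose given field contains the term (in the spirit of title:lucene).
    Set<Integer> term(String field, String term) {
        Map<String, Set<Integer>> terms = index.getOrDefault(field, new HashMap<>());
        Set<Integer> hits = terms.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
        return new TreeSet<>(hits); // copy so callers cannot modify the index
    }

    // AND: documents present in both result sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR: documents present in either result set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT: every indexed document that is absent from the result set.
    Set<Integer> not(Set<Integer> a) {
        Set<Integer> result = new TreeSet<>(allDocs);
        result.removeAll(a);
        return result;
    }

    public static void main(String[] args) {
        BooleanQuerySketch idx = new BooleanQuerySketch();
        idx.add(1, "title", "Lucene in Action");
        idx.add(1, "body", "Search with Java");
        idx.add(2, "title", "Java basics");
        idx.add(2, "body", "No search here");
        idx.add(3, "title", "Lucene query syntax");
        idx.add(3, "body", "Boolean queries and boosting");

        // title:lucene AND body:search
        System.out.println(and(idx.term("title", "lucene"), idx.term("body", "search"))); // prints [1]
        // title:lucene OR title:java
        System.out.println(or(idx.term("title", "lucene"), idx.term("title", "java")));   // prints [1, 2, 3]
        // title:lucene AND NOT body:search
        System.out.println(and(idx.term("title", "lucene"), idx.not(idx.term("body", "search")))); // prints [3]
    }
}
```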

How It Works:
When a document is added to the index directory, its text is first captured from the source (files, databases, or even websites), for example by reading it from a Java InputStream. The index is then built from that text. Each page of content is referred to as a Lucene Document.
Lucene's standard tokenizer strips punctuation and breaks the text into individual words, and this is where the actual work starts: many Lucene classes operate on single words.
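
The tokenizing step can be approximated in a few lines of plain Java. This is a simplified stand-in, not Lucene's standard tokenizer: the TokenizerSketch class is hypothetical, it simply splits on anything that is not a letter or a digit, and the lowercasing is an extra step added here for convenience.

```java
import java.util.*;

// Simplified tokenizer: drops punctuation and whitespace, keeps single words.
// Hypothetical illustration; Lucene's own tokenizer follows more detailed rules.
class TokenizerSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Fast, full-text search: built in Java!"));
        // prints [fast, full, text, search, built, in, java]
    }
}
```

Each token produced this way is what ends up as a key in the inverted index.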

Features:
1. High-performance indexing
2. Low RAM requirements
3. Index size is roughly 20-30% of the size of the indexed text