Monday, May 14, 2012

Lucene.NET

[Photo: overview of Lucene.NET]

I saw Chander Dhall of Ria Consulting speak at the Austin.NET User's Group tonight. He spoke on Lucene.NET, which is a .NET port of the Java search library Lucene. I learned:

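  1. Solr and Elasticsearch (which is said to scale better) both run on top of Lucene, which is itself an Apache Software Foundation project rather than something that runs inside an Apache server. Lucene.NET is a port of that library shipped as a .dll: you reference it from C# and it does the indexing and searching in-process, so there is no Apache-running piece to talk to.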
  2. Lucene can keep an index entirely in RAM (a RAMDirectory) instead of only on disk in databases or flat files, which makes it really fast.
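  3. The configuration for searches is very capable. Phrase Queries match a sequence of terms, and with a "slop" setting the terms only have to be near one another. Fuzzy Queries match terms that are spelled similarly (by edit distance), so a misspelled search term still gets hits. That is not the same as phonetic matching, which would need something else, such as phonetic encoding at index time.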
  4. .cfs extension files are index files (Lucene's compound file format), built when content is indexed.
  5. The auto suggest feature at a search field could be driven by reading the terms out of existing indexes and creating auto suggest-specific indexes. Some of Chander's code here uses an IndexWriter.
    [Photo: some of Chander's code... he is making an index of a par...]
    Chander suggested that one may game auto suggest to make the product you really want to sell appear first in a search of products. There is a door open for this sort of manipulation with Lucene.
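  6. Searches walk the index, and common words like "and" or "the" show up in nearly every document, so it is not wise to search against them.
    [Photo: "the" and "and" are killers!!!]
    Luckily, Lucene will compensate for this problem for you if you'll let it: analyzers such as StandardAnalyzer can drop stop words like these when indexing and when parsing a query.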
  7. Don't put sensitive data like the cost of a product to your company in Lucene. Anything that needs to be hidden by security is something that Lucene shouldn't be concerned with to begin with. You can secure this sort of stuff in Lucene, but it will just make searches slower.
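  8. Apache Tika is for extracting text and metadata from rich document formats (PDF, Word, and so on) so that the content can be indexed.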
  9. Erik Hatcher is someone to find on YouTube for how-to stuff on Lucene.
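
To make items 3 and 6 a little more concrete, here is a toy sketch in Python. It is not Lucene's API and it is not Chander's code; every name in it (STOP_WORDS, build_index, fuzzy_search, phrase_search) is made up for illustration. It assumes a tiny stop-word list, treats a fuzzy match as "within one edit" by Levenshtein distance, and uses a simplified, order-preserving notion of slop, which is looser than what Lucene really does.

```python
import re

# Hypothetical stop-word list for illustration; real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_index(docs):
    """Build a toy inverted index: term -> {doc_id: [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def fuzzy_search(index, term, max_edits=1):
    """Doc ids containing any indexed term within max_edits of `term`."""
    hits = set()
    for indexed_term, postings in index.items():
        if edit_distance(term, indexed_term) <= max_edits:
            hits.update(postings)
    return hits

def phrase_search(index, first, second, slop=0):
    """Doc ids where `second` follows `first` with at most `slop` terms between."""
    hits = set()
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(doc_id, [])
        if any(0 < q - p <= slop + 1
               for p in first_positions for q in second_positions):
            hits.add(doc_id)
    return hits

docs = {
    1: "The quick brown fox",
    2: "A quick and very clever fox",
    3: "The lazy dog",
}
index = build_index(docs)

print(fuzzy_search(index, "quik"))                   # misspelling still matches "quick"
print(phrase_search(index, "quick", "fox", slop=1))  # allows one word in between
print("the" in index)                                # stop words never reach the index
```

The point of the sketch is only the shape of the idea: stop words are filtered before anything is indexed, a fuzzy lookup compares spellings rather than sounds, and slop loosens how close the words of a phrase must be.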
