Solr and Lucene

I started learning Solr recently. Here are some of the key points (from here):

  • Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP APIs, caching, replication, and a web administration interface.
  • It's an Information Retrieval Application.
  • Index/Query Via HTTP
  • Scalability – Efficient Replication To Other Solr Search Servers
  • Highly Configurable And User Extensible Caching

Solr Configuration:

  • schema.xml: Where the data is described
  • solrconfig.xml: Where you describe how people can interact with the data

Loading Data:

  • Documents can be added, deleted or replaced.
  • Message Transport: HTTP POST
  • Message Format: XML
  • Example:

<add><doc>
<field name="id">SOLR</field>
<field name="name">Apache Solr</field></doc></add>
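As a rough sketch of what sending that message could look like from a client, the snippet below builds the same <add> document and wraps it in an HTTP POST. The endpoint URL (http://localhost:8983/solr/update) and the helper names are assumptions made for illustration; adjust them to your own installation. The request is only prepared here, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Build an <add><doc>...</doc></add> message from a dict of field values."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    # Serialize to text first, then encode, so the body is plain UTF-8 bytes.
    return ET.tostring(add, encoding="unicode").encode("utf-8")


def build_update_request(update_url, fields):
    """Prepare (but do not send) an HTTP POST carrying the add message."""
    return urllib.request.Request(
        update_url,
        data=build_add_message(fields),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Assumed endpoint; host, port, and path depend on how Solr is deployed.
request = build_update_request(
    "http://localhost:8983/solr/update",
    {"id": "SOLR", "name": "Apache Solr"},
)
print(request.data.decode("utf-8"))

# To actually send it against a running server:
# urllib.request.urlopen(request)
```

Added documents typically become searchable only after a commit, so a real client would usually follow up with a <commit/> message posted the same way.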

Querying Data:

  • Transport Protocol: HTTP GET
  • Example:

http://solr/select?q=electronics
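Since a query is just a GET request with URL parameters, the main thing a client has to get right is the encoding. Here is a minimal sketch that builds such a URL; build_select_url is a hypothetical helper, and the host "solr" is simply taken from the example above.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, **params):
    """Return a GET URL for the select handler with URL-encoded parameters."""
    query_string = urlencode({"q": query, **params})
    return f"{base_url}/select?{query_string}"


url = build_select_url("http://solr", "electronics")
print(url)  # http://solr/select?q=electronics

# Spaces and extra parameters are encoded as well
# (rows is the standard Solr parameter limiting the number of results).
print(build_select_url("http://solr", "apache solr", rows=5))
# http://solr/select?q=apache+solr&rows=5
```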

schema.xml:

Defines the options for each field:

  • Is it a number? A string? A date?
  • Is there a default value for documents that don't have one?
  • Is it created by combining the values of other fields?
  • Is it stored for retrieval?
  • Is it indexed? If so is it parsed? If so how?
  • Is it a unique identifier?
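To make those questions concrete, here is a small sketch that reads a schema fragment and answers them per field. The fragment itself is hypothetical and written only for this illustration; the element and attribute names it uses (field, type, indexed, stored, default, copyField, uniqueKey) are the ones schema.xml uses for these options. How an indexed field is parsed is decided by its type, which is covered in the sections that follow.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment, not taken from a real installation.
SCHEMA = """
<schema name="example" version="1.1">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="false"/>
    <field name="price" type="sfloat" indexed="true" stored="true" default="0"/>
    <field name="all_text" type="text" indexed="true" stored="false" multiValued="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <copyField source="title" dest="all_text"/>
</schema>
"""


def describe_fields(schema_xml):
    """Answer the schema questions for every <field> in the fragment."""
    root = ET.fromstring(schema_xml.strip())
    unique_key = root.findtext("uniqueKey")

    # Map each destination field to the fields whose values are copied into it.
    copy_sources = {}
    for copy_field in root.iter("copyField"):
        dest = copy_field.get("dest")
        copy_sources.setdefault(dest, []).append(copy_field.get("source"))

    summary = {}
    for field in root.iter("field"):
        name = field.get("name")
        summary[name] = {
            "type": field.get("type"),                   # number, string, date?
            "default": field.get("default"),             # default value, if any
            "copied_from": copy_sources.get(name, []),   # combined from others?
            "stored": field.get("stored") == "true",     # stored for retrieval?
            "indexed": field.get("indexed") == "true",   # searchable?
            "unique": name == unique_key,                # unique identifier?
        }
    return summary


summary = describe_fields(SCHEMA)
for name, options in summary.items():
    print(name, options)
```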

Fields:

  • <field>: Describes How You Deal With Specific Named Fields
  • Example:

<field name="title" type="text" stored="false" />

Field Type:

  • The Underlying Storage Class (FieldType)
  • The Analyzer To Use For Parsing If It Is A Text Field
  • Example:

<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true" />

Analyzer:

  • ‘Analyzer’ Is A Core Lucene Class For Parsing Text
  • Example:

<fieldType name="text_greek" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>

Tokenizers And TokenFilters:

  • Analyzers Are Typically Comprised Of Tokenizers And TokenFilters
  • Tokenizer: Controls How Your Text Is Tokenized
  • TokenFilter: Mutates And Manipulates The Stream Of Tokens
  • Solr Lets You Mix And Match Tokenizers and TokenFilters In Your schema.xml To Define Analyzers On The Fly
  • Example:

<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
</analyzer>
</fieldType>
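To get a feel for what the index and query analyzers in that example do, here is a toy simulation of the chain. It is not how Solr or Lucene implement it (token positions and offsets are ignored, and all function names are made up for this sketch); it only mimics a whitespace tokenizer followed by a synonym filter. The synonym list stands in for a hypothetical synonyms.txt.

```python
def whitespace_tokenizer(text):
    """Split text on runs of whitespace."""
    return text.split()


def load_synonyms(lines):
    """Parse comma-separated synonym groups into a term -> group lookup."""
    table = {}
    for line in lines:
        group = [term.strip() for term in line.split(",") if term.strip()]
        for term in group:
            table[term] = group
    return table


def synonym_filter(tokens, synonyms, expand=True):
    """Replace each token by its synonym group (or the group's first term)."""
    for token in tokens:
        group = synonyms.get(token)
        if group is None:
            yield token          # no synonyms known, pass through unchanged
        elif expand:
            yield from group     # expand to every term in the group
        else:
            yield group[0]       # reduce to the first term in the group


def analyze(text, tokenizer, filters=()):
    """Run a tokenizer and then each filter over the token stream, in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)


# Hypothetical content of synonyms.txt.
SYNONYMS = load_synonyms(["tv,television"])

# Index-time analyzer: tokenizer only.
index_tokens = analyze("flat television stand", whitespace_tokenizer)

# Query-time analyzer: tokenizer plus synonym expansion.
query_tokens = analyze(
    "tv stand",
    whitespace_tokenizer,
    [lambda tokens: synonym_filter(tokens, SYNONYMS, expand=True)],
)

print(index_tokens)  # ['flat', 'television', 'stand']
print(query_tokens)  # ['tv', 'television', 'stand']
```

Because the query side expands "tv" to include "television", a search for "tv stand" can match a document that was indexed with only the word "television".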

solrconfig.xml:

This is where you configure options for how this Solr instance should behave.

  • Low-Level Index Settings
  • Performance Settings (Cache Sizes, etc.)
  • Types of Updates Allowed
  • Types of Queries Allowed

Note:

  • solrconfig.xml depends on schema.xml.
  • schema.xml does not depend on solrconfig.xml.