More on Solr!

Some Notes from the Book “Solr 1.4 Enterprise Search Server”:

Solr Index:
An index is basically like a single-table database schema. Imagine a massive spreadsheet, if you will. In spite of this limitation, there is nothing to stop you from putting different types of data (say, artists and tracks from MusicBrainz) into a single index, thereby mitigating the limitation in effect. All you have to do is use different fields for the different document types, and use a field to discriminate between the types. An identifier field would need to be unique across all documents in this index, no matter the type, so you could easily do this by concatenating the document type and the entity’s identifier. This may appear really ugly from a relational database design standpoint, but this isn’t a database.

Single Combined Index:
<field name="id" … /> <!-- example: "artist:534445" -->
<field name="type" … /> <!-- … -->
<field name="name" … />
<!-- track fields: -->
<field name="PUID" … />
<field name="num" … /> <!-- i.e. the track # on the release -->
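To make the combined-index idea concrete for myself, here is a minimal Ruby sketch of how an application might build such documents before sending them to Solr. The build_document helper and the sample names are my own assumptions for illustration; they are not from the book.

```ruby
# Hypothetical helper: build a document hash for any entity type.
# "type" discriminates between document types, and "id" is made unique
# across the whole index by prefixing the entity's own identifier.
def build_document(type, entity_id, fields = {})
  { "id" => "#{type}:#{entity_id}", "type" => type }.merge(fields)
end

artist = build_document("artist", 534445, "name" => "Some Artist")
track  = build_document("track", 534445, "name" => "Some Track", "num" => 3)

puts artist["id"]
puts track["id"]
```

Even though the artist and the track share the same numeric identifier here, their id values differ because of the type prefix.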

Problems with that?

  • There may be namespace collision problems unless you prefix the field names by type such as: artist_startDate and track_PUID.
  • If you share the same field for different things (like the name field in the example that we have just seen), then there are some problems that can occur when using that field in a query and while filtering documents by document type.
  • Prefix, wildcard, and fuzzy queries will take longer and will be more likely to reach internal scalability thresholds.
  • Committing changes to a Solr index invalidates the caches used to speed up querying. If this happens often, and the changes are usually to one type of entity in the index, then you will get better query performance by using separate indices.
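The first point in the list above (prefixing field names by type) can be sketched in a few lines of Ruby. The prefix_fields helper and the choice of which fields are shared are assumptions made for this example only.

```ruby
# Fields that every document type shares and that should not be prefixed
# (an assumption for this sketch).
SHARED_FIELDS = %w[id type].freeze

# Hypothetical helper: prefix type-specific field names with the document
# type so that fields of different types cannot collide in one index.
def prefix_fields(type, fields)
  fields.each_with_object({}) do |(name, value), doc|
    key = SHARED_FIELDS.include?(name) ? name : "#{type}_#{name}"
    doc[key] = value
  end
end

p prefix_fields("artist", "id" => "artist:1", "type" => "artist", "startDate" => "1990-01-01")
p prefix_fields("track", "PUID" => "abc")
```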

Schema Design:
While doing schema design, a key thing to come to grips with is that a Solr schema strategy is driven by how it is queried and not by a standard third normal form decomposition of the data.

  • First determine which searches are going to be powered by Solr.
  • Second determine the entities returned for each search.
  • For each entity type, find all of the data in the schema that will be needed across all searches of it. By “all searches of it,” I mean that there might actually be multiple search forms, as identified in Step 1. Such data includes any data queried for (that is, criteria to determine whether a document matches or not) and any data that is displayed in the search results.
  • If any data shown in the search results is not queried, sorted, faceted, or highlighted on, and no other Solr feature uses the field except to return it in search results, then it is not necessary to include it in the schema for this entity. Let’s say, for the sake of argument, that the only information queryable, sortable, and so on is a track’s name, when doing a query for tracks. You can opt not to inline the artist name, for example, into the track entity. When your application queries Solr for tracks and needs to render search results with the artist’s name, the onus would be on your application to get this data from somewhere; it won’t be in the search results from Solr. The application might look it up in a database, or perhaps even query Solr for its own artist entity if one is there, or look somewhere else.
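The last point can be sketched like this: the track documents coming back from Solr carry no artist name, so the application fills it in itself. ARTIST_NAMES stands in for a database (or a second Solr query), and the artist_id field is a hypothetical field I made up for the example.

```ruby
# Stand-in for a database table or a separate artist lookup (assumption).
ARTIST_NAMES = { 11 => "Artist A", 12 => "Artist B" }.freeze

# Render track search results, resolving the artist name outside of Solr.
def render_results(track_docs, artist_names)
  track_docs.map do |doc|
    artist = artist_names.fetch(doc["artist_id"], "(unknown artist)")
    "#{doc["name"]} by #{artist}"
  end
end

tracks = [
  { "name" => "First Song", "artist_id" => 11 },
  { "name" => "Second Song", "artist_id" => 99 }
]
puts render_results(tracks, ARTIST_NAMES)
```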

Field Types:

The first section of the schema is the definition of the field types. In other words, these are the data types. This section is enclosed in the <types/> tag and will consume lots of the file’s content. The field types declare the types of fields, such as booleans, numbers, dates, and various text flavors.

Using copyField:

Closely related to the field definitions are copyField directives, which are specified at some point after the fields element, not within it. A copyField directive looks like this:
<copyField source="r_name" dest="r_name_sort" />
These are really quite simple. At index-time, each copyField is evaluated for each input document.
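To picture what "evaluated for each input document" means, here is a rough Ruby simulation of the directive above. This only imitates the behaviour on a plain hash; it is not how Solr implements it, and apply_copy_fields is a name I invented.

```ruby
# The copyField directive from the notes, written as data.
COPY_FIELDS = [{ source: "r_name", dest: "r_name_sort" }].freeze

# For each rule, copy the incoming value of the source field into the
# destination field. The input document itself is not modified.
def apply_copy_fields(doc, copy_fields = COPY_FIELDS)
  copy_fields.each_with_object(doc.dup) do |rule, result|
    next unless doc.key?(rule[:source])
    result[rule[:dest]] = doc[rule[:source]]
  end
end

p apply_copy_fields("r_name" => "Abbey Road")
```

The destination field can then have its own field type, for example one better suited to sorting than the tokenized source field.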

Solr and Lucene

I started learning Solr recently. Here are some of the key points:

  • Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP APIs, caching, replication, and a web administration interface.
  • It’s an information retrieval application.
  • Index/Query Via HTTP
  • Scalability – Efficient Replication To Other Solr Search Servers
  • Highly Configurable And User Extensible Caching

Solr Configuration:

  • schema.xml: Where the data is described
  • solrconfig.xml: Where it is described how people can interact with the data

Loading Data:

  • Documents can be added, deleted or replaced.
  • Message Transport: HTTP POST
  • Message Format: XML
  • Example:

<add><doc>
<field name="id">SOLR</field>
<field name="name">Apache Solr</field></doc></add>
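As a small exercise, the same add message can be generated from a Ruby hash. This is only a sketch of the message format shown above; add_message and xml_escape are my own helper names, and actually posting the result over HTTP is left out.

```ruby
# Characters that must be escaped inside XML text and attribute values.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Build an <add> message containing a single document.
def add_message(doc)
  fields = doc.map do |name, value|
    %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
  end
  "<add><doc>#{fields.join}</doc></add>"
end

puts add_message("id" => "SOLR", "name" => "Apache Solr")
```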

Querying Data:

  • Transport Protocol: HTTP GET
  • Example:

http://solr/select?q=electronics
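The query URL above can be assembled with Ruby's standard uri library, which takes care of escaping the parameters. The host name "solr" is just the placeholder from the example, and select_url is a hypothetical helper.

```ruby
require "uri"

# Build a select URL for the given host and query parameters.
def select_url(host, params)
  URI::HTTP.build(host: host, path: "/select", query: URI.encode_www_form(params)).to_s
end

puts select_url("solr", q: "electronics")
# A filter on a document-type field could be added the same way:
puts select_url("solr", q: "love", fq: "type:artist")
```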

schema.xml:

Decides the options for each field:

  • Is it a number? A string? A date?
  • Is there a default value for documents that don’t have one?
  • Is it created by combining the values of other fields?
  • Is it stored for retrieval?
  • Is it indexed? If so is it parsed? If so how?
  • Is it a unique identifier?

Fields:

  • <field> Describes How You Deal With Specific Named Fields
  • Example:

<field name="title" type="text" stored="false" />

Field Type:

  • The Underlying Storage Class (FieldType)
  • The Analyzer To Use Or Parsing If It Is A Text Field
  • Example:

<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true" />

Analyzer:

  • ‘Analyzer’ Is A Core Lucene Class For Parsing Text
  • Example:

<fieldType name="text_greek" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>

Tokenizers And TokenFilters:

  • Analyzers Are Typically Composed Of Tokenizers And TokenFilters
  • Tokenizer: Controls How Your Text Is Tokenized
  • TokenFilter: Mutates And Manipulates The Stream Of Tokens
  • Solr Lets You Mix And Match Tokenizers and TokenFilters In Your schema.xml To Define Analyzers On The Fly
  • Example:

<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
</analyzer>
</fieldType>
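To get a feel for what the tokenizer and the filter each contribute, here is a toy Ruby imitation of the two analyzers in the example. It is only a sketch: the SYNONYMS table stands in for synonyms.txt (whose real contents I don't know), and the real Solr classes do much more.

```ruby
# Hypothetical stand-in for the contents of synonyms.txt.
SYNONYMS = { "tv" => ["tv", "television"] }.freeze

# Tokenizer: split the text on whitespace.
def whitespace_tokenize(text)
  text.split
end

# TokenFilter: replace each token by its list of synonyms, if it has one.
def expand_synonyms(tokens, synonyms = SYNONYMS)
  tokens.flat_map { |token| synonyms.fetch(token, [token]) }
end

# Index-time analyzer: tokenizer only.
def analyze_for_index(text)
  whitespace_tokenize(text)
end

# Query-time analyzer: tokenizer followed by the synonym filter.
def analyze_for_query(text)
  expand_synonyms(whitespace_tokenize(text))
end

p analyze_for_index("cheap tv")
p analyze_for_query("cheap tv")
```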

solrconfig.xml:

This is where you configure options for how this Solr instance should behave.

  • Low-Level Index Settings
  • Performance Settings (Cache Sizes, etc…)
  • Types of Updates Allowed
  • Types of Queries Allowed

Note:
  • solrconfig.xml depends on schema.xml.
  • schema.xml does not depend on solrconfig.xml.

Projects & Libraries – Beginning Ruby Guide!

  • load and require both load the named file. The difference is that load runs the file every time it is called, whereas require loads it only once. require searches a list of directories for the file; that list is held in $LOAD_PATH (also available as $:). Since Ruby 1.9.2 the current directory is no longer on that list by default. We can print the list:
    $:.each {|fname| puts fname}

    To add another directory to it:

     $:.push '/your/directory/here'
     require 'yourfile'
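A quick way to see the load/require difference is to count how many times a file actually runs. This sketch writes a throwaway file into a temporary directory; the file name counter_lib.rb and the global $times_loaded are made up for the demonstration.

```ruby
require "tmpdir"

$times_loaded = 0

Dir.mktmpdir do |dir|
  path = File.join(dir, "counter_lib.rb")
  # The file's only job is to record that it has been executed.
  File.write(path, "$times_loaded += 1\n")

  require path  # first require: the file runs
  require path  # already required: the file does not run again
  load path     # load always runs the file
  load path     # and runs it again
end

puts $times_loaded
```

The file runs once for the two require calls and once for each load call.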

PART 2: Classes/Objects/Modules – Beginning Ruby Guide!

  • Classes: A class is a collection of methods and data that are used as a blueprint to create multiple objects relating to that class.
  • Objects: An object is a single instance of a class. An object of class Person is a single person. An object of class Dog is a single dog. If you think of objects as real-life objects, a class is the classification, whereas an object is the actual object or “thing” itself.
  • Local variable: A variable that can only be accessed and used from the current scope.
  • Instance/object variable: A variable that can be accessed and used from the scope of a single object. An object’s methods can all access that object’s object variables.
  • Global variable: A variable that can be accessed and used from anywhere within the current program.
  • Class variable: A variable that can be accessed and used within the scope of a class and all of its child objects.
  • Encapsulation: The concept of allowing methods to have differing degrees of visibility outside of their class or associated object.
  • Polymorphism: The concept of methods being able to deal with different classes of data and offering a more generic implementation (as when different classes each offer a method with the same name).
  • Module: An organizational element that collects together any number of classes, methods, and constants into a single namespace.
  • Namespace: A named element of organization that keeps classes, methods, and constants from clashing.
  • Mix-in: A module that can mix its methods in to a class to extend that class’s functionality.
  • Enumerable: A mix-in module, provided as standard with Ruby, that implements iterators and list-related methods for other classes, such as collect, map, min, and max. Ruby uses this module by default with the Array and Hash classes.
  • Comparable: A mix-in module, provided as standard with Ruby, that implements comparison operators (such as <, <=, and >) on classes that implement the generic comparison operator <=>.
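The two mix-ins at the end of the list are easier to remember with a small example. The Track and Release classes below are my own illustration (not from the book): each defines just one method, and the mix-in supplies the rest.

```ruby
class Track
  include Comparable
  attr_reader :name, :num

  def initialize(name, num)
    @name = name
    @num = num
  end

  # Comparable builds <, <=, ==, >, >= and between? on top of <=>.
  def <=>(other)
    num <=> other.num
  end
end

class Release
  include Enumerable

  def initialize(*tracks)
    @tracks = tracks
  end

  # Enumerable builds map, min, max, sort and more on top of each.
  def each(&block)
    @tracks.each(&block)
  end
end

release = Release.new(Track.new("Intro", 1), Track.new("Finale", 9), Track.new("Middle", 5))
puts release.max.name
puts release.map(&:name).inspect
puts(Track.new("a", 1) < Track.new("b", 2))
```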

PART 1: Basics – Beginning Ruby Guide!

I started reading the book “Beginning Ruby Guide” today. Here are some important points to keep in mind (from the PART 1 section of the book):

  • Everything in Ruby is an object: When you write a simple sum such as 2 + 2, you expect the computer to add two numbers together to make 4. In its object-oriented way, Ruby considers the two numbers (2 and 2) to be number objects. 2 + 2 is then merely shorthand for asking the first number object to add the second number object to itself. In fact, the + sign is actually an addition method! (It’s true, 2.+(2) will work just fine!) You can prove that everything in Ruby is an object by asking the things which class they’re a member of.

  • Kernel is a special module whose methods are made available in every class and scope throughout Ruby. You’ve already used a key method provided by Kernel, for example puts.

  • Concept of subroutine: even though almost everything in Ruby is an object, you can use Ruby in the same way as a non–object-oriented language if you like, even if it’s less than ideal. For example, you can just define a method and use it without associating it with an object.
  • Interesting comparison operator: x <=> y (returns 0 if x and y are equal, 1 if x is higher, and -1 if y is higher).
  • You’ll be using this style ( 5.times { puts "Test" } ) for single lines of code from here on, but will be using do and end for longer blocks of code. This is a good habit to pick up, as it’s the style nearly all professional Ruby developers follow.
  • Constants start with a capital letter and are not meant to be changed once declared (Ruby prints a warning if you reassign one).
  • Text: to write a string that spans multiple lines, you can do: x = %q{This is a multiline string}. Another way is a here document:
    x = <<END_MY_STRING_PLEASE
    This is the string
    And a second line
    END_MY_STRING_PLEASE
  • You can embed expressions (and even logic) directly into strings. This process is called interpolation. In this situation, interpolation refers to the process of inserting the result of an expression into a string literal. The way to interpolate within a string is to place the expression within #{ and } symbols. An even more basic example demonstrates:

           puts "100 * 5 = #{100 * 5}"
  • The sub method substitutes once, whereas gsub substitutes every occurrence. Example: puts "this is a test".gsub('i', ''). Read more on regular expressions later.
  • For an array, x << 4 and x.push(4) do the same thing!
  • [1, 2, 3, 4].collect { |element| element * 2 } : collect iterates through an array element by element and builds a new array from the result of the code block for each element; the original array is left unchanged. In this example, you multiply the value of each element by 2.
  • Hash: x.each { |key, value| puts "#{key} equals #{value}" }. Note**: In Ruby 1.8, there is no guarantee that elements will be returned in a specific order. In Ruby 1.9, however, the order in which the elements were inserted into the hash will be remembered, and each will return them in that order.
  • Code Blocks: each_vowel is a method that accepts a code block, as designated by the ampersand (&) before the variable name code_block in the method definition. It then iterates over each vowel in the literal array %w{a e i o u} and uses the call method on code_block to execute the code block once for each vowel, passing in the vowel variable as a parameter each time.
    Note** Code blocks passed in this way result in objects that have many methods of their own, such as call. Remember, almost everything in Ruby is an object! (Many elements of syntax are not objects, nor are code blocks in their literal form.)

    def each_vowel(&code_block)
     %w{a e i o u}.each { |vowel| code_block.call(vowel) }
    end
    each_vowel { |vowel| puts vowel }
    
    Another alternative is to use yield:
    def each_vowel
     %w{a e i o u}.each { |vowel| yield vowel }
    end
  • It’s also possible to store code blocks within variables, using the lambda method:

           print_parameter_to_screen = lambda { |x| puts x }
           print_parameter_to_screen.call(100)
  • Other languages often have limitations on the size of numbers that can be represented. Commonly this is 32 binary bits, resulting in a limit on values to roughly 4.2 billion in languages that enforce 32-bit integers. Most operating systems and computer architectures also have similar limitations. Ruby, on the other hand, seamlessly converts between numbers that the computer can handle natively (that is, with ease) and those that require more work. It does this with different classes, one called Fixnum that represents easily managed smaller numbers, and another, aptly called Bignum, that represents “big” numbers. (Since Ruby 2.4, both are unified under the single Integer class.)
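A few of the points from the list above, gathered into one small script that can be pasted into irb. The constant names are my own; the class reported for numbers depends on the Ruby version (Integer on current versions, Fixnum or Bignum on older ones), so I have not written down an expected value for it.

```ruby
# + is a method call on the first number.
puts(2.+(2))
puts 2.class

# <=> returns -1, 0 or 1.
puts(1 <=> 2)

# collect returns a new array and leaves the original alone.
NUMBERS = [1, 2, 3, 4].freeze
DOUBLED = NUMBERS.collect { |element| element * 2 }
puts DOUBLED.inspect

# sub replaces the first match, gsub replaces all of them.
puts "this is a test".sub("i", "")
puts "this is a test".gsub("i", "")

# Integers grow past 64 bits without any special handling.
BIG = 2**64
puts BIG
```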

Notes: Ruby on Rails (Basics)

Notes taken from the book: Rails 3 in Action

Basics:

  • Ruby on Rails is a framework built on the Ruby language.
  • The Ruby language was created back in 1993 by Yukihiro “Matz” Matsumoto.
  • Ruby on Rails was created in 2004 by David Heinemeier Hansson during the development of 37signals’ flagship product: Basecamp. When Rails was needed for other 37signals projects, the team extracted the Rails code from it.
  • Ruby on Rails allows for rapid development of applications by using a concept known as convention over configuration.
  • The core features of Rails are a conglomerate of many different parts called Railties (when said aloud it rhymes with “bowties”), such as Active Record, Active Support, Action Mailer, and Action Pack.
  • MVC in Rails is aided by REST, a routing paradigm. Representational State Transfer (REST) is the convention for routing in Rails. When something adheres to this convention, it’s said to be RESTful. Routing in Rails refers to how requests are routed within the application itself.

First Application:

  • Use RVM (http://rvm.beginrescueend.com) to install Ruby and RubyGems.
  • Generate an application: rails new things_i_bought
  • Starting the application: cd things_i_bought -> bundle install -> rails server
  • Generate a scaffold: rails generate scaffold purchase name:string cost:float
  • Migrations: Migrations are used in Rails as a form of version control for the database, providing a way to implement incremental changes to the schema of the database. Each migration is timestamped right down to the second, which provides you (and anybody else developing the application with you) an accurate timeline of your database.
  • respond_to: a method that defines what formats this action responds to. Here, the controller responds to the html and xml formats. The html method here isn’t given a block and so will render the template from app/views/purchases/new.html.erb, whereas the xml method, which is given a block, will execute the code inside the block and return an XML version of the @purchase object.
  • A flash message is a message that can be displayed on the next request.
  • You can add validations to your model to ensure that the data conforms to certain rules or that data for a certain field must be present or that a number you enter must be above a certain other number. For example:
    validates_presence_of :name
    validates_numericality_of :cost, :greater_than => 0
  • Routing: The config/routes.rb file of every Rails application is where the application routes are defined. Inside the block for the draw method(in routes.rb) is the resources method. Collections of similar objects in Rails are referred to as resources. This method defines the routes and routing helpers.
  • By using the routing helpers introduced in Rails 2 and still available in Rails 3.1, you can have much shorter link_to calls in your application, increasing the readability of the code throughout.
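To remind myself what the two validations in the list above actually check, here is a plain-Ruby stand-in. This is not Rails code and not how Active Record implements validations; the Purchase struct and its error messages are assumptions for illustration.

```ruby
# Plain-Ruby imitation of a model with a presence check on name and a
# "numeric and greater than 0" check on cost.
Purchase = Struct.new(:name, :cost) do
  def errors
    list = []
    list << "Name can't be blank" if name.nil? || name.to_s.strip.empty?
    list << "Cost must be greater than 0" unless cost.is_a?(Numeric) && cost > 0
    list
  end

  def valid?
    errors.empty?
  end
end

puts Purchase.new("Shoes", 90.0).valid?
puts Purchase.new("", -1).errors.inspect
```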

Finally managed to finish the first chapter with a test application running on my iMac 🙂

Hadoop & MapReduce: First Look!

I have started reading the book, Hadoop, The Definitive Guide.
Here are the notes from the first chapter: Meet Hadoop:

  • Let’s Talk About How Much Data Is Flowing Around:
    • The New York Stock Exchange generates about one terabyte of new trade data every day!
    • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage!
  • What Hadoop Provides:
    • A reliable shared storage and analysis system. The storage is provided by HDFS and the analysis by MapReduce. There are other parts of Hadoop, but these two are its kernel.
  • What is MapReduce:
    • MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole data set and get the result in a reasonable time is transformative.
    • Traditional RDBMSs rely on B-Trees, which are limited by disk seek time. For updating the majority of a database, a B-Tree is much less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
    • MapReduce is very good for ad hoc analysis.
    • MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. The input keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
    • MapReduce is a linearly scalable programming model. The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they are operating on, so they can be applied unchanged to a small data set or a massive one.
    • MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy.
  • Where Hadoop Stands Now:
    • As of May 2009, Hadoop sorted one terabyte in 62 seconds, breaking the world record for sorting!
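The "two functions" idea from the MapReduce notes can be tried out in memory with plain Ruby. This is only a toy sketch of the programming model, not Hadoop: the input format ("year,temperature" lines) and the function names are my own assumptions, and the grouping step stands in for the sort/merge that the framework performs between map and reduce.

```ruby
# map: one input record -> a list of [key, value] pairs.
def map_record(line)
  year, temperature = line.split(",")
  [[year, Integer(temperature)]]
end

# reduce: one key plus all of its values -> one [key, value] pair.
def reduce_values(year, temperatures)
  [year, temperatures.max]
end

# Run map over every record, group the pairs by key, then reduce each group.
def run_job(lines)
  pairs = lines.flat_map { |line| map_record(line) }
  grouped = pairs.group_by(&:first)
  grouped.sort.map do |year, year_pairs|
    reduce_values(year, year_pairs.map(&:last))
  end
end

p run_job(["1950,0", "1950,22", "1949,111", "1949,78"])
```

Neither function knows how many records or machines are involved, which is what lets the framework spread the work out.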

I am very happy that I am starting to learn such a wonderful technology!!

String replace and Lightbox

Things Learned:
To substitute a value into a string in Ruby, we can write "abc %s hij" % "efg". The problem is that this does not handle multiple placeholders from a single value: "abc %s hij %s" % "efg" fails because the format string needs two arguments (passing an array, as in "abc %s hij %s" % ["efg", "xyz"], does work). So I had to use gsub. I wasn’t happy about it, as gsub has to be called repeatedly for multiple string replacements. I was looking for something like "abc %1 hij %2 hjk %1" % "efg", "gtl", where %1 would be replaced with 'efg' and %2 with 'gtl'; similar things can be done in Java. Ruby doesn’t accept that exact form, but its format strings do support positional arguments written as %1$s and %2$s, with the values passed as an array.
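Here is a short sketch of the format-string options in Ruby, using only the built-in String#% operator. The fill_template wrapper is just a name I chose to keep the three calls side by side.

```ruby
# Thin wrapper around String#% so the variants are easy to compare.
def fill_template(template, values)
  template % values
end

# Several %s placeholders need an array of values.
puts fill_template("abc %s hij %s", ["efg", "xyz"])

# Positional arguments: %1$s is the first value, %2$s the second.
puts fill_template("abc %1$s hij %2$s hjk %1$s", ["efg", "gtl"])

# Named placeholders with a hash of values.
puts fill_template("abc %{a} hij %{b} hjk %{a}", { a: "efg", b: "gtl" })
```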

Using the 410 status code: when you have a page which used to exist but has been deleted, then you would rather use HTTP status code 410 Gone than 404 Not Found.

New things I have heard of: Lightbox, I never used it, but I just learned about it today. From the website: http://www.huddletogether.com/projects/lightbox/, here are some details:

“Lightbox JS is a simple, unobtrusive script used to overlay images on the current page. It’s a snap to setup and works on all modern browsers.”

– Include lightbox.js in your header.

– Add a rel="lightbox" attribute to any link tag to activate the lightbox. For example:

<a href="images/image-1.jpg" rel="lightbox">image #1</a>

Optional: Use the title attribute if you want to show a caption.

What I learned!

Sorry to myself, I had a flow for writing and then it stopped 😦 I need to work on it. I have been going through a very depressing time, and trying to figure out why and how to recover from it! Well, it’s not going to be easy. But I’ll try. In the meantime, I am planning to write down the small things I learn every day. I am not sure how frequently I can do that. But I have to try. I know that life is like that: we try, we fail! But that doesn’t mean we should stop!

What I learned yesterday:

-> If I have a hash table like hash_test = {:'Cat' => "Meow", :'Dog' => 'Tom'}, then to add an entry to this hash I should do the following: hash_test.merge!({:'Pet' => 'Wow'})
-> Used the Datepicker jQuery function for picking dates; it’s a cool tool. Details: http://jqueryui.com/demos/datepicker/
-> Used .to_i to convert a string to an integer.
-> Using a named scope in the Animal model so that it can be chained with any other condition, everywhere. We define it like:
named_scope :is_pet, {:conditions => "animal.is_pet = 'true'"}
Now, later I can use it like: Animal.find_pets_with_ears(@num_of_ears).is_pet

I liked this use of named scope, need to understand it more, this is very useful!
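For the hash and .to_i notes above, a small plain-Ruby sketch (no Rails needed). The build_hash_test method and the extra :Bird entry are only there to make the example self-contained.

```ruby
# Two ways to add entries to a hash: []= for a single entry,
# merge! for one or more entries at once (both change the hash in place).
def build_hash_test
  hash_test = { :Cat => "Meow", :Dog => "Tom" }
  hash_test[:Pet] = "Wow"
  hash_test.merge!(:Bird => "Tweet")
  hash_test
end

puts build_hash_test.inspect

# to_i reads leading digits and falls back to 0 when there are none.
puts "42".to_i
puts "42abc".to_i
puts "abc".to_i
```

Because to_i never raises, Integer("abc") is the stricter choice when bad input should be reported as an error.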