Some Notes from the Book “Solr 1.4 Enterprise Search Server”:
Solr Index:
An index is basically like a single-table database schema. Imagine a massive spreadsheet, if you will. In spite of this limitation, nothing stops you from putting different types of data (say, artists and tracks from MusicBrainz) into a single index, which in effect mitigates the limitation. All you have to do is use different fields for the different document types, plus one field to discriminate between the types. An identifier field would need to be unique across all documents in this index, no matter the type, so you could easily do this by concatenating the document type and the entity’s identifier (for example, artist:534445). This may appear really ugly from a relational database design standpoint, but this isn’t a database.
Single Combined Index:
<field name="id" … /> <!-- example: "artist:534445" -->
<field name="type" … /> <!-- … -->
<field name="name" … />
<!-- track fields: -->
<field name="PUID" … />
<field name="num" … /> <!-- i.e. the track # on the release -->
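As a rough illustration of the combined-index idea, the sketch below builds artist and track documents that share one index. The helper names and sample values are hypothetical; only the field names (id, type, name, PUID, num) and the "artist:534445" id format come from the notes above.

```python
def make_doc_id(doc_type, entity_id):
    """Build an index-wide unique id such as "artist:534445"."""
    return f"{doc_type}:{entity_id}"


def make_artist_doc(entity_id, name):
    # "type" is the discriminator field; "name" is shared across types.
    return {"id": make_doc_id("artist", entity_id), "type": "artist", "name": name}


def make_track_doc(entity_id, name, puid, num):
    # PUID and num are track-only fields.
    return {
        "id": make_doc_id("track", entity_id),
        "type": "track",
        "name": name,
        "PUID": puid,
        "num": num,
    }


# Hypothetical sample data: the same numeric identifier used by two entity types.
docs = [
    make_artist_doc(534445, "Example Artist"),
    make_track_doc(534445, "Example Track", "hypothetical-puid", 1),
]

# The type prefix keeps the ids distinct across document types.
assert len({d["id"] for d in docs}) == len(docs)
```

Without the type prefix, the artist and the track in this sample would collide on the same identifier.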
Problems with that?
- There may be namespace collision problems unless you prefix the field names by type such as: artist_startDate and track_PUID.
- If you share the same field for different things (like the name field in the example that we have just seen), then there are some problems that can occur when using that field in a query and while filtering documents by document type.
- Prefix, wildcard, and fuzzy queries will take longer and will be more likely to reach internal scalability thresholds.
- Committing changes to a Solr index invalidates the caches used to speed up querying. If this happens often, and the changes are usually to one type of entity in the index, then you will get better query performance by using separate indices.
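One way to cope with a shared field such as name is to always restrict a search to one document type with a filter query. The sketch below only builds the request parameters; q and fq are standard Solr request parameters, while the helper name, the sample query, and the base URL are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_select_params(user_query, doc_type):
    """Return Solr request parameters restricted to one document type."""
    return {
        "q": user_query,           # main query, e.g. against the shared name field
        "fq": f"type:{doc_type}",  # filter query on the discriminator field
    }


params = build_select_params("name:example", "track")

# Hypothetical base URL; adjust host, port, and path for a real deployment.
url = "http://localhost:8983/solr/select?" + urlencode(params)
```

Every search form that touches a shared field needs a filter like this, which is part of the bookkeeping cost of a combined index.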
Schema Design:
While doing schema design, a key thing to come to grips with is that a Solr schema strategy is driven by how it is queried and not by a standard third normal form decomposition of the data.
- First determine which searches are going to be powered by Solr.
- Second determine the entities returned for each search.
- For each entity type, find all of the data in the schema that will be needed across all searches of it. By “all searches of it,” I mean that there might actually be multiple search forms, as identified in Step 1. Such data includes any data queried for (that is, criteria to determine whether a document matches or not) and any data that is displayed in the search results.
- If some data is shown in the search results but is not queried, sorted, or faceted on, is not used with the highlighter, and is not used by any other Solr feature except to be returned in search results, then it is not necessary to include it in the schema for this entity. Let’s say, for the sake of argument, that a track’s name is the only information that is queryable, sortable, and so on when querying for tracks. You can then opt not to inline the artist name, for example, into the track entity. When your application queries Solr for tracks and needs to render search results with the artist’s name, the onus is on your application to get this data from somewhere; it won’t be in the search results from Solr. The application might look it up in a database, or perhaps even query Solr for the artist entity if it is indexed there.
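The last point can be sketched as follows: the application takes track hits that carry only what Solr stores and fills in the artist name from another source. Everything here is hypothetical (the function name, the shape of the hits, and the in-memory dictionary that stands in for a database or a second Solr query).

```python
def attach_artist_names(track_results, lookup_artist_name):
    """Return track hits with an artist_name added from an external lookup."""
    return [
        {**track, "artist_name": lookup_artist_name(track["id"])}
        for track in track_results
    ]


# Stand-in for a database table or a second query (hypothetical data),
# mapping a track id to the name of its artist.
ARTIST_NAME_BY_TRACK = {
    "track:10": "Example Artist",
    "track:11": "Another Artist",
}

# Hits as they might come back from Solr: no artist name included.
tracks = [
    {"id": "track:10", "name": "First Track"},
    {"id": "track:11", "name": "Second Track"},
]

rendered = attach_artist_names(tracks, ARTIST_NAME_BY_TRACK.get)
```

The trade-off is a smaller index in exchange for an extra lookup at render time.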
Field Types:
The first section of the schema is the definition of the field types. In other words, these are the data types. This section is enclosed in the <types> element and takes up a large part of the file. The field types declare the types of fields, such as booleans, numbers, dates, and various text flavors.
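To make the shape of that section concrete, the sketch below parses a heavily simplified schema fragment and lists its field types. The fragment is an illustration only: the type names are arbitrary, and a real text type would also carry analyzer definitions that are omitted here.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative fragment; a real schema.xml contains much more.
SCHEMA_SNIPPET = """
<schema name="example">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="boolean" class="solr.BoolField"/>
    <fieldType name="int" class="solr.TrieIntField"/>
    <fieldType name="date" class="solr.TrieDateField"/>
    <fieldType name="text" class="solr.TextField"/>
  </types>
</schema>
"""

root = ET.fromstring(SCHEMA_SNIPPET)

# Map each declared type name to the class that implements it.
field_types = {ft.get("name"): ft.get("class") for ft in root.find("types")}

for name, cls in field_types.items():
    print(f"{name}: {cls}")
```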
Using copyField:
Closely related to the field definitions are copyField directives, which are specified at some point after the fields element, not within it. A copyField directive looks like this:
<copyField source="r_name" dest="r_name_sort" />
These are really quite simple. At index-time, each copyField is evaluated for each input document.
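The index-time behavior can be mimicked with a small simulation. This is only a sketch of the idea and not Solr’s implementation: it assumes exact field names with no wildcard sources and one source per destination, and the sample document is hypothetical.

```python
def apply_copy_fields(doc, copy_fields):
    """Return a new document with each (source, dest) directive applied."""
    result = dict(doc)
    for source, dest in copy_fields:
        if source in doc:
            # The destination receives the original input value of the source.
            result[dest] = doc[source]
    return result


# Mirrors <copyField source="r_name" dest="r_name_sort" />
COPY_FIELDS = [("r_name", "r_name_sort")]

# Hypothetical input document.
doc = {"id": "release:1", "r_name": "Example Release"}

indexed = apply_copy_fields(doc, COPY_FIELDS)
```

Since the destination field can have a different field type, the same input text can be indexed one way for searching and another way for sorting.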