Linux, DevOps, Middleware and Cloud: Apache Solr Installation

Sunday, April 14, 2013

Apache Solr Installation

Hi Friends,

In the last post we discussed Apache Solr and its features. In this post we will discuss its setup.

Setup

As the very first step, you should follow the official tutorial which covers the basic aspects of any search use case:

Indexing - Get data of any form into Solr. Examples: JSON, XML, CSV and SQL databases. This step creates the inverted index, i.e. it links every term to the documents that contain it.
Querying - Ask Solr to return the most relevant documents for the user's query.
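
To make the idea of an inverted index a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration of the concept and not how Solr or Lucene implement it internally; the sample documents and the function names build_inverted_index and search are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, term):
    """Return the ids of all documents that contain the term."""
    return index.get(term.lower(), [])


# Hypothetical sample data, for illustration only.
docs = {1: "Superman returns", 2: "Batman begins", 3: "Superman and Batman"}
index = build_inverted_index(docs)
print(search(index, "superman"))
```

Querying then becomes a lookup of the term in the index instead of a scan over every document.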

To follow the official tutorial you'll have to download Java and the latest version of Solr here. More information about installation is available at the official description.

Next you'll have to decide which servlet container to run Solr in. The official tutorial uses Jetty, but you can also use Tomcat or JBoss.

Indexing

If you've followed the official tutorial you have pushed some XML files into the Solr index. This process is called indexing or feeding. There are a lot more possibilities to get data into Solr:

Using the Data Import Handler (DIH) is a really powerful, language-neutral option. It allows you to read from a SQL database, from CSV and XML files, RSS feeds, emails, etc. without any Java knowledge. DIH handles full-imports and delta-imports. Delta-imports are useful when only a small number of documents were added, updated or deleted, because only those changes have to be imported.
The HTTP interface is used by the post tool, which you have already used in the official tutorial to index XML files.
Client libraries in different languages also exist (e.g. SolrJ for Java, or libraries for Python).
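
As a rough sketch of the HTTP option, the snippet below builds a JSON update request with Python's standard library only. The URL (host, port and path of the update handler) and the field names are assumptions for this example; check them against your own Solr version and schema. The request is only built here, and the hypothetical send helper would actually post it to a running Solr.

```python
import json
import urllib.request

# Assumption: a local Solr whose update handler accepts JSON at this URL.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying the documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request; this needs a running Solr instance."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical document; the field names must exist in your schema.xml.
request = build_update_request([{"id": "1", "title": "Superman"}])
print(request.get_method(), request.full_url)
```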

Before indexing you'll have to decide which data fields should be searchable and how the fields should get indexed. For example, when you have a field with HTML in it, you can strip the markup, tokenize the text into 'searchable terms', lower-case the terms and finally stem them. In contrast, if you have a field with text in it that should not be interpreted (e.g. URLs), you shouldn't tokenize it and should use the default field type string instead. Please refer to the official documentation about field and field type definitions in the schema.xml file.
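
The steps for an HTML field can be sketched in plain Python as follows. This is only a toy version to show the order of the steps: the regular expressions and the naive_stem function are simplifications invented for this example, and in Solr you would configure the corresponding char filter, tokenizer and token filters in schema.xml.

```python
import re


def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def naive_stem(term):
    """Toy stemmer: drop a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def analyze(text):
    """Strip markup, lower-case, tokenize and stem a text field."""
    tokens = re.findall(r"[a-z0-9]+", strip_html(text).lower())
    return [naive_stem(token) for token in tokens]


html_terms = analyze("<p>Flying <b>Heroes</b></p>")
# A string-like field is kept as one single, unchanged term.
url_term = "http://example.com/Heroes"
print(html_terms, url_term)
```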

When designing an index keep in mind the advice from Mauricio: "The document is what you will search for." For example, if you have tweets and you want to search for similar users, you'll need to set up a user index that is created from the tweets. Then every document is a user. If you want to search for tweets, then set up a tweet index; then every document is a tweet. Of course, you can set up both indices with the multi index options of Solr.
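
Here is a small sketch of that idea: the same tweets are turned either into one document per tweet or into one document per user. The sample tweets and the field names id, user and text are assumptions for this example, not a required schema.

```python
from collections import defaultdict

# Hypothetical sample data.
tweets = [
    {"id": "t1", "user": "alice", "text": "Solr is fast"},
    {"id": "t2", "user": "bob", "text": "Trying Tomcat"},
    {"id": "t3", "user": "alice", "text": "Indexing tweets"},
]


def tweet_documents(tweets):
    """Tweet index: every document is one tweet."""
    return [{"id": t["id"], "user": t["user"], "text": t["text"]} for t in tweets]


def user_documents(tweets):
    """User index: every document is one user, built from all of that user's tweets."""
    texts = defaultdict(list)
    for t in tweets:
        texts[t["user"]].append(t["text"])
    return [{"id": user, "text": " ".join(parts)} for user, parts in sorted(texts.items())]


print(user_documents(tweets))
```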

Please also note that there is a project called Solr Cell which lets you extract the relevant information out of several different document types with the help of Tika.

Querying

For debugging, it is very convenient to query Solr through the HTTP interface in a browser and get back XML. Firefox displays the XML nicely.

You can also do a lot more; one other concept is boosting. In Solr you can boost while indexing and while querying. To prefer matches in the title, write:

q=title:superman^2 subject:superman

When using the dismax request handler write:

q=superman&qf=title^2 subject
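
If you send such queries from code instead of typing them into the browser, the parameters have to be URL-encoded. A minimal sketch with Python's standard library is shown below. The base URL is an assumption for a local default install, and selecting dismax through the defType parameter is one way to do it that may differ in your setup.

```python
from urllib.parse import parse_qs, urlencode

# Assumption: a local Solr with the default select handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def standard_boost_url(term):
    """Boost title matches with the standard query syntax."""
    params = {"q": f"title:{term}^2 subject:{term}"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def dismax_boost_url(term):
    """Boost title matches with dismax through the qf parameter."""
    params = {"q": term, "defType": "dismax", "qf": "title^2 subject"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


standard_url = standard_boost_url("superman")
dismax_url = dismax_boost_url("superman")
# Decode the query string again to check what Solr would receive.
decoded = parse_qs(dismax_url.split("?", 1)[1])
print(standard_url)
print(dismax_url)
```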

Check out all the various query options like fuzzy search, spellchecking of the query input, facets, collapsing and suffix query support.

Hope this will help!!
