Introduction to Apache Solr. Part 2: Querying Solr

Apache Solr [1] is a search engine framework written in Java and based on the Lucene search library [6]. In the previous article, we set up Apache Solr on the soon-to-be-released Debian GNU/Linux 11, initiated a single data core, uploaded example data, and demonstrated how to do a basic search within the data set using a simple query.

This is a follow-up article to the previous one. We will cover how to refine the query, formulate more complex search criteria with different parameters, and understand the Apache Solr query page’s different web forms. Also, we will discuss how to post-process the search result using different output formats such as XML, CSV, and JSON.

Querying Apache Solr

Apache Solr is designed as a web application and service that runs in the background. The result is that any client application can communicate with Solr by sending queries to it (the focus of this article), by manipulating the document core through adding, updating, and deleting indexed data, and by optimizing core data. There are two options: via the dashboard/web interface or by using an API to send a corresponding request.

It is common to use the first option for testing purposes and not for regular access. The figure below shows the Dashboard from the Apache Solr Administration User Interface with the different query forms in the web browser Firefox.

First, from the menu under the core selection field, choose the menu entry “Query”. Next, the dashboard will display several input fields as follows:

  • Request handler (qt):
    Define which kind of request you would like to send to Solr. You can choose between the default request handlers “/select” (query indexed data), “/update” (update indexed data), and “/delete” (remove the specified indexed data), or a self-defined one.
  • Query event (q):
    Define the field names and values to be selected.
  • Filter queries (fq):
    Restrict the superset of documents that can be returned without affecting the document score.
  • Sort order (sort):
    Define the sort order of the query results, either ascending or descending.
  • Output window (start and rows):
    Limit the output to the specified range of elements.
  • Field list (fl):
    Limits the information included in a query response to a specified list of fields.
  • Output format (wt):
    Define the desired output format. The default value is JSON.

Clicking on the Execute Query button runs the desired request. For practical examples, have a look below.
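
Behind the scenes, the form translates these fields into a single HTTP GET request against the selected request handler. A sketch of such a request, assuming the core cars used later in this article and the default values for start, rows, and wt:

curl 'http://localhost:8983/solr/cars/select?q=*:*&start=0&rows=10&wt=json'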

As the second option, you can send a request using an API. This is an HTTP request that can be sent to Apache Solr by any application. Solr processes the request and returns an answer. A special case of this is connecting to Apache Solr via the Java API. This has been outsourced to a separate project called SolrJ [7], a Java client API that spares the application from building the HTTP requests by hand.

Query syntax

The query syntax is best described in [3] and [5]. The different parameter names directly correspond with the names of the entry fields in the forms explained above. The table below lists them, plus practical examples.

Query Parameters Index

  • q:
    The main query parameter of Apache Solr: the field names and values. Documents are scored by their similarity to the terms in this parameter.
    Examples: id:5, cars:*adilla*, *:X5
  • fq:
    Restricts the result set to documents that match the filter without affecting the document score, for example, defined via the Function Range Query Parser.
    Example: model:[318 TO 323]
  • start:
    The offset of the first result to return (paging). The default value of this parameter is 0.
    Example: 5
  • rows:
    The number of results to return per page. The default value of this parameter is 10.
    Example: 15
  • sort:
    Specifies the comma-separated list of fields based on which the query results are to be sorted.
    Example: model asc
  • fl:
    Specifies the list of fields to return for each document in the result set.
    Examples: model, id,model
  • wt:
    The type of response writer used to output the result. The default value is json.
    Examples: json, xml

Searches are done via an HTTP GET request with the query string in the q parameter. The examples below clarify how this works; they use curl to send the queries to a locally installed Solr.

  • Retrieve all the datasets from the core cars
    curl http://localhost:8983/solr/cars/query?q=*:*
  • Retrieve all the datasets from the core cars that have an id of 5
    curl http://localhost:8983/solr/cars/query?q=id:5
  • Retrieve the field model from all the datasets of the core cars
    Option 1 (with escaped &):

    curl http://localhost:8983/solr/cars/query?q=id:*\&fl=model

    Option 2 (query in single ticks):

    curl 'http://localhost:8983/solr/cars/query?q=id:*&fl=model'
  • Retrieve all datasets of the core cars sorted by price in descending order, and output the fields make, model, and price, only (version in single ticks):
    curl http://localhost:8983/solr/cars/query -d '
      q=*:*&
      sort=price desc&
      fl=make,model,price '
  • Retrieve the first five datasets of the core cars sorted by price in descending order, and output the fields make, model, and price, only (version in single ticks):
    curl http://localhost:8983/solr/cars/query -d '
      q=*:*&
      rows=5&
      sort=price desc&
      fl=make,model,price '
  • Retrieve the first five datasets of the core cars sorted by price in descending order, and output the fields make, model, and price plus its relevance score, only (version in single ticks):
    curl http://localhost:8983/solr/cars/query -d '
      q=*:*&
      rows=5&
      sort=price desc&
      fl=make,model,price,score '
  • Return all stored fields as well as the relevance score:
    curl http://localhost:8983/solr/cars/query -d '
      q=*:*&
      fl=*,score '

Furthermore, you can define your own request handler to send the optional request parameters to the query parser in order to control what information is returned.
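
Custom request handlers are declared in the core's solrconfig.xml. A minimal sketch of such a declaration; the handler name /mysearch and its default parameters are chosen purely for illustration:

<requestHandler name="/mysearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="rows">5</str>
    <str name="fl">make,model,price</str>
  </lst>
</requestHandler>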

Query Parsers

Apache Solr uses a so-called query parser — a component that translates your search string into specific instructions for the search engine. A query parser stands between you and the document that you are searching for.

Solr comes with a variety of parser types that differ in the way a submitted query is handled. The Standard Query Parser works well for structured queries but is less tolerant of syntax errors. At the same time, both the DisMax and Extended DisMax Query Parser are optimized for natural language-like queries. They are designed to process simple phrases entered by users and to search for individual terms across several fields using different weighting.

Furthermore, Solr also offers so-called Function Queries that allow a function to be combined with a query in order to generate a specific relevance score. These parsers are named Function Query Parser and Function Range Query Parser. The example below uses the latter to pick all the data sets for “bmw” (stored in the data field make) with the models from 318 to 323:

curl http://localhost:8983/solr/cars/query -d '
  q=make:bmw&
  fq=model:[318 TO 323] '

Post-processing of results

Sending queries to Apache Solr is one part; post-processing the search result is the other. First, you can choose between different response formats, from JSON to XML, CSV, and a simplified Ruby format. Simply specify the corresponding wt parameter in the query. The code example below demonstrates this by retrieving the dataset in CSV format for all the items, using curl with an escaped &:

curl http://localhost:8983/solr/cars/query?q=id:5\&wt=csv

The output is a comma-separated list as follows:

In order to receive the result as XML data but with only the two output fields make and model, run the following query:

curl http://localhost:8983/solr/cars/query?q=*:*\&fl=make,model\&wt=xml

The output is different and contains both the response header and the actual response:

Curl simply prints the received data on stdout. This allows you to post-process the response using standard command-line tools, for example, jq [9] for JSON, xsltproc, xidel, or xmlstarlet [10] for XML, as well as csvkit [11] for the CSV format.
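
For instance, a JSON response can be piped straight into jq to extract only the document list; a small sketch, assuming the cars core from the earlier examples:

curl -s 'http://localhost:8983/solr/cars/query?q=*:*&fl=make,model' | jq '.response.docs[]'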

Conclusion

This article shows different ways of sending queries to Apache Solr and explains how to process the search result. In the next part, you will learn how to use Apache Solr to search in PostgreSQL, a relational database management system.

About the authors

Jacqui Kabeta is an environmentalist, avid researcher, trainer, and mentor. In several African countries, she has worked in the IT industry and NGO environments.

Frank Hofmann is an IT developer, trainer, and author and prefers to work from Berlin, Geneva, and Cape Town. Co-author of the Debian Package Management Book available from dpmb.org

Apache Solr: Setup a Node

Part 1: Setting up a single node

Today, electronically storing your documents or data on a storage device is quick and easy, and comparably cheap, too. Often a filename reference is used that is meant to describe what the document is about. Alternatively, data is kept in a Database Management System (DBMS) like PostgreSQL, MariaDB, or MongoDB, to name just a few options. Several storage media are either locally or remotely connected to the computer, such as a USB stick, an internal or external hard disk, Network Attached Storage (NAS), cloud storage, or GPU/Flash-based storage, as in an Nvidia V100 [10].

In contrast, the reverse process, finding the right documents in a document collection, is rather complex. It mostly requires detecting the file format without fault, indexing the document, and extracting the key concepts (document classification). This is where the Apache Solr framework comes in. It offers a practical interface to do the steps mentioned — building a document index, accepting search queries, doing the actual search, and returning a search result. Apache Solr thus forms the core for effective search on a database or document silo.

In this article, you will learn how Apache Solr works, how to set up a single node, index documents, do a search, and retrieve the result.

The follow-up articles build on this one, and, in them, we discuss other, more specific use cases such as integrating a PostgreSQL DBMS as a data source or load balancing across multiple nodes.

About the Apache Solr project

Apache Solr is a search engine framework based on the powerful Lucene search index server [2]. Written in Java, it is maintained under the umbrella of the Apache Software Foundation (ASF) [6]. It is freely available under the Apache 2 license.

The topic “Find documents and data again” plays a very important role in the software world, and many developers deal with it intensively. The website Awesomeopensource [4] lists more than 150 open-source search engine projects. As of early 2021, ElasticSearch [8] and Apache Solr/Lucene are the two top dogs when it comes to searching larger data sets. Developing your own search engine requires a lot of knowledge; Frank has been doing that with the Python-based AdvaS Advanced Search [3] library since 2002.

Setting up Apache Solr:

The installation and operation of Apache Solr are not complicated; it is simply a series of steps to be carried out by you. Allow about one hour for the result of the first data query. Furthermore, Apache Solr is not just a hobby project but is also used in professional environments. Therefore, the chosen operating system environment is designed for long-term use.

As the base environment for this article, we use Debian GNU/Linux 11, which is the upcoming Debian release (as of early 2021) and is expected to be available in mid-2021. For this tutorial, we expect that you have already installed it, either as the native system, in a virtual machine like VirtualBox, or in an AWS container.

Apart from the basic components, you need the following software packages to be installed on the system:

  • curl
  • default-java
  • libcommons-cli-java
  • libxerces2-java
  • libtika-java (a library from the Apache Tika project [11])

These packages are standard components of Debian GNU/Linux. If not yet installed, you can post-install them in one go as a user with administrative rights, for example, root or via sudo, shown as follows:

# apt-get install curl default-java libcommons-cli-java libxerces2-java libtika-java

Having prepared the environment, the 2nd step is the installation of Apache Solr. As of now, Apache Solr is not available as a regular Debian package. Therefore, it is required to retrieve Apache Solr 8.8 from the download section of the project website [9] first. Use the wget command below to store it in the /tmp directory of your system:

$ wget -O /tmp/solr-8.8.0.tgz https://downloads.apache.org/lucene/solr/8.8.0/solr-8.8.0.tgz

The switch -O is short for --output-document and makes wget store the retrieved tar.gz file under the given path. The archive has a size of roughly 190M. Next, unpack the archive into the /opt directory using tar. As a result, you will find two subdirectories, /opt/solr and /opt/solr-8.8.0, whereas /opt/solr is set up as a symbolic link to the latter one.
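
A minimal sketch of the extraction step, assuming the archive was stored under /tmp as above and run with administrative rights:

# tar xzf /tmp/solr-8.8.0.tgz -C /opt

Apache Solr comes with a setup script that you execute next, as follows: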

# /opt/solr-8.8.0/bin/install_solr_service.sh

This creates the Linux user solr that runs the Solr service, along with its home directory under /var/solr, establishes the Solr service with its corresponding node, and starts the Solr service on port 8983. These are the default values. If you are unhappy with them, you can modify them during installation or even later, since the installation script accepts corresponding switches for setup adjustments. We recommend having a look at the Apache Solr documentation regarding these parameters.
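
A sketch of such an adjusted call, using the switches the install script offers for this purpose (-i installation directory, -d data directory, -u user, -s service name, -p port); the values shown here simply repeat the defaults:

# /opt/solr-8.8.0/bin/install_solr_service.sh /tmp/solr-8.8.0.tgz -i /opt -d /var/solr -u solr -s solr -p 8983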

The Solr software is organized in the following directories:

  • bin
    contains the Solr binaries and files to run Solr as a service
  • contrib
    external Solr libraries such as data import handler and the Lucene libraries
  • dist
    internal Solr libraries
  • docs
    link to the Solr documentation available online
  • example
    example datasets or several use cases/scenarios
  • licenses
    software licenses for the various Solr components
  • server
    server configuration files, such as server/etc for services and ports

In more detail, you can read about these directories in the Apache Solr documentation [12].

Managing Apache Solr:

Apache Solr runs as a service in the background. You can start it in two ways, either using systemctl (first line) as a user with administrative permissions or directly from the Solr directory (second line). We list both terminal commands below:

# systemctl start solr
$ solr/bin/solr start

Stopping Apache Solr is done similarly:

# systemctl stop solr
$ solr/bin/solr stop

Restarting the Apache Solr service works the same way:

# systemctl restart solr
$ solr/bin/solr restart

Furthermore, the status of the Apache Solr process can be displayed as follows:

# systemctl status solr
$ solr/bin/solr status

The output lists the service file that was started, the corresponding timestamp, and log messages. The figure below shows that the Apache Solr service was started on port 8983 with process 632. The process has been running successfully for 38 minutes.

To see if the Apache Solr process is active, you may also cross-check using the ps command in combination with grep. This limits the ps output to all the Apache Solr processes that are currently active.

# ps ax | grep --color solr

The figure below demonstrates this for a single process. You see the call of Java accompanied by a list of parameters, for example, the memory usage (512M), the ports to listen on (8983 for queries, 7983 for stop requests), and the type of connection (http).

Adding users:

The Apache Solr processes run with a specific user named solr. This user is helpful in managing Solr processes, uploading data, and sending requests. Upon setup, the user solr does not have a password and needs one in order to log in and proceed further. As the user root, set a password for the user solr as follows:

# passwd solr

Solr Administration:

Managing Apache Solr is done using the Solr Dashboard. This is accessible via web browser from http://localhost:8983/solr. The figure below shows the main view.

On the left, you see the main menu that leads you to the subsections for logging, administration of the Solr cores, the Java setup, and the status information. Choose the desired core using the selection box below the menu. On the right side of the menu, the corresponding information is displayed. The Dashboard menu entry shows further details regarding the Apache Solr process, as well as the current load and memory usage.

Please note that the contents of the Dashboard change depending on the number of Solr cores and the documents that have been indexed. Changes affect both the menu items and the corresponding information that is visible on the right.

Understanding How Search Engines Work:

Simply speaking, search engines analyze documents, categorize them, and allow you to do a search based on their categorization. Basically, the process consists of three stages, termed crawling, indexing, and ranking [13].

Crawling is the first stage and describes a process by which new and updated content is collected. The search engine uses robots, also known as spiders or crawlers (hence the term crawling), to go through the available documents.

The second stage is called indexing. The previously collected content is made searchable by transforming the original documents into a format the search engine understands. Keywords and concepts are extracted and stored in (massive) databases.

The third stage is called ranking and describes the process of sorting the search results according to their relevance to the search query. It is common to display the results in descending order so that the result with the highest relevance to the searcher’s query comes first.

Apache Solr works similarly to the previously described three-stage process. Like the popular search engine Google, Apache Solr uses a sequence of gathering, storing, and indexing documents from different sources and makes them available/searchable in near real-time.

Apache Solr uses different ways to index documents including the following [14]:

  1. Using an Index Request Handler when uploading the documents directly to Solr. These documents should be in JSON, XML/XSLT, or CSV formats.
  2. Using the Extracting Request Handler (Solr Cell). The documents should be in PDF or Office formats, which are supported by Apache Tika.
  3. Using the Data Import Handler, which conveys data from a database and catalogs it using column names. The Data Import Handler fetches data from emails, RSS feeds, XML data, databases, and plain text files as sources.

A query handler is used in Apache Solr when a search request is sent. The query handler analyzes the given query based on the same concept as the index handler to match the query against previously indexed documents. The matches are ranked according to their appropriateness or relevance. A brief example of querying is demonstrated below.

Uploading Documents:

For the sake of simplicity, we use a sample dataset for the following example that is already provided by Apache Solr. Uploading documents is done as the user solr. Step 1 is the creation of a core with the name techproducts (for a number of tech items).

$ solr/bin/solr create -c techproducts

Everything is fine if you see the message “Created new core ‘techproducts’”. Step 2 is adding data (XML data from exampledocs) to the previously created core techproducts. The tool post is used, parameterized by -c (the name of the core) and the documents to be uploaded.

$ solr/bin/post -c techproducts solr/example/exampledocs/*.xml

This will result in the output shown below and will contain the entire call plus the 14 documents that have been indexed.

Also, the Dashboard shows the changes. A new entry named techproducts is visible in the dropdown menu on the left side, and the number of corresponding documents changed on the right side. Unfortunately, a detailed view of the raw datasets is not possible.

In case the core/collection needs to be removed, use the following command:

$ solr/bin/solr delete -c techproducts

Querying Data:

Apache Solr offers two interfaces to query data: via the web-based Dashboard and command-line. We will explain both methods below.

Sending queries via Solr dashboard is done as follows:

  • Choose the node techproducts from the dropdown menu.
  • Choose the entry Query from the menu below the dropdown menu.
    Entry fields pop up on the right side to formulate the query like request handler (qt), query (q), and the sort order (sort).
  • Choose the entry field Query, and change the content of the entry from “*:*” to “manu:Belkin”. This limits the search from “all fields with all entries” to “datasets that have the name Belkin in the manu field”. In this case, the name manu abbreviates manufacturer in the example data set.
  • Next, press the button with Execute Query. The result is a printed HTTP request on top, and a result of the search query in JSON data format below.

The command line accepts the same queries as the Dashboard. The difference is that you must know the names of the query fields. In order to send the same query as above, you have to run the following command in a terminal:

$ curl 'http://localhost:8983/solr/techproducts/query?q=manu:"Belkin"'

The output is in JSON format, as shown below. The result consists of a response header and the actual response. The response consists of two data sets.
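
The response has roughly the following structure; the values here are placeholders rather than the actual example data:

{
  "responseHeader": {
    "status": 0,
    "QTime": 2
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [ ... ]
  }
}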

Wrapping Up:

Congratulations! You have achieved the first stage with success. The basic infrastructure is set up, and you have learned how to upload and query documents.

The next step will cover how to refine the query, formulate more complex queries, and understand the different web forms provided by the Apache Solr query page. Also, we will discuss how to post-process the search result using different output formats such as XML, CSV, and JSON.

About the authors:

Jacqui Kabeta is an environmentalist, avid researcher, trainer, and mentor. In several African countries, she has worked in the IT industry and NGO environments.

Frank Hofmann is an IT developer, trainer, and author and prefers to work from Berlin, Geneva, and Cape Town. Co-author of the Debian Package Management Book available from dpmb.org

Best Self-Hosted Search Engines

Does your boss know that you’re looking for another job? Have you told your significant other about the inability to decide whether you want to have children or not? Do your parents know about your sexual orientation? Well, Google and other major search engines do.

“Most users search Google while signed in, so all of the information on their online life is available: YouTube searches, emails, and past search history,” says Adam Tauber, the lead developer of privacy-respecting metasearch engine Searx.

Of course, you could use Tor for anonymity and always delete all traces of your activity after each search, but doing so after each and every search would most likely get old pretty quickly. Instead, you should consider installing a self-hosted search engine capable of retrieving information for you without disclosing anything sensitive about you.

We have selected two such search engines, and we also introduce three additional search engines to show you that excellent alternatives to proprietary search engines such as Google or Bing already exist and are easier to install and use than you might think.

1. YaCy

YaCy is a free distributed peer-to-peer search engine whose core component is written in Java. Because all YaCy users are equal, and because the search engine doesn’t store user search requests, censorship is simply not possible.

Currently, YaCy indexes about 1.4 billion documents in its index thanks to the activity of more than 600 peer operators who contribute to it each month. For comparison, the Google Search index contains hundreds of billions of webpages and is well over 100,000,000 gigabytes in size.

While YaCy still has a long way to go before it can rival the largest centralized search engines in the world, it’s already usable as a search portal for private intranets and project-specific applications because YaCy can operate as a single search appliance without networking with other peers.

YaCy can be easily integrated into any web page thanks to its simple code snippets that can be effortlessly copied and pasted without any modification.

2. Searx

Searx is described as a privacy-respecting, hackable metasearch engine. It’s available under the GNU Affero General Public License version 3, and its main goal is to protect the privacy of its users by never sharing users’ IP addresses or search history with the search engines from which it gathers results.

“When using Searx, the IP address of Searx, a random User-Agent and a search query is sent to Google by default,” Adam Tauber, aka asciimoo, explains how his metasearch engine works. “Of course, you can customize Searx to forward other extra parameters like search language or the page number of the requested result page.”

Searx automatically blocks all tracking cookies served by the search engines to prevent results from being modified based on user profiling, which happens when a search engine individualizes results according to what it knows about the user. Searx is 100 percent free, and anyone can modify it as needed. You can even take the Searx code and run the metasearch engine on your own server, which should definitely address any concerns you might have regarding logs.

3. ElasticSearch

ElasticSearch is a search engine based on Lucene, a free and open-source information retrieval software library that is supported by the Apache Software Foundation and released under the Apache License.

ElasticSearch provides a full-text search engine with an HTTP web interface. The search engine can be used to search all kinds of documents, and it can be easily distributed across multiple nodes.

It’s possible to build a self-hosted search engine using ElasticSearch and Docker, and you can find a tutorial that describes the process here.

4. Ambar

Ambar is an open-source document search engine with many useful features. It supports automated crawling, tagging, and instant full-text search, just to give a few examples. One of the most exciting features of Ambar is its ability to perform OCR on images and PDF files. The supported languages include English, German, Russian, Italian, French, Spanish, Polish, and Dutch.

Ambar can be easily deployed with a single docker-compose file, and you can learn how to do it here.

5. Apache Solr

Written in Java, Apache Solr is an enterprise search platform that includes full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, and many other important features. It was created in 2004 for an in-house project at CNET Networks. CNET Networks kindly donated it to the Apache Software Foundation in 2006, where it graduated from incubation status into a standalone top-level project in 2007.

Today, Solr is a highly reliable, scalable, and fault-tolerant enterprise search platform that powers the search and navigation features of many of the world’s largest internet sites, including DuckDuckGo, eHarmony, and BestBuy.

How to Install and Configure YaCy

The installation of YaCy is very simple, and it takes only a couple of minutes because you don’t need to install an external database or web server—YaCy comes with everything needed.

  1. Go to the official website of YaCy and download the latest package for Linux.
  2. Install the OpenJDK 8 runtime environment.
    • If you’re using a Debian-based distribution, use the following command: $ sudo apt-get install openjdk-8-jre
    • If not, follow the instructions specific for your distribution.
  3. Extract the downloaded package to your preferred location.
  4. Go to the new folder and start the “startYACY.sh” script in Terminal.
  5. You should see a confirmation message informing you that YaCy started as a daemon.
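
The steps above can be summarized as a small shell sketch for a Debian-based system; the archive name below is a placeholder for whatever release you downloaded:

$ sudo apt-get install openjdk-8-jre
$ tar xzf yacy_v1.x_release.tar.gz    # placeholder archive name; the archive typically unpacks into a yacy/ directory
$ cd yacy
$ ./startYACY.sh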

Conclusion

Search engines know more about us than most people would like to admit. If you would like to stop feeding big corporations with juicy data, you can take things into your own hands and set up a self-hosted search engine to protect your privacy. Although self-hosted search engines still have a long way to go to become fully usable, the potential for them to outperform the likes of Google is there and capturing it is just a matter of attracting more users.

Apache Solr Tutorial

In this lesson, we will see how we can use Apache Solr to store data and how we can run various queries upon it.

What is Apache Solr

Apache Solr is one of the most popular NoSQL databases which can be used to store data and query it in near real-time. It is based on Apache Lucene and is written in Java. Just like Elasticsearch, it supports database queries through REST APIs. This means that we can use simple HTTP calls and use HTTP methods like GET, POST, PUT, DELETE etc. to access data. It also provides an option to get data in the form of XML or JSON through the REST APIs.

Architecture: Apache Solr

Before we can start working with Apache Solr, we must understand the components that constitute Apache Solr. Let’s have a look at some of the components it has:

Apache Solr Architecture

Note that only the major components of Solr are shown in the figure above. Let’s understand their functionality here as well:

  • Request Handlers: The requests a client makes to Solr are managed by a Request Handler. The request can be anything from adding a new record to updating an index in Solr. Handlers identify the type of request from the HTTP method used with the request mapping.
  • Search Component: This is one of the most important components Solr is known for. The Search Component takes care of performing search-related operations like fuzziness, spell checks, term queries, etc.
  • Query Parser: This is the component which actually parses the query a client passes to the request handler and breaks it into multiple parts which can be understood by the underlying engine.
  • Response Writer: This component is responsible for managing the output format for the queries passed to the engine. The Response Writer allows us to provide output in various formats like XML, JSON, etc.
  • Analyzer/Tokenizer: The Lucene Engine understands queries in the form of multiple tokens. Solr analyzes the query, breaks it into multiple tokens, and passes them to the Lucene Engine.
  • Update Request Processor: When a query is run and it performs operations like updating an index and the data related to it, the Update Request Processor component is responsible for managing the data in the index and modifying it.

Getting Started with Apache Solr

To start using Apache Solr, it must be installed on the machine. To do this, read Install Apache Solr on Ubuntu.

Make sure you have an active Solr installation if you want to try the examples we present later in the lesson, and that the admin page is reachable on localhost:

Apache Solr Homepage

Inserting Data

To start, let us consider a collection in Solr which we call linux_hint_collection. There is no need to explicitly define this collection, as the collection will be created automatically when we insert the first object. Let’s try our first REST API call to insert a new object into the collection named linux_hint_collection.

Inserting Data

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/linux_hint_collection/update/json/docs' --data-binary '
{
"id": "iduye",
"name": "Shubham"
}'

Here is what we get back with this command:

Command to insert data into Solr

Data can also be inserted using the Solr Homepage we looked at earlier. Let’s try this here so that things are clear:

Insert Data via Solr Homepage

As Solr has an excellent way of interacting with HTTP RESTful APIs, we will be demonstrating DB interaction using the same APIs from now onwards and won’t focus much on inserting data through the Solr web page.

List All Collections

We can list all collections in Apache Solr using a REST API as well. Here is the command we can use:

List All Collections

curl 'http://localhost:8983/solr/admin/collections?action=LIST&wt=json'

Let’s see the output for this command:

We see two collections here which exist in our Solr installation.

Get Object by ID

Now, let us see how we can GET data from Solr collection with a specific ID. Here is the REST API command:

Get Object by ID

curl http://localhost:8983/solr/linux_hint_collection/get?id=iduye

Here is what we get back with this command:

Get All Data

In our last REST API call, we queried data using a specific ID. This time, we will get all the data present in our Solr collection.

Get All Data

curl http://localhost:8983/solr/linux_hint_collection/select?q=*:*

Here is what we get back with this command:

Notice that we have used ‘*:*’ in the query parameter. This specifies that Solr should return all data present in the collection. Even though we have specified that all data should be returned, Solr understands that the collection might have a large amount of data in it, and so it will only return the first 10 documents.
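
If more documents are needed in a single response, the rows parameter can be raised explicitly; a small sketch:

curl 'http://localhost:8983/solr/linux_hint_collection/select?q=*:*&rows=100'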

Deleting All Data

Until now, all the APIs we tried used the JSON format. This time, we will try the XML query format. Using the XML format is extremely similar to JSON, as XML is self-descriptive as well.

Let’s try a command to delete all data we have in our collection.

Deleting All Data

curl "http://localhost:8983/solr/linux_hint_collection/update?commit=true" -H "Content-Type: text/xml" --data-binary "*:*"

Here is what we get back with this command:

Delete all data using XML query

Now, if we try getting all the data again, we will see that no data is available:

Get All data

Total Object Count

For a final curl command, let’s see a command with which we can find the number of objects present in an index. Here is the command:

Total Object Count

curl 'http://localhost:8983/solr/linux_hint_collection/query?debug=query&q=*:*'

Here is what we get back with this command:

Count number of Objects
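
Alternatively, the count alone can be read from the numFound field of the response; a small sketch, assuming the jq tool is installed:

curl -s 'http://localhost:8983/solr/linux_hint_collection/select?q=*:*&rows=0' | jq '.response.numFound'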

Conclusion

In this lesson, we looked at how we can use Apache Solr and pass queries to it using curl, in both JSON and XML formats. We also saw that the Solr admin panel is useful in the same manner as the curl commands we studied.

Install Apache Solr on Ubuntu

In this quick post, we will see how we can install one of the most popular distributed free-text search databases, Apache Solr, on Ubuntu and start using it as well. We will get started now. Read posts about Neo4J, Elasticsearch, and MongoDB as well.

Apache Solr

Apache Solr is one of the most popular NoSQL databases which can be used to store data and query it in near real-time. It is based on Apache Lucene and is written in Java. Just like Elasticsearch, it supports database queries through REST APIs. This means that we can use simple HTTP calls and use HTTP methods like GET, POST, PUT, DELETE etc. to access data. It also provides an option to get data in the form of XML or JSON through the REST APIs.

In this lesson, we will study how to install Apache Solr on Ubuntu and start working with it through a basic set of Database queries.

Installing Java

To install Solr on Ubuntu, we must install Java first. Java might not be installed by default. We can verify it by using this command:

java -version

When we run this command, we get the following output:

We will now install Java on our system. Use this command to do so:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Once these commands are done running, we can again verify that Java is now installed by using the same command.

Installing Apache Solr

We will now start with installing Apache Solr which is actually just a matter of a few commands.

To install Solr, we must know that Solr doesn’t run on its own; rather, it needs a Java servlet container to run, for example, the Jetty or Tomcat servlet containers. In this lesson, we will be using the Tomcat server, but using Jetty is fairly similar.

The good thing about Ubuntu is that it provides three packages with which Solr can be easily installed and started. They are:

  • solr-common
  • solr-tomcat
  • solr-jetty

The names are self-descriptive: solr-common is needed for both containers, whereas solr-jetty is needed only for Jetty and solr-tomcat only for the Tomcat server. As we have already installed Java, we can download the Solr package using this command:

sudo wget http://www-eu.apache.org/dist/lucene/solr/7.2.1/solr-7.2.1.zip

As this package brings a lot of packages with it, including the Tomcat server, it can take a few minutes to download and install everything. Download the latest version of the Solr files from here.

Once the installation has completed, we can unzip the file using the following command:

unzip -q solr-7.2.1.zip

Now, change into the extracted directory, and you will see the following files inside:

Starting Apache Solr Node

Now that we have downloaded the Apache Solr packages on our machine, we can do more as a developer from a node interface, so we will start a node instance for Solr where we can actually create collections, store data, and run search queries.

Run the following command to start cluster setup:

./bin/solr start -e cloud

We will see the following output with this command:

Many questions will be asked, but we will set up a single-node Solr cluster with all of the default configuration. As shown in the final step, the Solr node interface will be available at:

localhost:8983/solr

where 8983 is the default port for the node. Once we visit the above URL, we will see the node interface:

Using Collections in Solr

Now that our node interface is up and running, we can create a collection using the command:

./bin/solr create_collection -c linux_hint_collection

and we will see the following output:

Ignore the warnings for now. We can even see the collection in the node interface now:

Now, we can start by defining a schema in Apache Solr by selecting the schema section in the node interface.
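
Fields can also be added without the web form by posting to the Schema API; a minimal sketch in which the field name and type are chosen purely for illustration:

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/linux_hint_collection/schema' --data-binary '
{
  "add-field": {
    "name": "name",
    "type": "text_general",
    "stored": true
  }
}'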

We can now start inserting data into our collections. Let’s insert a JSON document into our collection here:

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/linux_hint_collection/update/json/docs' --data-binary '
{
"id": "iduye",
"name": "Shubham"
}'

We will see a success response against this command:

As a final command, let us see how we can GET the document we just inserted from the Solr collection:

curl http://localhost:8983/solr/linux_hint_collection/get?id=iduye

We will see the following output: