Saturday, September 27, 2014

Nutch + Solr for a local filesystem search engine

THE PROBLEM


Suppose you work in an organization that develops different software projects, each with its own documentation system (one is a C++ project documented with e.g. Doxygen, another is a Python project documented with Sphinx, etc.). Some of this documentation is not even software-related: just a set of web pages that describe a processing campaign, or a set of PDF documents with system requirements or architecture definitions. In order to have a centralized point that indexes all this information, you might want a search tool that acts as a Knowledge Management System (KMS).

Additionally, most of this documentation might be proprietary and not publicly published on the web, which makes Google unsuitable for the task.

THE SOLUTION


A possible solution to this problem is building a system based on Nutch (a web crawler and indexer) and Solr (a search platform).


SOFTWARE


To implement this solution, I've used the following components (a bit outdated at the time of publishing this tutorial due to the old Java version I am using):

  • Apache Nutch 1.7
  • Apache Solr 4.4.0


NUTCH: CONFIGURATION & CRAWLING


We are using Nutch to crawl through all our content, parse it and build a database that will then be pushed to Solr (the actual search engine).

The steps I will summarize here are based on instructions outlined in several online Nutch and Solr tutorials.

Once you have downloaded the Nutch tar (or zip) file, you will have to do the following three actions:
  • Create a list of seeds. These seeds are used by Nutch as the starting point for the crawl. Nutch will start from those seeds and crawl through the links they contain, then the links of those links, and so on (up to the crawl depth, which can be configured). To create this list:

  • # Create a "urls" folder and create a file with the seeds
    > cd /path/to/apache-nutch-1.7
    > mkdir urls
    > touch urls/seeds
    

    In the seeds file, you can list all the URLs or paths to the resources you intend to include in your KMS, for instance:

    file:/home/myusername/path/to/resource1
    file:/usr1/project1/documentation
    file:/usr1/project2/docs
    

    Be careful with the slashes after the file: protocol. The number of slashes required differs depending on whether you are trying to access a file share or the local filesystem (see the examples at the end of this list).

  • Set up a resource filter. In order to control which addresses to fetch (and which ones to skip), as well as the file types that should be included or excluded from the crawling process, Nutch includes the file regex-urlfilter.txt in its configuration folder (apache-nutch-1.7/conf). This file contains regular expression rules that define the resource filter. In our case, we will have to comment out the line that skips the file protocol:

    # The negative sign at the start of the line indicates to 
    # exclude all entries that match the regexp that follows
    -^(file|ftp|mailto):
    

    to
    # We do want to accept all entries with a file protocol since we are not
    # only interested in remote HTML documents but also files in the 
    # local filesystem
    #-^(file|ftp|mailto):
    

    and we will have to include specific regexps to accept the resources listed in the seeds file, otherwise they will be ignored. Change:

    # Regexp that accepts everything else
    +.
    

    to
    # We want to include only the resources we are interested in
    #+.
    +^file:/home/myusername/path/to/resource1
    +^file:/usr1/project1/documentation
    +^file:/usr1/project2/docs
    

  • Define the behavior of Nutch. We will have to modify the file conf/nutch-site.xml to instruct Nutch which plugins to include (in particular, we have to make sure that protocol-file is included so that Nutch is able to crawl through the local filesystem). In this file, we will have to include the following lines between the <configuration></configuration> tags:

    <property>
        <name>http.agent.name</name>
        <value>KMS Nutch</value>
    </property>
    
    <property>
        <name>plugin.includes</name>
        <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
        <description>List of plugins to be enabled in Nutch</description>
    </property>
    
    <property>
        <name>file.content.ignored</name>
        <value>false</value>
        <description>We want to store the content of the files in the database (to be indexed later)</description>
    </property>
    
    <property>
        <name>file.content.limit</name>
        <value>65536</value>
        <description>The length limit for downloaded content using the file://
        protocol, in bytes.</description>
    </property>
    
    <property>
        <name>file.crawl.parent</name>
        <value>false</value>
        <description>Prevents Nutch from crawling parent directories</description>
    </property>
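
As a side note on the seed URLs, the number of slashes after the file: scheme matters. A few illustrative forms (the host and share names below are hypothetical, adapt them to your setup):

# Local filesystem path (no host component)
file:/home/myusername/path/to/resource1
# Equivalent form with an explicit empty host
file:///home/myusername/path/to/resource1
# File share (the first component after // is interpreted as the host)
file://fileserver/share/docs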
    
Once the setup is done, the crawl is launched with the command:

> cd /path/to/apache-nutch-1.7
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50

This will crawl, fetch and parse all the resources from the seeds we have specified and build a database with this content. This content will later be pushed to Solr.
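
To get a quick sanity check that the crawl actually fetched something, you can inspect the crawl database with Nutch's readdb tool (a minimal sketch, assuming the crawl directory layout used above):

> cd /path/to/apache-nutch-1.7
> # Print statistics about the crawl database (fetched, unfetched
> # and failed URLs, among others)
> bin/nutch readdb crawl/crawldb -stats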

SOLR: CONFIGURATION & PUSHING DATA


Once the package has been downloaded, uncompress the file and go to the Solr directory that has been created. We will be working with the example that ships with the distribution as a baseline.

> cd /path/to/solr-4.4.0/example
> java -jar start.jar

This will start the Solr server. We can now go to a browser and open the local Solr admin page to check that the server is indeed running:

http://localhost:8983/solr/admin/

In order to make things work, we need to change some files to tell Solr the format of the data it will be receiving (the one created by Nutch):


  • Change solr-4.4.0/example/solr/collection1/conf/solrconfig.xml to add a request handler for Nutch (we will check that the handler responds, as shown below):

    <requestHandler name="/nutch" class="solr.SearchHandler" >
        <lst name="defaults">
            <str name="defType">dismax</str>
            <str name="echoParams">explicit</str>
            <float name="tie">0.01</float>
            <str name="qf">
                content^0.5 anchor^1.0 title^1.2
            </str>
            <str name="pf">
                content^0.5 anchor^1.5 title^1.2 site^1.5
            </str>
            <str name="fl">
                url
            </str>
            <str name="mm">
                2&lt;-1 5&lt;-2 6&lt;90%
            </str>
            <int name="ps">100</int>
            <bool name="hl">true</bool>
            <str name="q.alt">*:*</str>
            <str name="hl.fl">title url content</str>
            <str name="f.title.hl.fragsize">0</str>
            <str name="f.title.hl.alternateField">title</str>
            <str name="f.url.hl.fragsize">0</str>
            <str name="f.url.hl.alternateField">url</str>
            <str name="f.content.hl.fragmenter">regex</str>
        </lst>
    </requestHandler>
    

  • We will also need to tell Solr the format of the database produced by Nutch. We do that by adding the description of the fields provided by Nutch to the schema file solr-4.4.0/example/solr/collection1/conf/schema.xml:

    <field name="digest"   type="text_general" stored="true" indexed="true"/>
    <field name="boost"    type="text_general" stored="true" indexed="true"/>
    <field name="segment"  type="text_general" stored="true" indexed="true"/>
    <field name="host"     type="text_general" stored="true" indexed="true"/>
    <field name="site"     type="text_general" stored="true" indexed="true"/>
    <field name="content"  type="text_general" stored="true" indexed="true"/>
    <field name="tstamp"   type="text_general" stored="true" indexed="false"/>
    <field name="url"      type="string"       stored="true" indexed="true"/>
    <field name="anchor"   type="text_general" stored="true" indexed="false" multiValued="true"/>
    

    Make sure that the content field has both indexed and stored set to true, so that we can search the document body (not just the title) and Solr can return and highlight it in the results.
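
    Before pushing any data, you can already check that the new /nutch handler is registered by querying it directly after restarting Solr (a minimal sketch; the collection1 core name comes from the baseline example and the query term is arbitrary):

    > # Should return an empty result set (numFound=0) but no error
    > curl "http://localhost:8983/solr/collection1/nutch?q=test&wt=json"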

Once this setup has been done, we are ready to push the data we fetched with Nutch into Solr. We do this step by issuing the following commands:

> cd /path/to/apache-nutch-1.7
> # Pick the last segment that has been processed
> export SEGMENT=`ls -tr crawl/segments | tail -1`
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/segments/$SEGMENT
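
If you want to index all the segments produced so far instead of just the latest one, you can pass them all at once (same layout assumptions as above):

> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/segments/*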

Now the data is ready to be used in Solr. We go back to the browser and open the following URL (we might need to restart Solr):

http://localhost:8983/solr/admin/

and we select the core that corresponds to the baseline example we were using (collection1). This core has an option to perform a query, so we can query Solr (in the q field) with any string we want. If everything went well, we should see the search results in different formats (XML, CSV, ...). Unfortunately, this example does not provide a pretty HTML front-end to show the results; that might be the subject of a future blog entry...
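
The same query can also be issued from the command line against the standard select handler, where the wt parameter switches the response format (again a sketch, using the collection1 core and a made-up search term):

> # Search the indexed content and return matching URLs and titles as JSON
> curl "http://localhost:8983/solr/collection1/select?q=content:requirements&fl=url,title&wt=json"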


