THE PROBLEM
Suppose you work in an organization that has different software projects, each with its own documentation system (one is a C++ project documented with e.g. Doxygen, another is a Python project documented with Sphinx, and so on). Some of this documentation is not even software-related: it may just be a set of web pages describing a processing campaign, or a set of PDF documents with system requirements or architecture definitions. In order to have a centralized point that indexes all this information, you might want a search tool that acts as a Knowledge Management System (KMS).
Additionally, most of this documentation might be proprietary and not publicly published on the web, which makes Google inappropriate for this task.
THE SOLUTION
A possible solution to this problem is to build a system based on Nutch (a web crawler and indexer) and Solr (a search platform).
SOFTWARE
To implement this solution, I've used the following components (a bit outdated at the time of publishing this tutorial due to the old Java version I am using):
- Operating system Mac OS X 10.6.8
- Apache Nutch (version 1.7.0)
- Apache Solr (version 4.4.0)
- Java, version "1.6.0_51"
NUTCH: CONFIGURATION & CRAWLING
We are using Nutch to crawl through all our content, parse it and build a database that will then be pushed to Solr (the actual search engine).
The steps I will summarize here are based on the instructions outlined here, here and here.
Once you have downloaded the Nutch tar (or zip) file, you will have to do the following three actions:
- Create a list of seeds. These seeds are used by Nutch as the starting points for the crawl: Nutch will start from them, follow the links they contain, then the links of those links, and so on (depending on the depth of the crawl, which can be configured). To create this list:
# Create a "urls" folder and create a file with the seeds > cd /path/to/apache-nutch-1.7 > mkdir urls > touch urls/seeds
In the seeds file, you can list all the URLs or paths to the resources you intend to include in your KMS, for instance:
file:/home/myusername/path/to/resource1
file:/usr1/project1/documentation
file:/usr1/project2/docs
Be careful with the slashes after the file protocol. You might need to escape them depending on whether you are trying to access a fileshare or a filesystem (more information here).
- Modify the file conf/regex-urlfilter.txt so that the file protocol is not excluded. Change:
# The negative sign at the start of the line indicates to
# exclude all entries that match the regexp that follows
-^(file|ftp|mailto):
to
# We do want to accept all entries with a file protocol since we are not
# only interested in remote HTML documents but also files in the
# local filesystem
#-^(file|ftp|mailto):
We will also have to include specific regexps to accept the resources stated in the seeds file, otherwise they will be ignored. Change:
# Regexp that accepts everything else
+.
to
# We only want the resources we are interested in
#+.
+^file://home/myusername/path/to/resource1
+^file://usr1/project1/documentation
+^file://usr1/project2/docs
- Modify the file conf/nutch-site.xml to give the crawler a name and to enable the plugins needed to fetch and parse local files:
<property>
  <name>http.agent.name</name>
  <value>KMS Nutch</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
  <description>List of plugins to be used by Nutch</description>
</property>
<property>
  <name>file.content.ignored</name>
  <value>false</value>
  <description>We want to store the content of the files in the database (to be indexed later)</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes.</description>
</property>
<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>This avoids crawling parent directories</description>
</property>
Once the setup is done, the crawl is launched with the command:
> cd /path/to/apache-nutch-1.7
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
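Before moving to Solr, it is worth checking that the crawl actually fetched something. A minimal sanity check, assuming the crawl directory used in the command above, is to list the generated segments and print the crawl database statistics:
> cd /path/to/apache-nutch-1.7
> # Each crawl round creates a timestamped segment under crawl/segments
> ls crawl/segments
> # Summarize the crawl database (fetched, unfetched and failed URLs)
> bin/nutch readdb crawl/crawldb -stats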
SOLR: CONFIGURATION & PUSHING DATA
Once the package has been downloaded, uncompress the file and go to the Solr directory that has been created. We will work with the example provided as a baseline.
> cd /path/to/solr-4.4.0/example
> java -jar start.jar
This will start the Solr server. We can now go to a browser and open the local Solr admin URL to check that the server is indeed running:
http://localhost:8983/solr/admin/
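If you prefer the command line, the same check can be done through Solr's CoreAdmin API; it should list the collection1 core that ships with the example:
> # Ask Solr for the status of all cores, in JSON format
> curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"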
In order to make things work, we need to change some files to tell Solr the format of the documents it will receive (the one created by Nutch):
- Change solr-4.4.0/example/solr/collection1/conf/solrconfig.xml to add a request handler for Nutch:
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
<field name="digest" type="text_general" stored="true" indexed="true"/> <field name="boost" type="text_general" stored="true" indexed="true"/> <field name="segment" type="text_general" stored="true" indexed="true"/> <field name="host" type="text_general" stored="true" indexed="true"/> <field name="site" type="text_general" stored="true" indexed="true"/> <field name="content" type="text_general" stored="true" indexed="true"/> <field name="tstamp" type="text_general" stored="true" indexed="false"/> <field name="url" type="string" stored="true" indexed="true"/> <field name="anchor" type="text_general" stored="true" indexed="false" multiValued="true"/>
Make sure that the stored value for content is set to true (so that later we will be able to search the content, not just the title).
Once this setup has been done, we are ready to push the data we fetched with Nutch into Solr. We do this by issuing the following commands:
> cd /path/to/apache-nutch-1.7
> # Pick the last segment that has been processed
> export SEGMENT=`ls -tr crawl/segments | tail -1`
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/segments/$SEGMENT
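Before moving to the browser, we can quickly verify that documents actually reached the index. Assuming the default collection1 core of the example, a match-all query over HTTP should return the crawled documents:
> # Return the first 5 indexed documents, in JSON format
> curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&wt=json"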
Now the data is ready to be used in Solr. We now move to the browser and type the following URL (we might need to restart Solr):
http://localhost:8983/solr/admin/
and we select the core that corresponds to the baseline example we were using (collection1). This core has an option to perform a Query. We can then query Solr (in the q field) with any string we want. If everything went well, we should see the search results in different formats (XML, CSV, ...). Unfortunately, this example does not provide a pretty HTML front-end to show the results. That might be the subject of a future blog entry...
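As an alternative to the admin UI, and assuming the /nutch request handler added to solrconfig.xml earlier, queries can also be sent directly to that handler from the command line (myquery below is just a placeholder for any search term):
> # Query through the /nutch handler, which boosts title and anchor matches and enables highlighting
> curl "http://localhost:8983/solr/collection1/nutch?q=myquery&wt=xml"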