Saturday, September 27, 2014

Tip #4 [bash]: Vim key bindings

If you are a "Vim"er and want to leverage all the power of Vim when working on the command line (to navigate, replace, edit and so on), just use Vim key bindings!

To do that, make sure your ~/.bash_profile (or ~/.bashrc) file contains the following line (bash calls this editing mode vi):

set -o vi

And you are all set! Happy "viming"!
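
A minimal sketch of how the line could be added without duplicating it when the setup is run more than once. The helper name ensure_line is hypothetical (it is not a bash builtin), and the demo writes to a temporary file instead of a real profile:

```shell
# ensure_line LINE FILE: append LINE to FILE unless an identical line exists.
# (hypothetical helper name, used only for this illustration)
ensure_line() {
    touch "$2"
    grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Demo on a throwaway file; point it at ~/.bashrc to use it for real.
rcfile=$(mktemp)
ensure_line 'set -o vi' "$rcfile"
ensure_line 'set -o vi' "$rcfile"   # the second call changes nothing
cat "$rcfile"
```

After opening a new shell, set -o | grep vi should list the vi option as on.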

Nutch + Solr for a local filesystem search engine

THE PROBLEM


Suppose you work in an organization that has several software projects, each with a different documentation system (one is a C++ project documented with e.g. Doxygen, another one is a Python project documented with Sphinx, etc.). Some of this documentation is not even software-related: it may be a set of web pages that describe a processing campaign, or a set of PDF documents with system requirements or architecture definitions. To have a centralized point that indexes all this information, you might want a search tool that acts as a Knowledge Management System (KMS).

Additionally, most of this documentation might be proprietary and not published on the web, which makes Google inappropriate for this task.

THE SOLUTION


A possible solution for this problem is building a system based on Nutch (a web crawler and indexer) and Solr (a search platform).


SOFTWARE


To implement this solution, I've used Apache Nutch 1.7 and Apache Solr 4.4.0, the versions that appear in the paths below (a bit outdated at the time of publishing this tutorial due to the old Java version I am using).


NUTCH: CONFIGURATION & CRAWLING


We are using Nutch to crawl through all our content, parse it and build a database that will then be pushed to Solr (the actual search engine).

The steps I will summarize here are based on the instructions outlined here, here and here.

Once you have downloaded the Nutch tar (or zip) file, you will have to do the following three actions:
  • Create a list of seeds. These seeds are used by Nutch as a starting point for the crawl. Nutch will start from those seeds and it will crawl through the links in these seeds, then the links of the links and then the links of the links of the links and so on (depending on the depth of the crawl, that can be configured). To create this list:

    # Create a "urls" folder and create a file with the seeds
    > cd /path/to/apache-nutch-1.7
    > mkdir urls
    > touch urls/seeds
    

    In the seeds file, list all the URLs or paths to the resources you intend to include in your KMS, for instance:

    file:/home/myusername/path/to/resource1
    file:/usr1/project1/documentation
    file:/usr1/project2/docs
    

    Be careful with the slashes after the file protocol. You might need to escape them depending on whether you are trying to access a fileshare or a filesystem (more information here)

  • Set up a resource filter. To control which addresses to fetch (and which ones to skip), as well as the file types to include in or exclude from the crawl, Nutch provides the file regex-urlfilter.txt in its configuration folder (apache-nutch-1.7/conf). This file contains the regular expression rules that define the resource filter. In our case, we will have to comment out the line that skips the file protocol:

    # The negative sign at the start of the line indicates to 
    # exclude all entries that match the regexp that follows
    -^(file|ftp|mailto):
    

    to
    # We do want to accept all entries with a file protocol since we are not
    # only interested in remote HTML documents but also files in the 
    # local filesystem
    #-^(file|ftp|mailto):
    

    and we will have to include a specific regexp to consider the resources stated in the seed, otherwise they will be ignored. Change:

    # Regexp that accepts everything else
    +.
    

    to
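    # We want to accept only the resources we are interested in.
    # "/+" tolerates one or more slashes after "file:", so the rules match
    # the seeds above however the URL is written
    #+.
    +^file:/+home/myusername/path/to/resource1
    +^file:/+usr1/project1/documentation
    +^file:/+usr1/project2/docs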
    

  • Define the behavior of Nutch. We will have to modify the file conf/nutch-site.xml to tell Nutch which plugins to include (in particular, we have to make sure that protocol-file is included so that Nutch is able to crawl through the local filesystem). In this file we will have to include the following lines between the <configuration></configuration> tags:

    <property>
        <name>http.agent.name</name>
        <value>KMS Nutch</value>
    </property>
    
    <property>
        <name>plugin.includes</name>
        <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
        <description>List of plugins to be used by Nutch</description>
    </property>
    
    <property>
        <name>file.content.ignored</name>
        <value>false</value>
        <description>We want to store the content of the files in the database (to be indexed later)</description>
    </property>
    
    <property>
        <name>file.content.limit</name>
        <value>65536</value>
        <description>The length limit for downloaded content using the file://
        protocol, in bytes.</description>
    </property>
    
    <property>
        <name>file.crawl.parent</name>
        <value>false</value>
        <description>This avoids crawling parent directories</description>
    </property>
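
Before launching the crawl, it can be worth checking that every seed survives the filter. The sketch below emulates the rule documented in the header of regex-urlfilter.txt (the first matching pattern decides, and a URL that matches no pattern is ignored). The function names url_allowed and check_seeds are hypothetical, and grep -E is only an approximation of the Java regular expressions that Nutch uses, so treat the result as a hint rather than a guarantee:

```shell
# url_allowed RULES_FILE URL: exit status 0 if URL would be accepted, 1 otherwise.
url_allowed() {
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;        # skip blank lines and comments
        esac
        regex=${rule#?}                 # the rule without its leading sign
        if printf '%s\n' "$2" | grep -Eq -e "$regex"; then
            case $rule in
                +*) return 0 ;;         # first match is an accept rule
                *)  return 1 ;;         # first match is a reject rule
            esac
        fi
    done < "$1"
    return 1                            # no rule matched: URL is ignored
}

# check_seeds SEEDS_FILE RULES_FILE: report the seeds the filter would drop.
check_seeds() {
    status=0
    while IFS= read -r seed; do
        [ -n "$seed" ] || continue
        if ! url_allowed "$2" "$seed"; then
            echo "rejected by filter: $seed"
            status=1
        fi
    done < "$1"
    return $status
}

# Demo with throwaway files: the second seed has no matching accept rule.
tmp=$(mktemp -d)
printf '%s\n' 'file:/usr1/project1/documentation' 'file:/usr1/project2/docs' > "$tmp/seeds"
cat > "$tmp/regex-urlfilter.txt" <<'EOF'
# accept only the first documentation tree
+^file:/+usr1/project1/documentation
EOF
check_seeds "$tmp/seeds" "$tmp/regex-urlfilter.txt" || echo "fix the filter before crawling"
```

Run against the real files, the call would be check_seeds urls/seeds conf/regex-urlfilter.txt from the Nutch folder.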
    
Once the setup is done, the crawl is launched with the command:

> cd /path/to/apache-nutch-1.7
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50

This will crawl, fetch and parse all the resources from the seed we have specified and build a database with this content. This content will be later pushed to Solr.

SOLR: CONFIGURATION & PUSHING DATA


Once the Solr package has been downloaded, uncompress the file and go to the Solr directory that has been created. We will use the example provided with the distribution as a baseline.

> cd /path/to/solr-4.4.0/example
> java -jar start.jar

This will fire up the Solr server. We can now open a browser and enter the following local URL to set up Solr and check that the server is indeed running:

http://localhost:8983/solr/admin/

In order to make things work, we need to change some files to tell Solr which format it will be using (the one created by Nutch):


  • Change solr-4.4.0/example/solr/collection1/conf/solrconfig.xml to add a request handler for Nutch:
    <requestHandler name="/nutch" class="solr.SearchHandler" >
        <lst name="defaults">
            <str name="defType">dismax</str>
            <str name="echoParams">explicit</str>
            <float name="tie">0.01</float>
            <str name="qf">
            content^0.5 anchor^1.0 title^1.2
            </str>
            <str name="pf">
            content^0.5 anchor^1.5 title^1.2 site^1.5
            </str>
            <str name="fl">
            url
            </str>
            <str name="mm">
            2&lt;-1 5&lt;-2 6&lt;90%
            </str>
            <int name="ps">100</int>
            <bool name="hl">true</bool>
            <str name="q.alt">*:*</str>
            <str name="hl.fl">title url content</str>
            <str name="f.title.hl.fragsize">0</str>
            <str name="f.title.hl.alternateField">title</str>
            <str name="f.url.hl.fragsize">0</str>
            <str name="f.url.hl.alternateField">url</str>
            <str name="f.content.hl.fragmenter">regex</str>
        </lst>
    </requestHandler>
    

  • We will also need to tell Solr the format of the database provided by Nutch. We do that by adding the description of the fields provided by Nutch to the schema file solr-4.4.0/example/solr/collection1/conf/schema.xml:

    <field name="digest"   type="text_general" stored="true" indexed="true"/>
    <field name="boost"    type="text_general" stored="true" indexed="true"/>
    <field name="segment"  type="text_general" stored="true" indexed="true"/>
    <field name="host"     type="text_general" stored="true" indexed="true"/>
    <field name="site"     type="text_general" stored="true" indexed="true"/>
    <field name="content"  type="text_general" stored="true" indexed="true"/>
    <field name="tstamp"   type="text_general" stored="true" indexed="false"/>
    <field name="url"      type="string"       stored="true" indexed="true"/>
    <field name="anchor"   type="text_general" stored="true" indexed="false" multiValued="true"/>
    

    Make sure that stored is set to true for content, so that Solr can return and highlight the content in the results (indexed="true" is what makes the content searchable besides the title).

Once this setup has been done, we are ready to push the data we fetched with Nutch into Solr. We do this step by issuing the following command:

> cd /path/to/apache-nutch-1.7
> # Pick the last segment that has been processed
> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> # SEGMENT already holds the full path of the segment
> # (to index every segment instead, pass crawl/segments/* in its place)
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb $SEGMENT
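
As an alternative to sorting by modification time, the newest segment can be picked by name. This is a sketch that assumes the segment folders are named after their creation timestamp (for example 20140927083000), so that the last name in sorted order is the most recent one. The function name latest_segment is hypothetical and the demo uses fake folders:

```shell
# latest_segment SEGMENTS_DIR: print the path of the newest segment folder.
latest_segment() {
    # List only directories, sort by name, keep the last one and
    # strip the trailing slash added by the "*/" pattern.
    ls -d "$1"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::'
}

# Demo with fake segment folders in a temporary crawl directory.
demo=$(mktemp -d)
mkdir -p "$demo/segments/20140925101500" "$demo/segments/20140927083000"
SEGMENT=$(latest_segment "$demo/segments")
echo "$SEGMENT"
```

In the Nutch folder the call would be SEGMENT=$(latest_segment crawl/segments).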

Now the data is ready to be used in Solr. We now go back to the browser and open the following URL (we might need to restart Solr):

http://localhost:8983/solr/admin/

and we select the core that corresponds to the baseline example that we were using (collection1). This core has an option to perform a Query. We can then try to query Solr (in the q field) with any string we want. If everything went well, we should be able to see the search results in different formats (XML, CSV, ...). Unfortunately, this example does not provide a pretty HTML front-end to show the results. This might be the subject of a future blog entry...
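
The same query can also be sent from the command line. The sketch below only builds the request URL, assuming the default select handler of the collection1 core and the standard wt and indent parameters. The function name solr_query_url is hypothetical, and it encodes spaces only, so queries with other special characters would need proper URL encoding:

```shell
# solr_query_url QUERY: print the URL that asks Solr for QUERY as JSON.
solr_query_url() {
    query=$(printf '%s' "$1" | sed 's/ /+/g')   # spaces become "+"
    printf '%s/select?q=%s&wt=json&indent=true\n' \
        "http://localhost:8983/solr/collection1" "$query"
}

solr_query_url 'system requirements'
```

With the server running, curl -s "$(solr_query_url 'system requirements')" should print the matching documents.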



Friday, September 19, 2014

Tip #3 [sed]: Group capturing

Suppose you have the following text file (let's name it file.txt)

date:2014-01-01 value [10cm]
date:2014-01-02 value [11cm]
date:2014-01-03 value [15cm]
date:2014-01-04 value [19cm]

and you want to extract the date and the numeric value, like so

2014-01-01 10
2014-01-02 11
2014-01-03 15
2014-01-04 19

You can easily achieve this in the command line using group capturing with sed:

sed -r 's/.*:(....-..-..) value \[(.*)cm\]/\1 \2/g' file.txt

sed captures the first and second group (defined by the use of parentheses) and prints them out with the identifiers \1 and \2.

The '-r' option enables extended regular expressions. It is specific to GNU sed, the version shipped with most Linux distros. If you are using Mac OS X, you will probably have to use the option '-E' instead (recent versions of GNU sed accept '-E' as well). If in doubt, check the help by typing man sed.

You can have more information on group capturing with regular expressions here.
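
A stricter variant of the same idea, as a sketch: the groups only accept digits, and the pattern is anchored at both ends, so lines in a different format pass through unchanged instead of being partially rewritten. The function name extract_date_value is hypothetical:

```shell
# extract_date_value: read lines on stdin, print "<date> <number>" for the
# lines that follow the "date:YYYY-MM-DD value [NNcm]" layout.
extract_date_value() {
    sed -E 's/^date:([0-9]{4}-[0-9]{2}-[0-9]{2}) value \[([0-9]+)cm\]$/\1 \2/'
}

printf '%s\n' 'date:2014-01-01 value [10cm]' 'date:2014-01-02 value [11cm]' | extract_date_value
# prints:
# 2014-01-01 10
# 2014-01-02 11
```

On the file from the example, the call would be extract_date_value < file.txt.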

Thursday, September 18, 2014

Tip #2 [gnuplot]: Use of macros for repetitive commands

When you have a gnuplot script that uses the same command over and over again, use macros. The nice thing is that you can then change the variables the macro refers to (here, datafile) anywhere in the script, without having to redefine the macro.

Suppose you want to plot the columns 1 and 3 of three different data files.

In your gnuplot script include:

set macro
MACRO = "datafile using 1:3"

Then you can modify the value of datafile within the script

datafile="file1.txt"
plot @MACRO
pause(-1)


datafile="file2.txt"

plot @MACRO
pause(-1)


datafile="file3.txt"

plot @MACRO
pause(-1)

More extensive and complex examples of the usage of gnuplot macros can be found here.
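
When the list of data files grows, the repeated stanza itself can be generated from the shell. This is only a sketch: the function name gen_plot_script is hypothetical, and it writes pause -1 in gnuplot's usual form without parentheses:

```shell
# gen_plot_script FILE...: print a gnuplot script that defines the macro
# once and then plots every FILE with it.
gen_plot_script() {
    printf 'set macros\nMACRO = "datafile using 1:3"\n'
    for f in "$@"; do
        printf 'datafile="%s"\nplot @MACRO\npause -1\n' "$f"
    done
}

gen_plot_script file1.txt file2.txt file3.txt
```

The output can be saved with gen_plot_script file1.txt file2.txt file3.txt > plots.gp and then run with gnuplot plots.gp.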

Tip #1 [ftp]: Use of macros to automate file uploads

Let's suppose you have a directory structure of this type

current working directory
    |
    +--- 2000 --- 2000_file1.txt
    |             2000_file2.txt
    |             2000_file3.txt
    |             .
    |             .
    |             .
    +--- 2001 --- 2001_file1.txt
    |             2001_file2.txt
    |             2001_file3.txt
    |             .
    |             .
    |             .
    |     .
    |     .
    |     .
    +--- 2014 --- 2014_file1.txt
                  2014_file2.txt
                  2014_file3.txt
                  .
                  .
                  .
                                                                    
Let's assume now that you have to upload all these files into an ftp server (ftp.site.com) that has the following directory structure on the home directory when you login (with username "myusername" and password "1234"):

user_remote_home_directory
    |
    +--- 2000 
    | 
    +--- 2001 
    |     .
    |     .
    |     .
    +--- 2014 


If the number of files (or directories) is small, you can do it manually with the put or mput commands, but as soon as the number of files increases, this approach is not an option anymore: you might not have the time to do it this way. In those cases, use macros.

We will be using Linux bash for this task. You will iterate over the local folders, opening and closing an ftp session each time you enter a directory. To avoid having to log in each time (you want to automate the login as well), create the file .netrc in your home directory (ftp will use it):

> cd ~
> touch .netrc
> # ftp will not use a .netrc that stores a password if other users can read it
> chmod 600 .netrc

and add the following line, which enables autologin

machine ftp.site.com login myusername password 1234

This line will make the autologin on ftp.site.com with your user account without FTP prompting you every time you open a session. Then, add the following lines to the .netrc file:

macdef siteput
prompt
binary
hash
cd $1
mput *
quit

The macdef statement is used to define an FTP macro (in this case we have named it siteput). The definition ends at the first blank line, so leave an empty line after quit. This macro does the following

  1. prompt: first deactivate the prompt. Since we are using mput, we do not want confirmation to upload every single file (that is the whole point of this, right?)
  2. binary: we are enforcing binary upload (instead of ASCII)
  3. hash:  we want some sort of upload progress, that is what hash is used for
  4. cd $1: change remote folder, specified by the argument passed to the macro (see below)
  5. mput *: Proceed to upload all the files of the local folder.
  6. quit: close ftp session
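
The pieces of the .netrc file described above can be written in one go. This is a sketch under stated assumptions: the function name write_netrc is hypothetical, the host and credentials are the placeholder values from this post, and the demo writes to a temporary file rather than to ~/.netrc:

```shell
# write_netrc FILE HOST USER PASSWORD: create FILE with an autologin entry
# and the siteput macro, readable by the owner only.
write_netrc() {
    (
        umask 077    # a new file is created without group/other access
        {
            printf 'machine %s login %s password %s\n' "$2" "$3" "$4"
            printf 'macdef siteput\nprompt\nbinary\nhash\ncd $1\nmput *\nquit\n'
            printf '\n'    # the blank line ends the macro definition
        } > "$1"
    )
    chmod 600 "$1"   # also tighten a file that already existed
}

# Demo on a throwaway file.
netrc=$(mktemp)
write_netrc "$netrc" ftp.site.com myusername 1234
cat "$netrc"
```

Note that the $1 inside the macro is written literally: it is expanded by ftp when the macro is called, not by the shell.
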
Now this macro would be called using a command like the following:

echo "$ macro_name arg1 arg2 ... argN" | ftp ftp.site.com

In our particular case, we will be looping over all our local folders and calling the macro for the corresponding remote folder

for year in `seq 2000 2014`
do
    # Informative printout
    echo $year

    # We now call the macro for the current folder/year using
    # a sub-shell (note the use of parentheses) to avoid 
    # the need of executing "cd .." after the command has been
    # launched (i.e. the "cd $year" affects the sub-shell, not
    # the main shell where this loop takes place).
    ( cd $year; echo "$ siteput $year" | ftp ftp.site.com )

done
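
Before uploading for real, the loop can be rehearsed without touching the server. In this sketch the ftp command is taken from a variable, so that a stand-in can be swapped in. The names upload_years, FTP_CMD and dry_run_ftp are hypothetical, and the demo runs in a temporary folder with two fake year directories:

```shell
# Command used to talk to the server; defaults to the real ftp client.
FTP_CMD=${FTP_CMD:-ftp}

# upload_years FIRST LAST: call the siteput macro once per existing folder.
upload_years() {
    for year in $(seq "$1" "$2"); do
        [ -d "$year" ] || { echo "skipping $year (no such folder)" >&2; continue; }
        ( cd "$year" && echo "\$ siteput $year" | $FTP_CMD ftp.site.com )
    done
}

# Stand-in for ftp: show the host and the command it would have received.
dry_run_ftp() { printf '[%s] <- %s\n' "$1" "$(cat)"; }

# Demo: 2002 does not exist and is skipped with a message on stderr.
workdir=$(mktemp -d)
mkdir "$workdir/2000" "$workdir/2001"
( cd "$workdir" && FTP_CMD=dry_run_ftp upload_years 2000 2002 )
# prints:
# [ftp.site.com] <- $ siteput 2000
# [ftp.site.com] <- $ siteput 2001
```

Once the rehearsal looks right, running upload_years 2000 2014 from the folder that contains the year directories uses the real ftp client.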