Prograstinator
Tips and tricks for data analysts and scientists struggling with programming

If you are a "Vim"-er and want to leverage all the might you have in Vim when working on the command line (to navigate, replace, edit and so on), just use vi key bindings!
To do that, make sure your ~/.bash_profile (or ~/.bashrc) file contains the following line:
set -o vi
And you are all set! Happy "viming"!
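For instance, a minimal sketch (assuming bash, and that your interactive shells read ~/.bashrc):

# Persist vi editing mode and check that it is active
echo 'set -o vi' >> ~/.bashrc
source ~/.bashrc
set -o | grep -w vi   # should report the "vi" option as "on"
# Now press Esc, then k/j to walk through the history, or / to search it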
Saturday, September 27, 2014
Nutch + Solr for a local filesystem search engine
THE PROBLEM
Suppose you work in an organization that has different software projects, each with its own documentation system (one is a C++ project documented with e.g. Doxygen, another one is a Python project documented with Sphinx, etc.). Some of this documentation is not even software-related: just a set of web pages that describe a processing campaign, or a set of PDF documents with system requirements or architecture definitions. In order to have a centralized point that indexes all this information, you might want a search tool that acts as a Knowledge Management System (KMS).
Additionally, most of this documentation might be proprietary and not publicly published on the web, which makes Google unsuitable for this task.
THE SOLUTION
A possible solution for this problem is building a system based on Nutch (a web crawler and indexer) and Solr (a search platform).
SOFTWARE
To implement this solution, I've used the following components (a bit outdated at the time of publishing this tutorial due to the old Java version I am using):
- Operating system Mac OS X 10.6.8
- Apache Nutch (version 1.7.0)
- Apache Solr (version 4.4.0)
- Java, version "1.6.0_51"
NUTCH: CONFIGURATION & CRAWLING
We are using Nutch to crawl through all our content, parse it and build a database that will then be pushed to Solr (the actual search engine).
The steps I will summarize here are based on the instructions outlined here, here and here.
Once you have downloaded the Nutch tar (or zip) file, you will have to do the following three actions:
- Create a list of seeds. These seeds are used by Nutch as a starting point for the crawl. Nutch will start from those seeds and it will crawl through the links in these seeds, then the links of the links and then the links of the links of the links and so on (depending on the depth of the crawl, that can be configured). To create this list:
# Create a "urls" folder and create a file with the seeds > cd /path/to/apache-nutch-1.7 > mkdir urls > touch urls/seeds
In the seeds file, you can have all the URLs or paths to the resources you intend to include in your KMS, for instance:
file:/home/myusername/path/to/resource1
file:/usr1/project1/documentation
file:/usr1/project2/docs
Be careful with the slashes after the file protocol: you might need a different number of them depending on whether you are trying to access a file share or the local filesystem (more information here).
- Edit the URL filters in conf/regex-urlfilter.txt. By default the file protocol is excluded, so change:

# The negative sign at the start of the line indicates to
# exclude all entries that match the regexp that follows
-^(file|ftp|mailto):

to

# We do want to accept all entries with a file protocol since we are not
# only interested in remote HTML documents but also files in the
# local filesystem
#-^(file|ftp|mailto):

and we will have to include specific regexps to accept the resources stated in the seeds file, otherwise they will be ignored. Change:

# Regexp that accepts everything else
+.

to

# We want to limit the crawl to the resources we are interested in
#+.
+^file://home/myusername/path/to/resource1
+^file://usr1/project1/documentation
+^file://usr1/project2/docs
- Edit conf/nutch-site.xml to configure the crawler (HTTP agent name, plugins and file handling) by adding the following properties:

<property>
  <name>http.agent.name</name>
  <value>KMS Nutch</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
  <description>List of plugins to be used by Nutch</description>
</property>
<property>
  <name>file.content.ignored</name>
  <value>false</value>
  <description>We want to store the content of the files in the database (to be indexed later)</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes.</description>
</property>
<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>This avoids crawling parent directories</description>
</property>

Once the setup is done, the crawl is launched with the command:
> cd /path/to/apache-nutch-1.7
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
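To sanity-check the crawl before pushing anything to Solr, you can dump some statistics from the crawl database (a quick check, assuming the "crawl" directory created by the command above):

> # Print the number of fetched/unfetched URLs in the crawl database
> bin/nutch readdb crawl/crawldb -stats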
SOLR: CONFIGURATION & PUSHING DATA
Once the package has been downloaded, uncompress the file and go to the Solr directory that has been created. We will be working with the example that is provided as a baseline.
> cd /path/to/solr-4.4.0/example
> java -jar start.jar
This will fire up the Solr server. We can now go to a browser and open the local admin site to check that the server is indeed running:
http://localhost:8983/solr/admin/
In order to make things work, we need to change some files to tell Solr the format of the data it will be receiving (the one created by Nutch):
- Change solr-4.4.0/example/solr/collection1/conf/solrconfig.xml to add a request handler for Nutch:
<requestHandler name="/nutch" class="solr.SearchHandler" >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
- Change solr-4.4.0/example/solr/collection1/conf/schema.xml to add the fields that Nutch produces:

<field name="digest" type="text_general" stored="true" indexed="true"/>
<field name="boost" type="text_general" stored="true" indexed="true"/>
<field name="segment" type="text_general" stored="true" indexed="true"/>
<field name="host" type="text_general" stored="true" indexed="true"/>
<field name="site" type="text_general" stored="true" indexed="true"/>
<field name="content" type="text_general" stored="true" indexed="true"/>
<field name="tstamp" type="text_general" stored="true" indexed="false"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="anchor" type="text_general" stored="true" indexed="false" multiValued="true"/>
Make sure that the stored attribute for content is set to true (so that later we will be able to search and display the content, not just the title).
Once this setup has been done, we are ready to push the data we fetched with Nutch into Solr. We do this step by issuing the following commands:
> cd /path/to/apache-nutch-1.7
> # Pick the last segment that has been processed
> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb $SEGMENT
Now the data is ready to be used in Solr. We move back to the browser and type the following URL (we might need to restart Solr):
http://localhost:8983/solr/admin/
and we select the core that corresponds to the baseline example we were using (collection1). This core has an option to perform a Query. We can then try to query Solr (in the q field) with any string we want. If everything went well, we should be able to see the search results in different formats (XML, CSV, ...). Unfortunately this example does not provide a pretty HTML front-end to show the results. This might be the subject of a future blog entry...
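The same kind of query can also be issued from the command line against Solr's standard select handler; a quick sketch, assuming the collection1 core used above and an arbitrary search term:

> # Search the indexed content and return matching URLs and titles as JSON
> curl "http://localhost:8983/solr/collection1/select?q=content:requirements&fl=url,title,score&wt=json&indent=true"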
Friday, September 19, 2014
Tip #3 [sed]: Group capturing
Suppose you have the following text file (let's name it file.txt)
date:2014-01-01 value [10cm]
date:2014-01-02 value [11cm]
date:2014-01-03 value [15cm]
date:2014-01-04 value [19cm]
and you want to extract the date and the numeric value, like so
2014-01-01 10
2014-01-02 11
2014-01-03 15
2014-01-04 19
You can easily achieve this in the command line using group capturing with sed:
sed -r 's/.*:(....-..-..) value \[(.*)cm\]/\1 \2/g' file.txt
sed captures the first and second group (defined by the use of parentheses) and prints them out with the identifiers \1 and \2.
The '-r' option enables extended regular expressions. However, this option is specific to GNU sed, as found on most Linux distros. If you are using Mac OS X (BSD sed), you will probably have to use the option '-E' instead. If in doubt, check the manual by typing man sed.
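If the same command has to run on both, a small portability sketch (assuming your sed accepts an empty script, which both GNU and BSD sed do):

# Prefer -E, fall back to -r for older GNU sed
if sed -E '' </dev/null >/dev/null 2>&1; then
    SED_EXT=-E
else
    SED_EXT=-r
fi
sed $SED_EXT 's/.*:(....-..-..) value \[(.*)cm\]/\1 \2/g' file.txt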
You can have more information on group capturing with regular expressions here.
Thursday, September 18, 2014
Tip #2 [gnuplot]: Use of macros for repetitive commands
When you have a gnuplot script that uses the same command over and over again, use macros. The nice thing is that you can then modify the variables used inside the macro (e.g. the data file) within the script, without having to change the macro itself.
Suppose you want to plot the columns 1 and 3 of three different data files.
In your gnuplot script include:
set macro
MACRO = "datafile using 1:3"
Then you can modify the value of datafile within the script
datafile="file1.txt"
plot @MACRO
pause(-1)
datafile="file2.txt"
plot @MACRO
pause(-1)
datafile="file3.txt"
plot @MACRO
pause(-1)
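As a variation, the value of datafile can also be set from the shell with gnuplot's -e option, so the script itself never needs editing. A minimal sketch, assuming the plot and pause commands above are saved in a file called plot_cols.gp (a hypothetical name) without the datafile assignments:

for f in file1.txt file2.txt file3.txt
do
    # -e runs these commands before the script is loaded
    gnuplot -e "set macro; MACRO='datafile using 1:3'; datafile='$f'" plot_cols.gp
done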
More extensive and complex examples of the usage of gnuplot macros can be found here.
Tip #1 [ftp]: Use of macros to automate file uploads
Let's suppose you have a directory structure of this type
current working directory
|
+--- 2000 --- 2000_file1.txt
|             2000_file2.txt
|             2000_file3.txt
|             .
|             .
|             .
+--- 2001 --- 2001_file1.txt
|             2001_file2.txt
|             2001_file3.txt
|             .
|             .
|             .
|     .
|     .
|     .
+--- 2014 --- 2014_file1.txt
              2014_file2.txt
              2014_file3.txt
              .
              .
              .
Let's assume now that you have to upload all these files to an ftp server (ftp.site.com) that has the following directory structure in the home directory when you log in (with username "myusername" and password "1234"):
user_remote_home_directory
|
+--- 2000
|
+--- 2001
|
|     .
|     .
|     .
|
+--- 2014
If the number of files (or directories) is small, you can do it manually with the put or mput commands, but as soon as the number of files increases, this approach is no longer an option: you might not have the time to do it this way. In those cases, use macros.
We will be using Linux bash for this task. The idea is to iterate over the local folders, opening and closing an ftp session in each directory. In order to avoid having to log in each time (you want to automate the login as well), create the file .netrc in your home directory (ftp will use it):
> cd ~
> touch .netrc
and add the following line (note that .netrc uses the machine/login/password keywords):

machine ftp.site.com login myusername password 1234

This line enables automatic login on ftp.site.com with your account, so ftp will not prompt you for credentials every time you open a session (you may also need to make the file private with chmod 600 ~/.netrc, since some ftp clients refuse to use a .netrc that contains passwords and is readable by others). Then, add the following lines to the .netrc file:
macdef siteput
prompt
binary
hash
cd $1
mput *
quit
The macdef statement is used to define an FTP macro (in this case we have named it siteput). This macro does the following:
- prompt: turns off interactive prompting. Since we are using mput, we do not want to confirm the upload of every single file (that is the whole point of this, right?)
- binary: enforces binary transfers (instead of ASCII)
- hash: prints a hash mark for each block transferred, giving some sort of upload progress
- cd $1: changes to the remote folder given by the first argument passed to the macro (see below)
- mput *: uploads all the files in the current local folder
- quit: closes the ftp session
A macro defined in .netrc is invoked by piping the macro call (the ftp "$" command followed by the macro name and its arguments) into ftp:

echo "$ macro_name arg1 arg2 ... argN" | ftp ftp.site.com

In our particular case, we will loop over all our local folders and call the macro with the corresponding remote folder as its argument:
for year in `seq 2000 2014`
do
# Informative printout
echo $year
# We now call the macro for the current folder/year using
# a sub-shell (note the use of parentheses) to avoid
# the need to execute "cd .." after the command has been
# launched (i.e. the "cd $year" affects the sub-shell, not
# the main shell where this loop takes place).
( cd $year; echo "$ siteput $year" | ftp ftp.site.com )
done
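Once the loop has finished, you can quickly verify that the files actually landed on the server by listing each remote folder, reusing the .netrc autologin (a small sketch):

for year in `seq 2000 2014`
do
    # Open a session, list the remote year folder and quit
    printf 'cd %s\nls\nquit\n' "$year" | ftp ftp.site.com
done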
Wednesday, December 1, 2010
IDL documentation with Doxygen
If you are an Interactive Data Language (IDL) programmer and have to write documentation for your routines, you can always use the built-in IDL routine DOC_LIBRARY. As far as I know, its output is basically a text file with the information of the functions that have a specially formatted header (the ';+' and ';-' strings). However, it does not produce documents the way other applications such as Doxygen do. In particular, Doxygen has many features that can be beneficial:
- Hyperlinked references between methods
- Grouping of methods
- Capability to describe the mathematical operations done by the function using LaTeX.
- HTML or hyperlinked PDF output
The problem with Doxygen is that it does not support IDL. Well, yes, it does support IDL, but the "Interface Description Language", not the "Interactive Data Language" one. In order to support the ITTVIS IDL, I've prepared a Ruby script that parses the IDL ".pro" files and converts them into something that Doxygen can understand. I've also attached a sample configuration file for Doxygen that you can use as a baseline for your projects, and a sample code for testing. You will find the code package in the following github repository (yes, you need git to download the package):
http://github.com/Tryskele/idlDoxygen
Feel free to "pull" the repository and contribute to the improvement of the script if you want!
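For reference, this kind of converter is hooked into Doxygen through an input filter in the Doxyfile. A minimal sketch, assuming the parser script is called idl2doxy.rb (a hypothetical name, adjust it to the actual script in the repository) and that your sources are ".pro" files:

# Append the filter settings to an existing Doxyfile and run Doxygen
cat >> Doxyfile <<'EOF'
FILE_PATTERNS = *.pro
INPUT_FILTER  = "ruby idl2doxy.rb"
EOF
doxygen Doxyfile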
Labels: documentation, doxygen, IDL, Interactive Data Language
Friday, June 18, 2010
[Trick] How to invert window colors
Sometimes the black foreground on a white background can be tiring for your eyes. This is especially true if you have to spend a lot of time in front of computers. On Mac and Linux there is a simple trick to invert the color scheme:
- On Mac (inverts the colors of the whole desktop): press Ctrl+Alt+Cmd+8 simultaneously
- On Linux (inverts the colors of the active window): press Super+N simultaneously (where Super is the key with the Windows logo)