Wednesday, July 18, 2012

NetBeans SVN Subversion Error 500

There seems to be an issue with the NetBeans SVN client related to an old cached username/password, which surfaces when the password is changed on the server. In essence, the IDE fails many Subversion operations with:
org.tigris.subversion.javahl.ClientException: OPTIONS of 500 Internal Server Error

The solution to the problem is to restart the IDE with

./netbeans -J-DsvnClientAdapterFactory=commandline

This should force a dialog to pop up and prompt for the new username/password.
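If restarting with the commandline adapter still does not trigger the prompt, another option is to clear the on-disk credential cache used by the Subversion command-line client (a sketch; ~/.subversion/auth/svn.simple is the default cache location on Linux/macOS and may differ on your setup):

```shell
# Delete cached Subversion username/password entries; the next svn
# operation that needs authentication will prompt again.
rm -rf ~/.subversion/auth/svn.simple
```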

Thursday, July 12, 2012

Hadoop vs Google

An interesting article, "Why the days are numbered for Hadoop as we know it", illustrates how Google has been the main driver behind the Hadoop platform and how Hadoop may fall behind if it does not keep up with the progress of Google's stack.

Wednesday, July 4, 2012

KMeans Clustering Using Apache Mahout

This post summarizes the steps necessary to cluster a set of documents (using the Reuters dataset as an example) and shows how to obtain a document-to-cluster mapping.

Step 1 : Convert the input text to sequences

The first step is to convert the input documents into sequence-file form using seqdirectory. This is a necessary step before KMeans clustering can be performed:

$mahout seqdirectory -i ~/Downloads/reuters21578/parsedtext -o ~/Downloads/reuters21578/parsedtext-seqdir -c UTF-8 -chunk 5

Step 2:  Convert the generated sequences to sparse vectors using seq2sparse

The sequence files must then be converted to sparse vectors, which serve as the input to the clustering process.

$mahout seq2sparse -i ~/Downloads/reuters21578/parsedtext-seqdir -o ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

Step 3: Run the KMeans clustering

We now run the KMeans clustering. Please refer to the Mahout documentation for a description of the KMeans parameters.

$mahout kmeans -i ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/tfidf-vectors/ -c ~/Downloads/reuters21578/parsedtext-kmeans-clusters -o ~/Downloads/reuters21578/parsedtext-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering -cl

Step 4: Dump the results to file

To view the clustering results, the output needs to be dumped using the clusterdump utility. Note that -o specifies the output file of the dump.

$mahout clusterdump -s ~/Downloads/reuters21578/parsedtext-kmeans/clusters-*-final -d ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints -o ~/cluster-output.txt

Step 5: Dump the Document to Cluster mapping

The document-to-cluster mapping can be generated using the seqdumper utility. In the dump, the cluster ID appears as "Key: <clusterid>" and the document filename appears as "Value: <documentid>". The output file also contains other fields, which we will crop out in Step 6.

$mahout seqdumper -s ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints/part-m-00000 > cluster-points.txt
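Before writing a full script, you can eyeball the relevant lines of the dump (the Key:/vec: lines that the Step 6 script matches):

```shell
# Preview the first few mapping lines in the seqdumper output
grep 'Key:' cluster-points.txt | head -n 5
```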

Step 6: Extract the Document to Cluster mapping

The following Perl script extracts the cluster ID and document ID from the dump file:

# Script to extract the doc-cluster mapping from the Mahout seqdumper output
# Usage: perl extract-mapping.pl <dump-file> <output-file>

open(FILEIN, "< $ARGV[0]") or die "Cannot open $ARGV[0]: $!";
open(FILEOUT, "> $ARGV[1]") or die "Cannot open $ARGV[1]: $!";

while(<FILEIN>){
 # Each clustered point appears as: Key: <clusterid>: ... vec: /<documentid> = ...
 if($_ =~ /Key: (.+?): (.+?) vec: \/(.+?) =/){
  print FILEOUT "$1\t$3\n";
 }
}

close FILEIN;
close FILEOUT;
print "Done\n";

To run the script, simply type on the command line (assuming the script above is saved as extract-mapping.pl):

$perl extract-mapping.pl cluster-points.txt cluster-mapping.txt

The cluster-mapping.txt file should now contain a tab-delimited document-to-cluster mapping.
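As a quick sanity check on the result, the tab-delimited file can be summarized to see how many documents landed in each cluster (a shell sketch, assuming the cluster-mapping.txt produced above, with the cluster ID in the first column):

```shell
# Count documents per cluster, largest clusters first
cut -f1 cluster-mapping.txt | sort | uniq -c | sort -rn
```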