## Monday, November 19, 2012

### Place figures and tables correctly in LaTeX

One way to force LaTeX to place your figures and tables exactly where you want them is the "float" package. Load the package in the preamble, then pass the "H" placement specifier in square brackets whenever you add a figure or table. The package usage is as follows:

    \usepackage{float}

    \begin{figure}[H]
    .. Attributes
    \end{figure}

The same applies to tables. Besides placing tables and figures at the desired location, the float package is text-friendly: it behaves well when you have text between figures and tables.

## Wednesday, July 18, 2012

### NetBeans SVN Subversion Error 500

There seems to be an issue with the NetBeans SVN client: it keeps using an old cached username/password, and the problem shows up once the credentials are changed. In essence, the IDE fails on many Subversion operations with the following error:

org.tigris.subversion.javahl.clientexception: OPTIONS of 500 Internal Server Error

The solution to the problem is to restart the IDE with

./netbeans -J-DsvnClientAdapterFactory=commandline

## Thursday, July 12, 2012

An interesting article discusses "Why the days are numbered for Hadoop as we know it". It illustrates how Google has been the main driver behind the Hadoop platform and how Hadoop may fail if it does not catch up with the progress of Google's stack.

## Wednesday, July 4, 2012

### KMeans Clustering Using Apache Mahout

This post summarizes the steps necessary to cluster a set of documents (the Reuters dataset as an example) and shows how to obtain a document-to-cluster mapping.

Step 1: Convert the input text to sequences

The first step is to convert the input documents into sequence files using seqdirectory. This step is required before KMeans clustering can be performed:

$mahout seqdirectory -i ~/Downloads/reuters21578/parsedtext -o ~/Downloads/reuters21578/parsedtext-seqdir -c UTF-8 -chunk 5

Step 2: Convert the generated sequences to sparse vectors using seq2sparse

The clustering process takes vectors as input, so they must first be generated from the sequence files:

$mahout seq2sparse -i ~/Downloads/reuters21578/parsedtext-seqdir -o ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

Step 3: Run the KMeans clustering

We now run the KMeans clustering. Please refer to the Mahout documentation for a description of the KMeans parameters.

$mahout kmeans -i ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/tfidf-vectors/ -c ~/Downloads/reuters21578/parsedtext-kmeans-clusters -o ~/Downloads/reuters21578/parsedtext-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering -cl

Step 4: Dump the results to file

To view the clustering results, they need to be dumped with a special program called clusterdump. Note that -o specifies the output file of the dump.

$mahout clusterdump -s ~/Downloads/reuters21578/parsedtext-kmeans/clusters-*-final -d ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints -o ~/cluster-output.txt
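Steps 1 to 4 can also be collected into a small driver. The following is only a rough Python sketch, not part of the original workflow: the function names `build_steps` and `run_steps` are made up for illustration, and it assumes that `mahout` is on the PATH. The flags are copied unchanged from the commands above.

```python
import subprocess

# Distance measure used by both kmeans and clusterdump in this post.
DISTANCE = "org.apache.mahout.common.distance.CosineDistanceMeasure"


def build_steps(base="~/Downloads/reuters21578", dump_file="~/cluster-output.txt"):
    """Return the four mahout command lines of Steps 1-4 as shell strings."""
    seqdir = f"{base}/parsedtext-seqdir"
    sparse = f"{base}/parsedtext-seqdir-sparse-kmeans"
    kmeans = f"{base}/parsedtext-kmeans"
    return [
        f"mahout seqdirectory -i {base}/parsedtext -o {seqdir} -c UTF-8 -chunk 5",
        f"mahout seq2sparse -i {seqdir} -o {sparse} --maxDFPercent 85 --namedVector",
        f"mahout kmeans -i {sparse}/tfidf-vectors/ -c {base}/parsedtext-kmeans-clusters "
        f"-o {kmeans} -dm {DISTANCE} -x 10 -k 20 -ow --clustering -cl",
        f"mahout clusterdump -s {kmeans}/clusters-*-final -d {sparse}/dictionary.file-0 "
        f"-dt sequencefile -b 100 -n 20 --evaluate -dm {DISTANCE} "
        f"--pointsDir {kmeans}/clusteredPoints -o {dump_file}",
    ]


def run_steps(steps):
    """Run each step in order and stop at the first failure."""
    for step in steps:
        # shell=True lets the shell expand ~ and the clusters-*-final glob.
        subprocess.run(step, shell=True, check=True)


if __name__ == "__main__":
    # Dry run: only print the commands; call run_steps() to execute them.
    for step in build_steps():
        print(step)
```

Printing the commands first makes it easy to check the paths before a long-running job is started.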

Step 5: Dump the Document to Cluster mapping

The cluster mapping can be generated by using the seqdumper program. The ID of the cluster is specified by "Key: <clusterid>" in the file and the filename is specified with "value: <documentid>". The output file will contain other fields which we will crop out in Step 6.

$mahout seqdumper -s ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints/part-m-00000 > cluster-points.txt

Step 6: Extract the Document to Cluster mapping

The following Perl script extracts the cluster-id and document-id from the dump file:

    #!/usr/bin/perl
    # Script to extract the doc-cluster mapping from Mahout output
    use strict;
    use warnings;

    # First argument: seqdumper output; second argument: mapping file to write
    open(my $in,  '<', $ARGV[0]) or die "Cannot read $ARGV[0]: $!";
    open(my $out, '>', $ARGV[1]) or die "Cannot write $ARGV[1]: $!";

    while (<$in>) {
        chomp;
        # $1 is the cluster id, $3 is the document name
        if ($_ =~ /Key: (.+?): (.+?) vec: \/(.+?) =/) {
            print $out "$1\t$3\n";
        }
    }

    close $in;
    close $out;
    print "Done\n";
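For readers who prefer Python, the same extraction might be sketched as below. This is an illustration under assumptions rather than a tested replacement: the function name `extract_mapping` is made up, the regular expression is copied from the Perl script, and the sample line is hypothetical, shaped only so that it matches that pattern.

```python
import re

# Same pattern as the Perl script: group 1 is the cluster id,
# group 3 is the document name.
PATTERN = re.compile(r"Key: (.+?): (.+?) vec: /(.+?) =")


def extract_mapping(lines):
    """Yield (cluster_id, document_id) pairs from seqdumper output lines."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            yield match.group(1), match.group(3)


if __name__ == "__main__":
    # Hypothetical sample line; real seqdumper output may look different.
    sample = ["Key: 42: Value: wt: 1.0 vec: /reut2-000.sgm-0.txt = [...]"]
    for cluster_id, document_id in extract_mapping(sample):
        print(f"{cluster_id}\t{document_id}")
```

Lines that do not match the pattern, such as header lines in the dump, are skipped silently, just as in the Perl version.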



To run the Perl script, simply type on the command line:

$perl parseDump.pl cluster-points.txt cluster-mapping.txt

The cluster-mapping.txt file should now contain a tab-delimited document-to-cluster mapping.

## Thursday, June 7, 2012

### Terry Moore: Why is 'x' the unknown? - YouTube

One of the many proofs that Arabs deserve credit for many of the technological advances we know today: we still build on the foundations they laid long ago. Watch this very interesting video by Terry Moore on why 'x' is the unknown.

### How to set the default dictionary in Firefox

When you have multiple dictionaries installed, you most likely won't have a lot of control over which one is the default. To resolve this, set the default dictionary language in the Firefox configuration screen:

1. Type about:config in the URL bar
2. Search for spellchecker.dictionary
3. If you want English to be your default dictionary, set it to en_US

## Tuesday, June 5, 2012

### Barton Library Dataset

If you are interested in downloading the Barton Libraries Dataset for benchmarking, you can use the following link. It took me some time to find it since the default link was not working.

Download the dataset from the Internet Archive

## Friday, June 1, 2012

### View All Running Processes on Mac OS X

Linux users will notice that the ps command does not behave the same under Mac OS X as it does under Linux. The Mac OS X version of ps takes different parameters, for example to list all running processes. On Linux the command to use is

$ps -aux

Under Mac OS X the command to use is

$ps -eax
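If a script has to work on both systems, the choice between the two invocations can be made at run time. The snippet below is a minimal Python sketch under that assumption; `ps_command` and `list_processes` are hypothetical helper names, and the flags are simply the ones given in this post.

```python
import platform
import subprocess


def ps_command(system=None):
    """Return the ps invocation from this post for the given OS name."""
    # platform.system() reports "Darwin" on Mac OS X and "Linux" on Linux.
    system = system or platform.system()
    if system == "Darwin":
        return ["ps", "-eax"]
    return ["ps", "-aux"]


def list_processes():
    """Run ps for the current OS and return its output lines."""
    result = subprocess.run(ps_command(), capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


if __name__ == "__main__":
    print(" ".join(ps_command()))
```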