Monday, November 19, 2012

Place figures and tables correctly in LaTeX

One way to force LaTeX to place your figures and tables exactly where you want them is the "float" package. Load it in the preamble with \usepackage{float}, then pass the "H" placement specifier whenever you add a table or figure, for example \begin{figure}[H].
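
As a minimal sketch of the float package with the H specifier, a figure can be pinned in place like this. The image name, caption, and label are placeholders, and the graphicx package is assumed only so that \includegraphics is available.

```latex
\documentclass{article}
\usepackage{float}    % provides the H placement specifier
\usepackage{graphicx} % assumed here only for \includegraphics

\begin{document}

Some text before the figure.

\begin{figure}[H] % H: place the float exactly here
  \centering
  \includegraphics[width=0.6\textwidth]{example-image} % placeholder image name
  \caption{A placeholder caption.}
  \label{fig:example}
\end{figure}

Some text after the figure.

\end{document}
```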



The same approach applies to tables. Besides placing tables and figures at the desired location, the float package is also text friendly: it behaves well when you have text in between figures and tables.

Wednesday, July 18, 2012

NetBeans SVN Subversion Error 500

There seems to be an issue with the NetBeans SVN client: it keeps using an old cached username/password after the credentials have been changed. As a result, the IDE fails on many Subversion operations with an error like the following:
org.tigris.subversion.javahl.ClientException: OPTIONS of 500 Internal Server Error

The solution is to restart the IDE with the following option, which switches it to the command-line Subversion client:

./netbeans -J-DsvnClientAdapterFactory=commandline

This should force a dialog to pop up and ask for the username/password again.

Thursday, July 12, 2012

Hadoop vs Google

An interesting article discusses "Why the days are numbered for Hadoop as we know it". It illustrates how Google has been the main driver of the Hadoop platform and how Hadoop may fail if it does not catch up with the progress of Google's stack.

Wednesday, July 4, 2012

KMeans Clustering Using Apache Mahout

This post summarizes the steps necessary to cluster a set of documents (using the Reuters dataset as an example) and shows how to obtain a document-to-cluster mapping.

Step 1: Convert the input text to sequences

The first step is to convert the input documents into sequence files using seqdirectory. This is required before KMeans clustering can be performed:

$ mahout seqdirectory -i ~/Downloads/reuters21578/parsedtext -o ~/Downloads/reuters21578/parsedtext-seqdir -c UTF-8 -chunk 5

Step 2: Convert the generated sequences to sparse vectors using seq2sparse

The sequence files must be converted to vectors, which are then used as the input to the clustering process.

$ mahout seq2sparse -i ~/Downloads/reuters21578/parsedtext-seqdir -o ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

Step 3: Run the KMeans clustering

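We now run the KMeans clustering. In the command below, -dm sets the distance measure, -x the maximum number of iterations, -k the number of clusters, -ow overwrites any existing output, and --clustering (short form -cl) assigns the input vectors to clusters once the iterations have finished. Please refer to the Mahout documentation for a full description of the KMeans parameters.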

$ mahout kmeans -i ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/tfidf-vectors/ -c ~/Downloads/reuters21578/parsedtext-kmeans-clusters -o ~/Downloads/reuters21578/parsedtext-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering -cl
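
Steps 1 to 3 above can also be chained from a small driver script. The following is only a sketch in Python: the names build_commands and run_pipeline and the parameters base, k and max_iter are illustrative assumptions, while the subcommands and flags are copied from the commands shown above and may differ between Mahout releases.

```python
import os
import subprocess

COSINE = "org.apache.mahout.common.distance.CosineDistanceMeasure"


def build_commands(base, k=20, max_iter=10):
    """Build the Step 1-3 Mahout commands as argument lists.

    `base`, `k` and `max_iter` are illustrative parameters; the flags
    themselves are copied from the commands shown in the post.
    """
    base = os.path.expanduser(base)
    text = os.path.join(base, "parsedtext")
    seqdir = os.path.join(base, "parsedtext-seqdir")
    sparse = os.path.join(base, "parsedtext-seqdir-sparse-kmeans")
    seeds = os.path.join(base, "parsedtext-kmeans-clusters")
    output = os.path.join(base, "parsedtext-kmeans")
    return [
        # Step 1: text files -> sequence files
        ["mahout", "seqdirectory", "-i", text, "-o", seqdir,
         "-c", "UTF-8", "-chunk", "5"],
        # Step 2: sequence files -> sparse vectors
        ["mahout", "seq2sparse", "-i", seqdir, "-o", sparse,
         "--maxDFPercent", "85", "--namedVector"],
        # Step 3: KMeans over the TF-IDF vectors
        ["mahout", "kmeans", "-i", os.path.join(sparse, "tfidf-vectors"),
         "-c", seeds, "-o", output, "-dm", COSINE,
         "-x", str(max_iter), "-k", str(k), "-ow", "--clustering", "-cl"],
    ]


def run_pipeline(base):
    """Run each step in order, stopping at the first failing command."""
    for command in build_commands(base):
        subprocess.run(command, check=True)
```

Calling run_pipeline("~/Downloads/reuters21578") would then execute the three steps in sequence, assuming the mahout launcher is on the PATH.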

Step 4: Dump the results to file

To view the clustering results, they need to be dumped using a special program called clusterdump. Note that the -o option specifies the file the dump is written to.

$ mahout clusterdump -s ~/Downloads/reuters21578/parsedtext-kmeans/clusters-*-final -d ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints -o ~/cluster-output.txt

Step 5: Dump the Document to Cluster mapping

The cluster mapping can be generated using the seqdumper program. In the output file, the ID of the cluster is given by "Key: <clusterid>" and the file name of the document is given by "value: <documentid>". The output file also contains other fields, which we will crop out in Step 6.

$ mahout seqdumper -s ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints/part-m-00000 > cluster-points.txt

Step 6: Extract the Document to Cluster mapping

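The following Perl script extracts the cluster-id and document-id from the dump file:

# Script to extract the doc-cluster mapping from Mahout output
# Usage: perl <script> <seqdumper-dump> <mapping-output>
use strict;
use warnings;

open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
open(my $out, '>', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";

while (my $line = <$in>) {
    # $1 is the cluster ID, $3 is the document ID (file name)
    if ($line =~ /Key: (.+?): (.+?) vec: \/(.+?) =/) {
        print {$out} "$1\t$3\n";
    }
}

close $in;
close $out;
print "Done\n";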

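To run the script, save it to a file (extract-mapping.pl is used here as an example name; any name works) and type on the command line:

$ perl extract-mapping.pl cluster-points.txt cluster-mapping.txt

The cluster-mapping.txt file should now contain a tab-delimited document-to-cluster mapping, one line per document, with the cluster ID first and the document ID second.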
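
Once the mapping exists, it can be loaded for further analysis. The sketch below is written in Python, which is an assumption about how one might consume the file rather than part of the original workflow. It groups document IDs by cluster ID and relies only on the tab-delimited, cluster-first layout that the extraction script writes; the function name load_mapping and the sample rows are illustrative.

```python
from collections import defaultdict


def load_mapping(lines):
    """Group document IDs by cluster ID.

    `lines` is any iterable of "cluster-id<TAB>document-id" strings, such as
    an open cluster-mapping.txt file.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        cluster_id, doc_id = line.split("\t", 1)
        clusters[cluster_id].append(doc_id)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical sample rows; real IDs come from cluster-mapping.txt.
    sample = ["101\tdoc-a.txt\n", "101\tdoc-b.txt\n", "205\tdoc-c.txt\n"]
    for cluster_id, docs in load_mapping(sample).items():
        print(cluster_id, len(docs))
```

With a real file, the same function can be called as load_mapping(open("cluster-mapping.txt")), which makes it easy to check how many documents ended up in each cluster.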

Thursday, June 7, 2012

Terry Moore: Why is 'x' the unknown? - YouTube

One of the many proofs that Arabs deserve credit for many of the technological advances we know today: we still build on foundations they laid long ago. Watch this very interesting talk by Terry Moore on why 'x' is the unknown.

How to set the default dictionary in Firefox

When you have multiple dictionaries installed, you most likely won't have a lot of control over which one becomes the default. To resolve this, set the default language for the dictionary in the Firefox configuration screen:

  1. Type about:config in the URL bar
  2. Search for spellchecker.dictionary
  3. If you want English to be your default dictionary, set its value to en_US

Tuesday, June 5, 2012

Barton Library Dataset

If you are interested in the Barton Libraries Dataset for benchmarking, you can download it using the link below. It took me some time to find, since the default link was not working.

Download the dataset from the Internet Archive

Friday, June 1, 2012

View All Running Processes on Mac OS X

Linux users will notice that the ps command does not behave the same under Mac OS X as it does under Linux. Under Mac OS X, ps takes different options, for example to list all running processes.

For Linux the command to use is
$ ps -aux
Under Mac OS X the command to use is
$ ps -eax
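
For scripts that have to run on both systems, the choice can be made at run time. This is a small sketch in Python; the function name ps_all_command is an illustrative assumption, and the two option sets are simply the ones given above.

```python
import platform


def ps_all_command(system=None):
    """Return the `ps` invocation from this post for the given OS name."""
    system = system or platform.system()
    if system == "Darwin":  # platform.system() reports "Darwin" on Mac OS X
        return ["ps", "-eax"]
    return ["ps", "-aux"]  # the Linux form given above
```

The returned list can be passed directly to subprocess.run to list the processes on the current machine.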

Wednesday, January 11, 2012

Egypt's Global Competitiveness Situation, In Plain Numbers

Global Competitiveness Ranking for Egypt (2011-2012)
Egypt has had a very challenging phase ahead of it since the January 25th Revolution. The following rankings represent the baseline that the Egyptian Government and the Egyptian People have to improve on in the next phase.
Global Competitiveness Index (GCI)
GCI 2011-2012 (out of 142): 94
GCI 2010-2011 (out of 139): 81
GCI 2009-2010 (out of 133): 70

Basic requirements (44.2%): 99
  Institutions: 74
  Macroeconomic environment: 132
  Health and primary education: 96
Efficiency enhancers (46.8%): 94
  Higher education and training: 107
  Goods market efficiency: 118
  Labor market efficiency: 141
  Financial market development: 92
  Technological readiness: 95
  Market size: 27
Innovation and sophistication factors (8.9%): 86
  Business sophistication: 72

Source: Global Competitiveness Report 2011-2012