Wednesday, July 4, 2012

KMeans Clustering Using Apache Mahout

This post summarizes the steps needed to cluster a set of documents (using the Reuters-21578 dataset as an example) and shows how to obtain a document-to-cluster mapping.
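
Note: the steps below assume the Reuters corpus has already been extracted from its raw SGML distribution into one plain-text file per article under ~/Downloads/reuters21578/parsedtext. If you are starting from the raw download, one common way to do this (the approach used by Mahout's own Reuters example) is Lucene's ExtractReuters tool; the paths here are assumptions, so adjust them to your setup:

$mahout org.apache.lucene.benchmark.utils.ExtractReuters ~/Downloads/reuters21578 ~/Downloads/reuters21578/parsedtext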

Step 1: Convert the input text to sequences

The first step is to convert the input documents into Hadoop SequenceFile format using seqdirectory. This is a prerequisite for KMeans clustering:

$mahout seqdirectory -i ~/Downloads/reuters21578/parsedtext -o ~/Downloads/reuters21578/parsedtext-seqdir -c UTF-8 -chunk 5
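
To sanity-check the conversion, you can peek at the generated sequence file with seqdumper. The chunk-0 name below is the default first chunk written by seqdirectory; note that Mahout 0.7 takes -i instead of -s for this utility (see the comments at the end of this post):

$mahout seqdumper -s ~/Downloads/reuters21578/parsedtext-seqdir/chunk-0 | head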

Step 2: Convert the generated sequences to sparse vectors using seq2sparse

The sequence files must now be converted to sparse vectors; the TF-IDF vectors produced here are the input to the clustering step. The --maxDFPercent 85 option discards terms that appear in more than 85% of the documents, and --namedVector keeps each document's name attached to its vector, which is what lets us recover the document-to-cluster mapping later:

$mahout seq2sparse -i ~/Downloads/reuters21578/parsedtext-seqdir -o ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector
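
If the job succeeds, the output directory should contain, among other things, the dictionary file and the tfidf-vectors folder referenced in the later steps (a rough sketch; exact contents can vary by Mahout version):

$ls ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans
df-count  dictionary.file-0  frequency.file-0  tf-vectors  tfidf-vectors  tokenized-documents  wordcount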

Step 3: Run the KMeans clustering

We now run the KMeans clustering on the TF-IDF vectors. Briefly: -k 20 requests 20 clusters and, because -k is given, Mahout samples 20 random input vectors as initial centroids and writes them to the -c directory (parsedtext-kmeans-clusters); -x 10 caps the number of iterations; -dm selects the cosine distance measure; and -cl additionally assigns each document to its nearest cluster, producing the clusteredPoints directory used in Steps 4 and 5. Please refer to the Mahout documentation for the full parameter descriptions.

$mahout kmeans -i ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/tfidf-vectors/ -c ~/Downloads/reuters21578/parsedtext-kmeans-clusters -o ~/Downloads/reuters21578/parsedtext-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow -cl
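
When the run finishes, the output directory holds one clusters-N subdirectory per iteration, with the last one suffixed -final (which is why Step 4 globs clusters-*-final), plus the clusteredPoints directory produced by -cl. Roughly (the iteration count will vary):

$ls ~/Downloads/reuters21578/parsedtext-kmeans
clusteredPoints  clusters-0  clusters-1  ...  clusters-4-final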

Step 4: Dump the results to file

To view the clustering results, dump them to a readable file with the clusterdump utility. Here -d points at the dictionary generated in Step 2 so that term IDs can be translated back into words, -n 20 prints the top 20 terms of each cluster, and -o specifies the output file for the dump:

$mahout clusterdump -s ~/Downloads/reuters21578/parsedtext-kmeans/clusters-*-final -d ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints -o ~/cluster-output.txt

Step 5: Dump the Document-to-Cluster Mapping

The mapping can be generated with the seqdumper utility. In the dump, the cluster ID appears as "Key: <clusterid>" and the document ID (the original file name) appears inside the value, after "vec:". The output contains other fields as well, which we will crop out in Step 6.

$mahout seqdumper -s ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints/part-m-00000 > cluster-points.txt

Step 6: Extract the Document-to-Cluster Mapping

The following Perl script extracts the cluster ID and document ID from the dump file:

# parseDump.pl - extract the document-to-cluster mapping from the Mahout dump
use strict;
use warnings;

die "Usage: perl parseDump.pl <dump file> <output file>\n" unless @ARGV == 2;

# Open the seqdumper output for reading and the mapping file for writing
open(my $in,  '<', $ARGV[0]) or die "Cannot open input file $ARGV[0]: $!";
open(my $out, '>', $ARGV[1]) or die "Cannot open output file $ARGV[1]: $!";

while (<$in>) {
    chomp;
    # Each point record looks like: Key: <clusterid>: ... vec: /<docid> = ...
    if (/Key: (.+?): (.+?) vec: \/(.+?) =/) {
        print $out "$1\t$3\n";    # <cluster-id> TAB <document-id>
    }
}

close($in);
close($out);
print "Done\n";


To run the script, type the following on the command line:

$perl parseDump.pl cluster-points.txt cluster-mapping.txt 
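
Equivalently (a convenience addition, not part of the original workflow), the same extraction can be done as a one-liner:

$perl -ne 'print "$1\t$3\n" if /Key: (.+?): (.+?) vec: \/(.+?) =/' cluster-points.txt > cluster-mapping.txt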
  
cluster-mapping.txt should now contain tab-delimited document-to-cluster mappings. If the file comes out empty even though the script prints "Done" (a problem several readers report in the comments below), the record format of your Mahout version probably differs from the one shown in Step 5; inspect cluster-points.txt and adjust the regular expression accordingly.
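
For illustration only (these values are made up; actual cluster IDs and document names depend entirely on your run), the output looks like:

21477	reut2-000.sgm-0.txt
21477	reut2-000.sgm-7.txt
20931	reut2-000.sgm-12.txt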

Comments:

  1. Hi!

    I love your very neatly written blog!
    I've tried a lot of k-means tutorials, but none of them went well except yours!
    At least I've finally managed to finish executing my k-means... phew.

    With Mahout 0.7, in Step 4 (Dump the results to file), I got the error:
    ERROR common.AbstractJob: Unexpected -s while processing Job-Specific Options:

    It looks like "-s" is not supported; I googled and found that "-i" might be appropriate.
    I changed it to "-i" and it seems to go, but it still stops with the following error.

    Exception in thread "main" java.lang.IllegalStateException: Job failed!
    at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.runIterationMR(RepresentativePointsDriver.java:248)
    at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.runIteration(RepresentativePointsDriver.java:162)
    at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.run(RepresentativePointsDriver.java:127)
    at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.run(RepresentativePointsDriver.java:90)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.main(RepresentativePointsDriver.java:67)
    at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:193)
    at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:153)
    at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:102)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

    I know it's hard to ask anyone to check on the error, but if you've ever worked with Mahout 0.7, maybe you can give me some advice?

    Thank you so much.



    Replies
    1. I had the same problem with the 0.7 version. It seems something is wrong in the updated implementation, since it cannot process the same command-line arguments. I'm currently running the 0.6 version with no problems.

      Good Luck

    2. Is there any workaround for this problem when using 0.7?

      My setup is Hadoop 1.1.1 and Mahout 0.7. I wouldn't want to switch to Mahout 0.6 if there is a workaround.

      I would even try to fix the issue if that is possible in 0.7.

      Thanks!

    3. I found the problem: in my case I was reading from the local filesystem (-i file:/tmp/input-seq) and writing to the local filesystem (-o file:/tmp/output-seq-sparse). However, Mahout assumes the intermediate files are generated on HDFS, and this caused the job to fail. Now I read from the local filesystem and write to HDFS (-o /tmp/output-seq-sparse).

      This fixed my problem.

  2. Sir,
    I wanted to know: can we implement minimum-spanning-tree-based (graph-based) clustering in Hadoop? If yes, is there any graph-based (MST) technique/algorithm already implemented in Mahout which I can use to understand it?

    Replies
    1. I would suggest you look at Giraph

      http://incubator.apache.org/giraph/

      Good Luck

  3. Thanks for your post, but when running mahout kmeans it complains about the -c parameter.
    I would be thankful for a more complete guide.

    Replies
    1. I would suggest you try Apache Mahout 0.6; this is the version I used when writing this article.

  4. Hi Amgad Madkour,

    I have a requirement to perform clustering on Japanese text. Can I do this with Mahout k-means and the Lucene CJK analyzer?

    Thanks

    Replies
    1. I haven't tried combining Lucene with Mahout before. Theoretically, both should work together, though.

      Good Luck

  5. Great tutorial!

  6. Hi Amgad,

    Very nice tutorial, thanks a lot. I am able to execute all the steps successfully except the last one: the Perl script in Step 6 simply hangs, even after 30 minutes, without giving output. How much time should it take to process the records? I have done Perl scripting on GB-sized files and gotten results within 5 minutes. Can you please provide an alternative or a solution?

  7. Yeah, sorry about that. Please see the updated code.

    Good Luck

  8. After running all the steps in the tutorial, I still didn't understand the text analysis. Please give more information.


    Thanks
    Amrendra Singh

    Replies
    1. The final results will give you a mapping between each document and a cluster number. This indicates that documents in a specific cluster exhibit some form of similarity based on the features each document has.

  9. Very helpful, nice post... Thanks a lot! :)

  10. What is this file: parsedtext-kmeans-clusters?

  11. I am using the CDH4.6 Cloudera VM and installed Mahout 0.7 (not the tarball) on it. When I run a command like mahout org.apache.lucene.benchmark.utils.ExtractReuters, it gives me an error like

    No org.lucene.benchmark.utils.ExtractReuters.props found on classpath, will use command-line arguments only
    Unknown program 'org.lucene.benchmark.utils.ExtractReuters' chosen

    and an exception like

    WARN driver.MahoutDriver: Unable to add class: org.lucene.benchmark.utils.ExtractReuters
    java.lang.ClassNotFoundException: org.lucene.benchmark.utils.ExtractReuters.

    Please help me to resolve this issue.

  12. Hi, I am using the Sandbox and I don't have the ~/Downloads/reuters21578/parsedtext path. What is in parsedtext?

  13. Hi,
    All of your scripts ran, but I am not able to do the last part, the Perl one. I made a file named parseDump.pl, saved it under the Mahout directory, and ran the perl command from the terminal as described. When I run it, it only shows "Done", but the cluster-mapping.txt file is blank; no output is gathered. What should I do? I am stuck only on this last step.

    Replies
    1. I am stuck on the same problem as well. Could you figure it out?

    2. Did you guys find the solution yet?

  14. I'm stuck in Step 3 with the error "no clusters found in ... check your -c argument" (about the part-randomSeed file). Can anyone help me?
