Trying Mahout -- Part 1

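I figured I'd play around a bit with a machine learning library.

I'll grab the zip from the page below and tinker with it on my MacBook Air (MBA).
The zip contains all the jars it needs, so setup is easy.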

http://www.apache.org/dyn/closer.cgi/mahout/

Following the links from that page, I downloaded

http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/mahout-distribution-0.7.zip

and got to work.
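
The download step can be sketched as a small shell snippet. The URL is the one quoted above; the use of curl and unzip, the variable names, and the fetch_mahout function are assumptions for illustration, and the mirror may no longer serve this file.

```shell
#!/bin/bash
# URL taken from the text above; the mirror may have removed it since.
MAHOUT_ZIP_URL="http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/mahout-distribution-0.7.zip"

# Derive local names from the URL with plain parameter expansion.
ZIP_NAME=${MAHOUT_ZIP_URL##*/}   # file name part of the URL
DIR_NAME=${ZIP_NAME%.zip}        # same name without the .zip suffix

# Hypothetical helper: download, unpack, and list the bin directory.
# Assumes curl and unzip are installed.
fetch_mahout() {
  curl -L -O "$MAHOUT_ZIP_URL" &&
    unzip -q "$ZIP_NAME" &&
    ls "$DIR_NAME/bin"
}
```

Nothing is downloaded until fetch_mahout is called, so the snippet is safe to source first and run later.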

First, get it running.

You can't learn anything until it runs.

To start, I'll run the mahout script from the terminal.
I unzipped the archive and moved into the bin directory.

MBA-20120331:~ guutara$ cd Desktop/mahout-distribution-0.7/bin/
MBA-20120331:bin guutara$ ls
mahout

MBA-20120331:bin guutara$ cat mahout 
#!/bin/bash
#
# The Mahout command script
#
# Environment Variables
#
#   MAHOUT_JAVA_HOME   The java implementation to use.  Overrides JAVA_HOME.
#
#   MAHOUT_HEAPSIZE    The maximum amount of heap to use, in MB.
#                      Default is 1000.
#
#   HADOOP_CONF_DIR  The location of a hadoop config directory
#
#   MAHOUT_OPTS        Extra Java runtime options.
#
#   MAHOUT_CONF_DIR    The location of the program short-name to class name
#                      mappings and the default properties files
#                      defaults to "$MAHOUT_HOME/src/conf"
#
#   MAHOUT_LOCAL       set to anything other than an empty string to force
#                      mahout to run locally even if
#                      HADOOP_CONF_DIR and HADOOP_HOME are set
#
#   MAHOUT_CORE        set to anything other than an empty string to force
#                      mahout to run in developer 'core' mode, just as if the
#                      -core option was presented on the command-line
# Commane-line Options
#
#   -core              -core is used to switch into 'developer mode' when
#                      running mahout locally. If specified, the classes
#                      from the 'target/classes' directories in each project
#                      are used. Otherwise classes will be retrived from
#                      jars in the binary releas collection or *-job.jar files
#                      found in build directories. When running on hadoop
#                      the job files will always be used.

#
#/**
# * Licensed to the Apache Software Foundation (ASF) under one or more
# * contributor license agreements.  See the NOTICE file distributed with
# * this work for additional information regarding copyright ownership.
# * The ASF licenses this file to You under the Apache License, Version 2.0
# * (the "License"); you may not use this file except in compliance with
# * the License.  You may obtain a copy of the License at
# *
# *     http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */

cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac

# resolve links - $0 may be a softlink
THIS="$0"
while [ -h "$THIS" ]; do
  ls=`ls -ld "$THIS"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    THIS="$link"
  else
    THIS=`dirname "$THIS"`/"$link"
  fi
done

IS_CORE=0
if [ "$1" == "-core" ] ; then
  IS_CORE=1
  shift
fi

if [ "$MAHOUT_CORE" != "" ]; then
  IS_CORE=1
fi

# some directories
THIS_DIR=`dirname "$THIS"`
MAHOUT_HOME=`cd "$THIS_DIR/.." ; pwd`

# some Java parameters
if [ "$MAHOUT_JAVA_HOME" != "" ]; then
  #echo "run java in $MAHOUT_JAVA_HOME"
  JAVA_HOME=$MAHOUT_JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx3g

# check envvars which might override default args
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $MAHOUT_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi

if [ "x$MAHOUT_CONF_DIR" = "x" ]; then
  if [ -d $MAHOUT_HOME/src/conf ]; then
    MAHOUT_CONF_DIR=$MAHOUT_HOME/src/conf
  else
    if [ -d $MAHOUT_HOME/conf ]; then
      MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
    else
      echo No MAHOUT_CONF_DIR found
    fi
  fi
fi

# CLASSPATH initially contains $MAHOUT_CONF_DIR, or defaults to $MAHOUT_HOME/src/conf
CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR

if [ "$MAHOUT_LOCAL" != "" ]; then
  echo "MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath."
elif [ -n "$HADOOP_CONF_DIR"  ] ; then
  echo "MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath."
  CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
fi

CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

# so that filenames w/ spaces are handled correctly in loops below
IFS=

if [ $IS_CORE == 0 ]
then
  # add release dependencies to CLASSPATH
  for f in $MAHOUT_HOME/mahout-*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # add dev targets if they exist
  for f in $MAHOUT_HOME/examples/target/mahout-examples-*-job.jar $MAHOUT_HOME/mahout-examples-*-job.jar ; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # add release dependencies to CLASSPATH
  for f in $MAHOUT_HOME/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done
else
  CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/math/target/classes
  CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/core/target/classes
  CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/integration/target/classes
  CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/examples/target/classes
  #CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/core/src/main/resources
fi

# add development dependencies to CLASSPATH
for f in $MAHOUT_HOME/examples/target/dependency/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done


# cygwin path translation
if $cygwin; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi

# restore ordinary behaviour
unset IFS

# default log directory & file
if [ "$MAHOUT_LOG_DIR" = "" ]; then
  MAHOUT_LOG_DIR="$MAHOUT_HOME/logs"
fi
if [ "$MAHOUT_LOGFILE" = "" ]; then
  MAHOUT_LOGFILE='mahout.log'
fi

#Fix log path under cygwin
if $cygwin; then
  MAHOUT_LOG_DIR=`cygpath -p -w "$MAHOUT_LOG_DIR"`
fi

MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.dir=$MAHOUT_LOG_DIR"
MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.file=$MAHOUT_LOGFILE"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.min.split.size=512MB"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.child.java.opts=-Xmx4096m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.child.java.opts=-Xmx4096m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.output.compress=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.compress.map.output=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.factor=30"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.mb=1024"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.file.buffer.size=32786"

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  MAHOUT_OPTS="$MAHOUT_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi

CLASS=org.apache.mahout.driver.MahoutDriver

for f in $MAHOUT_HOME/examples/target/mahout-examples-*-job.jar $MAHOUT_HOME/mahout-examples-*-job.jar ; do
  if [ -e "$f" ]; then
    MAHOUT_JOB=$f
  fi
done

# run it

HADOOP_BINARY=$(PATH="${HADOOP_HOME:-${HADOOP_PREFIX}}/bin:$PATH" which hadoop 2>/dev/null)
if [ -x "$HADOOP_BINARY" ] ; then
  HADOOP_BINARY_CLASSPATH=$("$HADOOP_BINARY" classpath)
fi

if [ ! -x "$HADOOP_BINARY" ] || [ "$MAHOUT_LOCAL" != "" ] ; then
  if [ ! -x "$HADOOP_BINARY" ] ; then
    echo "hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally"
  elif [ "$MAHOUT_LOCAL" != "" ] ; then
    echo "MAHOUT_LOCAL is set, running locally"
  fi
#  echo "CLASSPATH: $CLASSPATH"
    CLASSPATH="${CLASSPATH}:${MAHOUT_HOME/lib/hadoop/*}"
    case $1 in
    (classpath)
      echo $CLASSPATH
      ;;
    (*)
      exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
    esac
else
  echo "Running on hadoop, using $HADOOP_BINARY and HADOOP_CONF_DIR=$HADOOP_CONF_DIR"

  if [ "$MAHOUT_JOB" = "" ] ; then
    echo "ERROR: Could not find mahout-examples-*.job in $MAHOUT_HOME or $MAHOUT_HOME/examples/target, please run 'mvn install' to create the .job file"
    exit 1
  else
    case "$1" in
    (hadoop)
      shift
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
      exec "$HADOOP_BINARY" "$@"
      ;;
    (classpath)
      echo $CLASSPATH
      ;;
    (*)
      echo "MAHOUT-JOB: $MAHOUT_JOB"
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
      exec "$HADOOP_BINARY" jar $MAHOUT_JOB $CLASS "$@"
    esac
  fi
fi

MBA-20120331:bin guutara$
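
One detail in the script above is easy to miss: the header comment says the MAHOUT_HEAPSIZE default is 1000 MB, yet the code starts from -Xmx3g and only switches to an -Xmx<N>m value when MAHOUT_HEAPSIZE is set. The sketch below reproduces just that branch so the effect can be checked without starting Java. The function name heap_flag is made up for illustration and is not part of Mahout.

```shell
#!/bin/bash
# Mirrors the JAVA_HEAP_MAX logic from bin/mahout shown above.
# heap_flag is a hypothetical name, not part of Mahout.
heap_flag() {
  local heapsize=$1
  local flag=-Xmx3g            # default used by the script above
  if [ "$heapsize" != "" ]; then
    flag="-Xmx${heapsize}m"    # MAHOUT_HEAPSIZE is interpreted as MB
  fi
  echo "$flag"
}

heap_flag ""      # prints -Xmx3g
heap_flag 1000    # prints -Xmx1000m
```

So on a machine with little memory, running something like MAHOUT_HEAPSIZE=1000 ./mahout would presumably lower the heap limit from the 3 GB default.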


That's long... orz
First things first: I need to set up Java.

MBA-20120331:bin guutara$ which java
/usr/bin/java
MBA-20120331:bin guutara$ export JAVA_HOME=/usr

Then I ran it.

MBA-20120331:bin guutara$ ./mahout
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  dirichlet: : Dirichlet Clustering
  eigencuts: : Eigencuts spectral clustering
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  fpg: : Frequent Pattern Growth
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  meanshift: : Mean Shift clustering
  minhash: : Run Minhash clustering
  parallelALS: : ALS-WR factorization of a rating matrix
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence
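
The line 'Unable to find a $JAVA_HOME at "/usr"' in the output above suggests that /usr is not a real JDK home, even though /usr/bin/java exists. The message appears to come from the macOS java launcher rather than from Mahout. A minimal sketch of a more careful setup follows. /usr/libexec/java_home is the macOS tool that prints the active JDK home; the function name pick_java_home and its optional argument are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print a JAVA_HOME candidate.
# $1 is the path of the java_home tool (defaults to the macOS location).
pick_java_home() {
  local tool=${1:-/usr/libexec/java_home}
  if [ -x "$tool" ]; then
    "$tool"              # prints the home of the active JDK
  else
    echo "$JAVA_HOME"    # no tool available: keep the current setting
  fi
}

# Usage: set JAVA_HOME from the tool instead of hard-coding /usr.
export JAVA_HOME=$(pick_java_home)
echo "JAVA_HOME=$JAVA_HOME"
```

On systems without the java_home tool the snippet leaves JAVA_HOME as it was, so it should be harmless to put in a shell profile.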

I don't really get it. lol

I decided to try running one of the samples.
I moved into the examples/bin directory.

MBA-20120331:examples guutara$ pwd
/Users/guutara/Desktop/mahout-distribution-0.7/examples

MBA-20120331:examples guutara$ cd bin
MBA-20120331:bin guutara$ ls
README.txt				build-reuters.sh			factorize-movielens-1M.sh
asf-email-examples.sh			classify-20newsgroups.sh		factorize-netflix.sh
build-asf-email.sh			cluster-reuters.sh			lda.algorithm
build-cluster-syntheticcontrol.sh	cluster-syntheticcontrol.sh		tmp
MBA-20120331:bin guutara$ cat build-reuters.sh 
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

#
# Downloads the Reuters dataset and prepares it for clustering
#
# To run:  change into the mahout directory and type:
#  examples/bin/build-reuters.sh
echo "Please call cluster-reuters.sh directly next time.  This file is going away."
./cluster-reuters.sh 

OK then, let's try running cluster-reuters.sh.

MBA-20120331:bin guutara$ ./cluster-reuters.sh 
Please select a number to choose the corresponding clustering algorithm
1. kmeans clustering
2. fuzzykmeans clustering
3. dirichlet clustering
4. minhash clustering
Enter your choice : 1
ok. You chose 1 and we'll use kmeans Clustering
creating work directory at /tmp/mahout-work-guutara
Downloading Reuters-21578
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7959k  100 7959k    0     0   220k      0  0:00:36  0:00:36 --:--:--  229k
Extracting...
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	... 1 more
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	... 1 more
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	... 1 more

It failed...
One more time.

MBA-20120331:bin guutara$ cd ../
MBA-20120331:examples guutara$ ls
bin	src	target

MBA-20120331:examples guutara$ ln -s ../lib/hadoop/hadoop-core-0.20.204.0.jar ../lib/
MBA-20120331:examples guutara$ bin/cluster-reuters.sh 

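Hmm, I don't fully understand why, but it seems it doesn't work when run from the bin directory.
So I decided to run it from the examples directory instead.
I also symlinked (ln -s) the Hadoop jar into lib.
With that, it worked...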
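
A possible explanation for the NoClassDefFoundError, going by the mahout script printed earlier: in the local-run branch the classpath line reads ${MAHOUT_HOME/lib/hadoop/*}. In bash that form is a pattern substitution (replace the first "lib" in the value with "hadoop/*"), not a path under MAHOUT_HOME. If so, the jars in lib/hadoop would never reach the classpath, while everything directly under lib is added by the earlier loop. That would fit the observation that linking hadoop-core into lib made it work. This is only a reading of the script, not something the runs above prove. The sketch below compares the two spellings; the paths and variable names are illustrative and nothing is read from disk.

```shell
#!/bin/bash
# Hypothetical install path; only string handling is demonstrated here.
MAHOUT_HOME=/opt/mahout-demo

# Spelling from the script: substitutes "lib" inside the VALUE of MAHOUT_HOME.
# This path contains no "lib", so the value comes back unchanged.
as_written="${MAHOUT_HOME/lib/hadoop/*}"

# Probable intent (an assumption): a classpath wildcard for lib/hadoop.
as_intended="${MAHOUT_HOME}/lib/hadoop/*"

# When the value does contain "lib", the first match is replaced.
p=/usr/lib/mahout
with_lib="${p/lib/hadoop/*}"

echo "as written : $as_written"     # /opt/mahout-demo
echo "as intended: $as_intended"    # /opt/mahout-demo/lib/hadoop/*
echo "with lib   : $with_lib"       # /usr/hadoop/*/mahout
```

Under that reading, the symlink is the part of the fix that matters, and the working directory may not have been the real cause.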