Trying Mahout -- Part 1
Figured I'd mess around a bit with a machine-learning library.
I'll grab the zip from the link below and poke at it on my MacBook Air.
The zip already contains all the jars you need, which makes life easy.
Starting from
http://www.apache.org/dyn/closer.cgi/mahout/
I followed the mirror links to
http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/mahout-distribution-0.7.zip
downloaded that, and got to work.
First things first: run it.
Until it runs, nothing will make sense.
So, step one is to invoke the mahout script from the terminal.
I unzipped the archive and moved into the bin directory.
MBA-20120331:~ guutara$ cd Desktop/mahout-distribution-0.7/bin/
MBA-20120331:bin guutara$ ls
mahout
MBA-20120331:bin guutara$ cat mahout
#!/bin/bash
#
# The Mahout command script
#
# Environment Variables
#
#   MAHOUT_JAVA_HOME The java implementation to use. Overrides JAVA_HOME.
#
#   MAHOUT_HEAPSIZE  The maximum amount of heap to use, in MB.
#                    Default is 1000.
#
#   HADOOP_CONF_DIR  The location of a hadoop config directory
#
#   MAHOUT_OPTS      Extra Java runtime options.
#
#   MAHOUT_CONF_DIR  The location of the program short-name to class name
#                    mappings and the default properties files
#                    defaults to "$MAHOUT_HOME/src/conf"
#
#   MAHOUT_LOCAL     set to anything other than an empty string to force
#                    mahout to run locally even if
#                    HADOOP_CONF_DIR and HADOOP_HOME are set
#
#   MAHOUT_CORE      set to anything other than an empty string to force
#                    mahout to run in developer 'core' mode, just as if the
#                    -core option was presented on the command-line
# Commane-line Options
#
#   -core            -core is used to switch into 'developer mode' when
#                    running mahout locally. If specified, the classes
#                    from the 'target/classes' directories in each project
#                    are used. Otherwise classes will be retrived from
#                    jars in the binary releas collection or *-job.jar files
#                    found in build directories. When running on hadoop
#                    the job files will always be used.
#
#/**
# * Licensed to the Apache Software Foundation (ASF) under one or more
# * contributor license agreements. See the NOTICE file distributed with
# * this work for additional information regarding copyright ownership.
# * The ASF licenses this file to You under the Apache License, Version 2.0
# * (the "License"); you may not use this file except in compliance with
# * the License. You may obtain a copy of the License at
# *
# *     http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */

cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac

# resolve links - $0 may be a softlink
THIS="$0"
while [ -h "$THIS" ]; do
  ls=`ls -ld "$THIS"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    THIS="$link"
  else
    THIS=`dirname "$THIS"`/"$link"
  fi
done

IS_CORE=0
if [ "$1" == "-core" ] ; then
  IS_CORE=1
  shift
fi

if [ "$MAHOUT_CORE" != "" ]; then
  IS_CORE=1
fi

# some directories
THIS_DIR=`dirname "$THIS"`
MAHOUT_HOME=`cd "$THIS_DIR/.." ; pwd`

# some Java parameters
if [ "$MAHOUT_JAVA_HOME" != "" ]; then
  #echo "run java in $MAHOUT_JAVA_HOME"
  JAVA_HOME=$MAHOUT_JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx3g

# check envvars which might override default args
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $MAHOUT_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi

if [ "x$MAHOUT_CONF_DIR" = "x" ]; then
  if [ -d $MAHOUT_HOME/src/conf ]; then
    MAHOUT_CONF_DIR=$MAHOUT_HOME/src/conf
  else
    if [ -d $MAHOUT_HOME/conf ]; then
      MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
    else
      echo No MAHOUT_CONF_DIR found
    fi
  fi
fi

# CLASSPATH initially contains $MAHOUT_CONF_DIR, or defaults to $MAHOUT_HOME/src/conf
CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR

if [ "$MAHOUT_LOCAL" != "" ]; then
  echo "MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath."
elif [ -n "$HADOOP_CONF_DIR" ] ; then
  echo "MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath."
  CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
fi

CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

# so that filenames w/ spaces are handled correctly in loops below
IFS=

if [ $IS_CORE == 0 ]
then
  # add release dependencies to CLASSPATH
  for f in $MAHOUT_HOME/mahout-*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done
  # add dev targets if they exist
  for f in $MAHOUT_HOME/examples/target/mahout-examples-*-job.jar $MAHOUT_HOME/mahout-examples-*-job.jar ; do
    CLASSPATH=${CLASSPATH}:$f;
  done
  # add release dependencies to CLASSPATH
  for f in $MAHOUT_HOME/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done
else
  CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/math/target/classes
  CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/core/target/classes
  CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/integration/target/classes
  CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/examples/target/classes
  #CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/core/src/main/resources
fi

# add development dependencies to CLASSPATH
for f in $MAHOUT_HOME/examples/target/dependency/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# cygwin path translation
if $cygwin; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi

# restore ordinary behaviour
unset IFS

# default log directory & file
if [ "$MAHOUT_LOG_DIR" = "" ]; then
  MAHOUT_LOG_DIR="$MAHOUT_HOME/logs"
fi
if [ "$MAHOUT_LOGFILE" = "" ]; then
  MAHOUT_LOGFILE='mahout.log'
fi

#Fix log path under cygwin
if $cygwin; then
  MAHOUT_LOG_DIR=`cygpath -p -w "$MAHOUT_LOG_DIR"`
fi

MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.dir=$MAHOUT_LOG_DIR"
MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.file=$MAHOUT_LOGFILE"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.min.split.size=512MB"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.child.java.opts=-Xmx4096m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.child.java.opts=-Xmx4096m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.output.compress=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.compress.map.output=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.factor=30"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.mb=1024"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.file.buffer.size=32786"

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  MAHOUT_OPTS="$MAHOUT_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi

CLASS=org.apache.mahout.driver.MahoutDriver

for f in $MAHOUT_HOME/examples/target/mahout-examples-*-job.jar $MAHOUT_HOME/mahout-examples-*-job.jar ; do
  if [ -e "$f" ]; then
    MAHOUT_JOB=$f
  fi
done

# run it

HADOOP_BINARY=$(PATH="${HADOOP_HOME:-${HADOOP_PREFIX}}/bin:$PATH" which hadoop 2>/dev/null)
if [ -x "$HADOOP_BINARY" ] ; then
  HADOOP_BINARY_CLASSPATH=$("$HADOOP_BINARY" classpath)
fi

if [ ! -x "$HADOOP_BINARY" ] || [ "$MAHOUT_LOCAL" != "" ] ; then
  if [ ! -x "$HADOOP_BINARY" ] ; then
    echo "hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally"
  elif [ "$MAHOUT_LOCAL" != "" ] ; then
    echo "MAHOUT_LOCAL is set, running locally"
  fi

#  echo "CLASSPATH: $CLASSPATH"
  CLASSPATH="${CLASSPATH}:${MAHOUT_HOME/lib/hadoop/*}"
  case $1 in
  (classpath)
    echo $CLASSPATH
    ;;
  (*)
    exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
  esac
else
  echo "Running on hadoop, using $HADOOP_BINARY and HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
  if [ "$MAHOUT_JOB" = "" ] ; then
    echo "ERROR: Could not find mahout-examples-*.job in $MAHOUT_HOME or $MAHOUT_HOME/examples/target, please run 'mvn install' to create the .job file"
    exit 1
  else
    case "$1" in
    (hadoop)
      shift
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
      exec "$HADOOP_BINARY" "$@"
      ;;
    (classpath)
      echo $CLASSPATH
      ;;
    (*)
      echo "MAHOUT-JOB: $MAHOUT_JOB"
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
      exec "$HADOOP_BINARY" jar $MAHOUT_JOB $CLASS "$@"
    esac
  fi
fi
MBA-20120331:bin guutara$
Long... orz
Anyway, first I need to set up Java.
MBA-20120331:bin guutara$ which java
/usr/bin/java
MBA-20120331:bin guutara$ export JAVA_HOME=/usr
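Pointing JAVA_HOME at /usr happens to work because /usr/bin/java exists, but it's a blunt setting. A sketch of a gentler way on OS X, assuming the /usr/libexec/java_home helper that ships with Apple's Java (with /usr kept as the fallback I actually used):

```shell
# Ask Apple's java_home helper for the real JDK home; if the helper
# is missing (or fails), fall back to the crude /usr value.
export JAVA_HOME="$(/usr/libexec/java_home 2>/dev/null || echo /usr)"
echo "JAVA_HOME=$JAVA_HOME"
```

Either way, bin/mahout only checks that JAVA_HOME is non-empty and runs $JAVA_HOME/bin/java, so the value just has to resolve to a working java.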
With that set, I ran the script.
MBA-20120331:bin guutara$ ./mahout
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  dirichlet: : Dirichlet Clustering
  eigencuts: : Eigencuts spectral clustering
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  fpg: : Frequent Pattern Growth
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  meanshift: : Mean Shift clustering
  minhash: : Run Minhash clustering
  parallelALS: : ALS-WR factorization of a rating matrix
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence
No idea what most of these do yet. heh
Let's try running one of the samples instead.
I moved into the examples/bin directory.
MBA-20120331:examples guutara$ pwd
/Users/guutara/Desktop/mahout-distribution-0.7/examples
MBA-20120331:bin guutara$ cd bin
MBA-20120331:bin guutara$ ls
README.txt                              build-reuters.sh                factorize-movielens-1M.sh
asf-email-examples.sh                   classify-20newsgroups.sh        factorize-netflix.sh
build-asf-email.sh                      cluster-reuters.sh              lda.algorithm
build-cluster-syntheticcontrol.sh       cluster-syntheticcontrol.sh     tmp
MBA-20120331:bin guutara$ cat build-reuters.sh
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# Downloads the Reuters dataset and prepares it for clustering
#
# To run: change into the mahout directory and type:
#  examples/bin/build-reuters.sh

echo "Please call cluster-reuters.sh directly next time. This file is going away."
./cluster-reuters.sh
OK then, let's try running cluster-reuters.sh.
MBA-20120331:bin guutara$ ./cluster-reuters.sh
Please select a number to choose the corresponding clustering algorithm
1. kmeans clustering
2. fuzzykmeans clustering
3. dirichlet clustering
4. minhash clustering
Enter your choice : 1
ok. You chose 1 and we'll use kmeans Clustering
creating work directory at /tmp/mahout-work-guutara
Downloading Reuters-21578
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7959k  100 7959k    0     0   220k      0  0:00:36  0:00:36 --:--:--  229k
Extracting...
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 1 more
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 1 more
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 1 more
It's failing...
One more try.
MBA-20120331:bin guutara$ cd ../
MBA-20120331:examples guutara$ ls
bin     src     target
MBA-20120331:examples guutara$ ln -s ../lib/hadoop/hadoop-core-0.20.204.0.jar ../lib/
MBA-20120331:examples guutara$ bin/cluster-reuters.sh
Hmm, I don't completely understand why, but running it from inside the bin directory just doesn't work.
So I ran it from the examples directory instead.
I also symlinked (ln -s) the Hadoop jar into lib/.
And with that, it ran.
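My best guess at why the symlink matters: when running locally, bin/mahout builds its classpath with a non-recursive for f in $MAHOUT_HOME/lib/*.jar loop, so hadoop-core sitting one level down in lib/hadoop/ never gets picked up until you link it into lib/ itself. A toy reproduction of that glob behavior (the temp directory and empty jar file below are stand-ins, not a real Mahout install):

```shell
# Mimic bin/mahout's lib/*.jar classpath loop against a throwaway tree.
MAHOUT_HOME=$(mktemp -d)
mkdir -p "$MAHOUT_HOME/lib/hadoop"
touch "$MAHOUT_HOME/lib/hadoop/hadoop-core-0.20.204.0.jar"

# Before this link, lib/*.jar matches nothing in lib/ itself, so the
# jar that holds org.apache.hadoop.util.ProgramDriver is never added.
ln -s "$MAHOUT_HOME/lib/hadoop/hadoop-core-0.20.204.0.jar" "$MAHOUT_HOME/lib/"

CLASSPATH=""
for f in "$MAHOUT_HOME"/lib/*.jar; do CLASSPATH="$CLASSPATH:$f"; done
echo "$CLASSPATH"   # the symlinked hadoop-core jar now shows up
```

That would also explain the NoClassDefFoundError above: the JVM started fine, it just had no Hadoop classes on its classpath until the link was in place.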