Mahout -- Part 2
Today I'm going to run the samples again.
This time it's the classification sample.
classify-20newsgroups.sh
The README describes it as follows.
classify-20newsgroups.sh -- Run SGD and Bayes classifiers over the classic 20 News Groups. Downloads the data set automatically.
So, same as last time, let's run it.
MBA-20120331:examples guutara$ pwd
/Users/guutara/Desktop/mahout-distribution-0.7/examples
MBA-20120331:examples guutara$ bin/classify-20newsgroups.sh 1 &> log
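I wanted to keep a log this time, so I ran it as shown above, with `&> log` sending both stdout and stderr to a file named log.
The argument 1 is the number of one of the choices the script offers when run without arguments: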
MBA-20120331:examples guutara$ bin/classify-20newsgroups.sh
Please select a number to choose the corresponding task to run
1. cnaivebayes
2. naivebayes
3. sgd
4. clean -- cleans up the work area in /tmp/mahout-work-guutara
The test results are printed at the end.
+ echo 'Self testing on training set'
Self testing on training set
+ ./bin/mahout testnb -i /tmp/mahout-work-guutara/20news-train-vectors -m /tmp/mahout-work-guutara/model -l /tmp/mahout-work-guutara/labelindex -ow -o /tmp/mahout-work-guutara/20news-testing -c
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/01/09 13:56:38 WARN driver.MahoutDriver: No testnb.props found on classpath, will use command-line arguments only
13/01/09 13:56:38 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-guutara/20news-train-vectors], --labelIndex=[/tmp/mahout-work-guutara/labelindex], --model=[/tmp/mahout-work-guutara/model], --output=[/tmp/mahout-work-guutara/20news-testing], --overwrite=null, --startPhase=[0], --tempDir=[temp], --testComplementary=null}
2013-01-09 13:56:38.669 java[24024:c0b] Unable to load realm info from SCDynamicStore
13/01/09 13:56:38 INFO common.HadoopUtil: Deleting /tmp/mahout-work-guutara/20news-testing
13/01/09 13:56:39 INFO input.FileInputFormat: Total input paths to process : 1
13/01/09 13:56:39 INFO filecache.TrackerDistributedCacheManager: Creating model in /tmp/hadoop-guutara/mapred/local/archive/-3367352280950352653_7000557_497732096/file/tmp/mahout-work-guutara-work--7062815705982847828 with rwxr-xr-x
13/01/09 13:56:39 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-guutara/model as /tmp/hadoop-guutara/mapred/local/archive/-3367352280950352653_7000557_497732096/file/tmp/mahout-work-guutara/model
13/01/09 13:56:39 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-guutara/model as /tmp/hadoop-guutara/mapred/local/archive/-3367352280950352653_7000557_497732096/file/tmp/mahout-work-guutara/model
13/01/09 13:56:39 INFO mapred.JobClient: Running job: job_local_0001
13/01/09 13:56:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/09 13:56:39 INFO compress.CodecPool: Got brand-new decompressor Setup
13/01/09 13:56:40 INFO mapred.JobClient: map 0% reduce 0%
13/01/09 13:56:45 INFO mapred.LocalJobRunner:
13/01/09 13:56:46 INFO mapred.JobClient: map 58% reduce 0%
13/01/09 13:56:48 INFO mapred.LocalJobRunner:
13/01/09 13:56:49 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/01/09 13:56:49 INFO mapred.LocalJobRunner:
13/01/09 13:56:49 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
13/01/09 13:56:49 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/mahout-work-guutara/20news-testing
13/01/09 13:56:49 INFO mapred.JobClient: map 89% reduce 0%
13/01/09 13:56:51 INFO mapred.LocalJobRunner:
13/01/09 13:56:51 INFO mapred.LocalJobRunner:
13/01/09 13:56:51 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/09 13:56:52 INFO mapred.JobClient: map 100% reduce 0%
13/01/09 13:56:52 INFO mapred.JobClient: Job complete: job_local_0001
13/01/09 13:56:52 INFO mapred.JobClient: Counters: 8
13/01/09 13:56:52 INFO mapred.JobClient: File Output Format Counters
13/01/09 13:56:52 INFO mapred.JobClient: Bytes Written=2136074
13/01/09 13:56:52 INFO mapred.JobClient: File Input Format Counters
13/01/09 13:56:52 INFO mapred.JobClient: Bytes Read=12787156
13/01/09 13:56:52 INFO mapred.JobClient: FileSystemCounters
13/01/09 13:56:52 INFO mapred.JobClient: FILE_BYTES_READ=31568835
13/01/09 13:56:52 INFO mapred.JobClient: FILE_BYTES_WRITTEN=17328654
13/01/09 13:56:52 INFO mapred.JobClient: Map-Reduce Framework
13/01/09 13:56:52 INFO mapred.JobClient: Map input records=11254
13/01/09 13:56:52 INFO mapred.JobClient: Spilled Records=0
13/01/09 13:56:52 INFO mapred.JobClient: SPLIT_RAW_BYTES=128
13/01/09 13:56:52 INFO mapred.JobClient: Map output records=11254
13/01/09 13:56:52 INFO test.TestNaiveBayesDriver: Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   :  11150  99.0759%
Incorrectly Classified Instances :    104   0.9241%
Total Classified Instances       :  11254
=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t  <--Classified as
 488   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   2   0  |  490  a = alt.atheism
   0 573   0   1   1   3   1   0   0   0   0   0   0   2   1   0   0   0   0   0  |  582  b = comp.graphics
   0   6 551  25   2   5   2   0   0   0   0   0   1   0   2   0   0   0   0   0  |  594  c = comp.os.ms-windows.misc
   0   0   0 565   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0  |  566  d = comp.sys.ibm.pc.hardware
   0   0   1   1 559   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0  |  562  e = comp.sys.mac.hardware
   0   3   0   1   0 573   0   0   0   0   0   0   0   0   0   0   0   0   0   0  |  577  f = comp.windows.x
   0   0   0   0   0   0 580   0   0   0   1   0   3   0   0   0   0   0   0   0  |  584  g = misc.forsale
   0   0   0   0   1   0   0 581   0   0   0   0   2   0   0   0   0   0   0   1  |  585  h = rec.autos
   0   0   0   0   0   0   1   1 590   0   0   0   0   0   0   0   0   0   0   0  |  592  i = rec.motorcycles
   0   0   0   0   0   0   0   0   0 574   3   0   0   0   0   0   0   0   0   0  |  577  j = rec.sport.baseball
   0   0   0   1   0   0   1   0   0   1 597   0   0   0   0   0   0   0   0   1  |  601  k = rec.sport.hockey
   0   0   0   0   0   0   0   0   0   0   0 588   0   0   0   0   0   0   0   0  |  588  l = sci.crypt
   0   0   0   4   0   0   0   0   0   0   0   0 589   0   0   0   0   0   0   0  |  593  m = sci.electronics
   0   1   0   0   0   0   0   0   0   0   0   0   2 622   0   0   0   0   0   0  |  625  n = sci.med
   0   0   0   0   0   0   0   0   0   0   0   0   0   1 603   0   0   0   0   0  |  604  o = sci.space
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 589   1   0   0   0  |  590  p = soc.religion.christian
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1 564   0   0   0  |  565  q = talk.politics.mideast
   0   0   1   0   0   0   0   0   0   0   0   1   0   0   0   0   0 546   0   0  |  548  r = talk.politics.guns
   7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   1 369   1  |  379  s = talk.religion.misc
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3   0 449  |  452  t = talk.politics.misc
13/01/09 13:56:52 INFO driver.MahoutDriver: Program took 14498 ms (Minutes: 0.24163333333333334)
+ echo 'Testing on holdout set'
Testing on holdout set
+ ./bin/mahout testnb -i /tmp/mahout-work-guutara/20news-test-vectors -m /tmp/mahout-work-guutara/model -l /tmp/mahout-work-guutara/labelindex -ow -o /tmp/mahout-work-guutara/20news-testing -c
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/01/09 13:56:53 WARN driver.MahoutDriver: No testnb.props found on classpath, will use command-line arguments only
13/01/09 13:56:53 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-guutara/20news-test-vectors], --labelIndex=[/tmp/mahout-work-guutara/labelindex], --model=[/tmp/mahout-work-guutara/model], --output=[/tmp/mahout-work-guutara/20news-testing], --overwrite=null, --startPhase=[0], --tempDir=[temp], --testComplementary=null}
2013-01-09 13:56:53.842 java[24034:c0b] Unable to load realm info from SCDynamicStore
13/01/09 13:56:53 INFO common.HadoopUtil: Deleting /tmp/mahout-work-guutara/20news-testing
13/01/09 13:56:54 INFO input.FileInputFormat: Total input paths to process : 1
13/01/09 13:56:54 INFO filecache.TrackerDistributedCacheManager: Creating model in /tmp/hadoop-guutara/mapred/local/archive/96753755695796741_7000557_497732096/file/tmp/mahout-work-guutara-work-213440611316171756 with rwxr-xr-x
13/01/09 13:56:54 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-guutara/model as /tmp/hadoop-guutara/mapred/local/archive/96753755695796741_7000557_497732096/file/tmp/mahout-work-guutara/model
13/01/09 13:56:54 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-guutara/model as /tmp/hadoop-guutara/mapred/local/archive/96753755695796741_7000557_497732096/file/tmp/mahout-work-guutara/model
13/01/09 13:56:54 INFO mapred.JobClient: Running job: job_local_0001
13/01/09 13:56:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/09 13:56:54 INFO compress.CodecPool: Got brand-new decompressor Setup
13/01/09 13:56:55 INFO mapred.JobClient: map 0% reduce 0%
13/01/09 13:57:00 INFO mapred.LocalJobRunner:
13/01/09 13:57:01 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/01/09 13:57:01 INFO mapred.LocalJobRunner:
13/01/09 13:57:01 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
13/01/09 13:57:01 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/mahout-work-guutara/20news-testing
13/01/09 13:57:01 INFO mapred.JobClient: map 86% reduce 0%
13/01/09 13:57:03 INFO mapred.LocalJobRunner:
13/01/09 13:57:03 INFO mapred.LocalJobRunner:
13/01/09 13:57:03 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/09 13:57:04 INFO mapred.JobClient: map 100% reduce 0%
13/01/09 13:57:04 INFO mapred.JobClient: Job complete: job_local_0001
13/01/09 13:57:04 INFO mapred.JobClient: Counters: 8
13/01/09 13:57:04 INFO mapred.JobClient: File Output Format Counters
13/01/09 13:57:04 INFO mapred.JobClient: Bytes Written=1442110
13/01/09 13:57:04 INFO mapred.JobClient: File Input Format Counters
13/01/09 13:57:04 INFO mapred.JobClient: Bytes Read=8584811
13/01/09 13:57:04 INFO mapred.JobClient: FileSystemCounters
13/01/09 13:57:04 INFO mapred.JobClient: FILE_BYTES_READ=27366489
13/01/09 13:57:04 INFO mapred.JobClient: FILE_BYTES_WRITTEN=16634680
13/01/09 13:57:04 INFO mapred.JobClient: Map-Reduce Framework
13/01/09 13:57:04 INFO mapred.JobClient: Map input records=7592
13/01/09 13:57:04 INFO mapred.JobClient: Spilled Records=0
13/01/09 13:57:04 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
13/01/09 13:57:04 INFO mapred.JobClient: Map output records=7592
13/01/09 13:57:04 INFO test.TestNaiveBayesDriver: Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   :   6797  89.5285%
Incorrectly Classified Instances :    795  10.4715%
Total Classified Instances       :   7592
=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t  <--Classified as
 289   0   0   0   0   0   0   0   0   0   0   0   0   1   1   3   0   0  10   5  |  309  a = alt.atheism
   1 327   1  19   4  18   6   0   1   0   0   3   6   0   3   1   0   0   0   1  |  391  b = comp.graphics
   2  37 182  87  22  28  13   0   0   0   0   6   3   1   2   0   0   0   3   5  |  391  c = comp.os.ms-windows.misc
   1   6   1 362  19   4   9   1   0   0   1   2   8   0   1   0   0   0   1   0  |  416  d = comp.sys.ibm.pc.hardware
   1   4   1  12 368   2   6   1   0   0   0   2   4   0   0   0   0   0   0   0  |  401  e = comp.sys.mac.hardware
   0  15   2   6   1 377   2   1   0   1   0   0   1   0   4   0   0   0   0   1  |  411  f = comp.windows.x
   0   1   0  18   8   0 333  10   2   2   1   2   7   1   3   0   0   3   0   0  |  391  g = misc.forsale
   0   0   0   2   2   1   8 377   7   0   0   0   2   1   1   0   0   3   0   1  |  405  h = rec.autos
   0   0   0   1   1   0   5   7 385   0   0   0   1   2   1   0   0   0   1   0  |  404  i = rec.motorcycles
   0   0   0   0   2   0   1   0   2 407   3   0   1   0   1   0   0   0   0   0  |  417  j = rec.sport.baseball
   0   0   0   1   0   0   0   0   1   5 391   0   0   0   0   0   0   0   0   0  |  398  k = rec.sport.hockey
   0   3   1   0   0   1   0   1   0   0   0 385   3   2   1   0   0   4   0   2  |  403  l = sci.crypt
   0   6   0  17   8   2   5   2   1   1   0   1 345   0   2   0   0   1   0   0  |  391  m = sci.electronics
   2   2   0   1   3   2   4   2   1   0   0   1   1 341   2   0   0   2   0   1  |  365  n = sci.med
   1   4   0   0   1   2   0   0   0   1   0   0   1   3 365   0   1   0   3   1  |  383  o = sci.space
   7   1   0   1   0   0   0   0   0   0   1   0   0   3   0 383   1   0  10   0  |  407  p = soc.religion.christian
   1   1   0   0   1   0   0   0   0   1   1   0   0   0   1   1 363   1   0   4  |  375  q = talk.politics.mideast
   0   0   0   0   0   1   0   1   2   2   0   3   0   1   0   0   0 341   0  11  |  362  r = talk.politics.guns
  35   1   0   1   0   0   0   2   0   1   1   0   0   0   0   9   2   6 184   7  |  249  s = talk.religion.misc
   1   0   0   1   2   0   0   1   0   1   0   2   0   2   3   0   7  10   1 292  |  323  t = talk.politics.misc
13/01/09 13:57:04 INFO driver.MahoutDriver: Program took 11437 ms (Minutes: 0.19061666666666666)
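Since the whole run went into the log file, the two summaries can be pulled back out and the percentages rechecked by hand. The following is only a minimal Python sketch: the excerpt is pasted from the two testnb runs above, while the function name `summaries` and the regular expressions are my own assumptions about the line layout, not something Mahout provides.

```python
import re

# Summary lines copied from the two testnb runs shown above.
LOG_EXCERPT = """\
Self testing on training set
Correctly Classified Instances : 11150 99.0759%
Incorrectly Classified Instances : 104 0.9241%
Total Classified Instances : 11254
Testing on holdout set
Correctly Classified Instances : 6797 89.5285%
Incorrectly Classified Instances : 795 10.4715%
Total Classified Instances : 7592
"""


def summaries(text):
    """Return one (correct, total) pair per Summary block found in text."""
    # ^ with re.M anchors at the start of a line, so the
    # "Incorrectly Classified Instances" lines are not matched.
    correct = re.findall(r"^Correctly Classified Instances\s*:\s*(\d+)", text, re.M)
    total = re.findall(r"^Total Classified Instances\s*:\s*(\d+)", text, re.M)
    return [(int(c), int(t)) for c, t in zip(correct, total)]


results = summaries(LOG_EXCERPT)
for name, (correct, total) in zip(["training set", "holdout set"], results):
    print(f"{name}: {correct}/{total} = {100 * correct / total:.4f}%")
```

The self-test on the training set scores about 99.1%, while the holdout set scores about 89.5%. The model has already seen the training documents, so the holdout figure is the one that says something about unseen posts.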
This could be interesting.
As for how to read the test results, I did some reading on that.
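To make the confusion matrix a little more concrete, here is a minimal Python sketch that recomputes the overall accuracy and a per-class recall and precision from the holdout matrix above. It assumes each row is the actual newsgroup and each column is what the classifier answered, which fits the row totals printed after the `|`. The names `LABELS`, `ROWS`, `recall` and `precision` are just illustrative.

```python
# Labels a..t in the order printed beside the confusion matrix.
LABELS = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.mideast",
    "talk.politics.guns", "talk.religion.misc", "talk.politics.misc",
]

# Holdout confusion matrix copied from the log (row totals and labels dropped).
ROWS = """\
289 0 0 0 0 0 0 0 0 0 0 0 0 1 1 3 0 0 10 5
1 327 1 19 4 18 6 0 1 0 0 3 6 0 3 1 0 0 0 1
2 37 182 87 22 28 13 0 0 0 0 6 3 1 2 0 0 0 3 5
1 6 1 362 19 4 9 1 0 0 1 2 8 0 1 0 0 0 1 0
1 4 1 12 368 2 6 1 0 0 0 2 4 0 0 0 0 0 0 0
0 15 2 6 1 377 2 1 0 1 0 0 1 0 4 0 0 0 0 1
0 1 0 18 8 0 333 10 2 2 1 2 7 1 3 0 0 3 0 0
0 0 0 2 2 1 8 377 7 0 0 0 2 1 1 0 0 3 0 1
0 0 0 1 1 0 5 7 385 0 0 0 1 2 1 0 0 0 1 0
0 0 0 0 2 0 1 0 2 407 3 0 1 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0 1 5 391 0 0 0 0 0 0 0 0 0
0 3 1 0 0 1 0 1 0 0 0 385 3 2 1 0 0 4 0 2
0 6 0 17 8 2 5 2 1 1 0 1 345 0 2 0 0 1 0 0
2 2 0 1 3 2 4 2 1 0 0 1 1 341 2 0 0 2 0 1
1 4 0 0 1 2 0 0 0 1 0 0 1 3 365 0 1 0 3 1
7 1 0 1 0 0 0 0 0 0 1 0 0 3 0 383 1 0 10 0
1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 1 363 1 0 4
0 0 0 0 0 1 0 1 2 2 0 3 0 1 0 0 0 341 0 11
35 1 0 1 0 0 0 2 0 1 1 0 0 0 0 9 2 6 184 7
1 0 0 1 2 0 0 1 0 1 0 2 0 2 3 0 7 10 1 292
"""

matrix = [[int(n) for n in line.split()] for line in ROWS.splitlines()]


def recall(m, i):
    """Share of class i's actual posts that were labeled as class i (row-wise)."""
    return m[i][i] / sum(m[i])


def precision(m, i):
    """Share of posts labeled as class i that really belong to it (column-wise)."""
    return m[i][i] / sum(row[i] for row in m)


# The diagonal holds the correctly classified posts.
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
print(f"accuracy: {correct}/{total} = {correct / total:.4%}")

for i, label in enumerate(LABELS):
    print(f"{label:26s} recall={recall(matrix, i):.3f} precision={precision(matrix, i):.3f}")

worst = min(range(len(LABELS)), key=lambda i: recall(matrix, i))
print("lowest recall:", LABELS[worst])
```

In this run the weakest row is comp.os.ms-windows.misc: only 182 of its 391 holdout posts were labeled correctly, and 87 of them ended up under comp.sys.ibm.pc.hardware.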