Mahout -- Part 2

Today I'm going to run the samples again.

This time it's the classification sample.

classify-20newsgroups.sh

The README describes it as follows.

classify-20newsgroups.sh -- Run SGD and Bayes classifiers over the classic 20 News Groups.  
Downloads the data set automatically.

So, same as last time, let's run it.

MBA-20120331:examples guutara$ pwd
/Users/guutara/Desktop/mahout-distribution-0.7/examples

MBA-20120331:examples guutara$ bin/classify-20newsgroups.sh 1 &> log

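I wanted to keep a log this time, so I ran it as shown above, sending the output to a file named log.
The argument 1 is the number of one of the tasks in the menu below.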

MBA-20120331:examples guutara$ bin/classify-20newsgroups.sh 
Please select a number to choose the corresponding task to run
1. cnaivebayes
2. naivebayes
3. sgd
4. clean -- cleans up the work area in /tmp/mahout-work-guutara

At the end of the run, the test results are displayed.

+ echo 'Self testing on training set'
Self testing on training set
+ ./bin/mahout testnb -i /tmp/mahout-work-guutara/20news-train-vectors -m /tmp/mahout-work-guutara/model -l /tmp/mahout-work-guutara/labelindex -ow -o /tmp/mahout-work-guutara/20news-testing -c
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/01/09 13:56:38 WARN driver.MahoutDriver: No testnb.props found on classpath, will use command-line arguments only
13/01/09 13:56:38 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-guutara/20news-train-vectors], --labelIndex=[/tmp/mahout-work-guutara/labelindex], --model=[/tmp/mahout-work-guutara/model], --output=[/tmp/mahout-work-guutara/20news-testing], --overwrite=null, --startPhase=[0], --tempDir=[temp], --testComplementary=null}
2013-01-09 13:56:38.669 java[24024:c0b] Unable to load realm info from SCDynamicStore
13/01/09 13:56:38 INFO common.HadoopUtil: Deleting /tmp/mahout-work-guutara/20news-testing
13/01/09 13:56:39 INFO input.FileInputFormat: Total input paths to process : 1
13/01/09 13:56:39 INFO filecache.TrackerDistributedCacheManager: Creating model in /tmp/hadoop-guutara/mapred/local/archive/-3367352280950352653_7000557_497732096/file/tmp/mahout-work-guutara-work--7062815705982847828 with rwxr-xr-x
13/01/09 13:56:39 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-guutara/model as /tmp/hadoop-guutara/mapred/local/archive/-3367352280950352653_7000557_497732096/file/tmp/mahout-work-guutara/model
13/01/09 13:56:39 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-guutara/model as /tmp/hadoop-guutara/mapred/local/archive/-3367352280950352653_7000557_497732096/file/tmp/mahout-work-guutara/model
13/01/09 13:56:39 INFO mapred.JobClient: Running job: job_local_0001
13/01/09 13:56:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/09 13:56:39 INFO compress.CodecPool: Got brand-new decompressor
Setup
13/01/09 13:56:40 INFO mapred.JobClient:  map 0% reduce 0%
13/01/09 13:56:45 INFO mapred.LocalJobRunner: 
13/01/09 13:56:46 INFO mapred.JobClient:  map 58% reduce 0%
13/01/09 13:56:48 INFO mapred.LocalJobRunner: 
13/01/09 13:56:49 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/01/09 13:56:49 INFO mapred.LocalJobRunner: 
13/01/09 13:56:49 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
13/01/09 13:56:49 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/mahout-work-guutara/20news-testing
13/01/09 13:56:49 INFO mapred.JobClient:  map 89% reduce 0%
13/01/09 13:56:51 INFO mapred.LocalJobRunner: 
13/01/09 13:56:51 INFO mapred.LocalJobRunner: 
13/01/09 13:56:51 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/09 13:56:52 INFO mapred.JobClient:  map 100% reduce 0%
13/01/09 13:56:52 INFO mapred.JobClient: Job complete: job_local_0001
13/01/09 13:56:52 INFO mapred.JobClient: Counters: 8
13/01/09 13:56:52 INFO mapred.JobClient:   File Output Format Counters 
13/01/09 13:56:52 INFO mapred.JobClient:     Bytes Written=2136074
13/01/09 13:56:52 INFO mapred.JobClient:   File Input Format Counters 
13/01/09 13:56:52 INFO mapred.JobClient:     Bytes Read=12787156
13/01/09 13:56:52 INFO mapred.JobClient:   FileSystemCounters
13/01/09 13:56:52 INFO mapred.JobClient:     FILE_BYTES_READ=31568835
13/01/09 13:56:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=17328654
13/01/09 13:56:52 INFO mapred.JobClient:   Map-Reduce Framework
13/01/09 13:56:52 INFO mapred.JobClient:     Map input records=11254
13/01/09 13:56:52 INFO mapred.JobClient:     Spilled Records=0
13/01/09 13:56:52 INFO mapred.JobClient:     SPLIT_RAW_BYTES=128
13/01/09 13:56:52 INFO mapred.JobClient:     Map output records=11254
13/01/09 13:56:52 INFO test.TestNaiveBayesDriver: Complementary Results: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      11150	   99.0759%
Incorrectly Classified Instances        :        104	    0.9241%
Total Classified Instances              :      11254

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	h    	i    	j    	k    	l    	m    	n    	o    	p    	q    	r    	s    	t    	<--Classified as
488  	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	2    	0    	 |  490   	a     = alt.atheism
0    	573  	0    	1    	1    	3    	1    	0    	0    	0    	0    	0    	0    	2    	1    	0    	0    	0    	0    	0    	 |  582   	b     = comp.graphics
0    	6    	551  	25   	2    	5    	2    	0    	0    	0    	0    	0    	1    	0    	2    	0    	0    	0    	0    	0    	 |  594   	c     = comp.os.ms-windows.misc
0    	0    	0    	565  	0    	0    	1    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  566   	d     = comp.sys.ibm.pc.hardware
0    	0    	1    	1    	559  	0    	0    	0    	0    	0    	0    	0    	1    	0    	0    	0    	0    	0    	0    	0    	 |  562   	e     = comp.sys.mac.hardware
0    	3    	0    	1    	0    	573  	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  577   	f     = comp.windows.x
0    	0    	0    	0    	0    	0    	580  	0    	0    	0    	1    	0    	3    	0    	0    	0    	0    	0    	0    	0    	 |  584   	g     = misc.forsale
0    	0    	0    	0    	1    	0    	0    	581  	0    	0    	0    	0    	2    	0    	0    	0    	0    	0    	0    	1    	 |  585   	h     = rec.autos
0    	0    	0    	0    	0    	0    	1    	1    	590  	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  592   	i     = rec.motorcycles
0    	0    	0    	0    	0    	0    	0    	0    	0    	574  	3    	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  577   	j     = rec.sport.baseball
0    	0    	0    	1    	0    	0    	1    	0    	0    	1    	597  	0    	0    	0    	0    	0    	0    	0    	0    	1    	 |  601   	k     = rec.sport.hockey
0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	588  	0    	0    	0    	0    	0    	0    	0    	0    	 |  588   	l     = sci.crypt
0    	0    	0    	4    	0    	0    	0    	0    	0    	0    	0    	0    	589  	0    	0    	0    	0    	0    	0    	0    	 |  593   	m     = sci.electronics
0    	1    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	2    	622  	0    	0    	0    	0    	0    	0    	 |  625   	n     = sci.med
0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	1    	603  	0    	0    	0    	0    	0    	 |  604   	o     = sci.space
0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	589  	1    	0    	0    	0    	 |  590   	p     = soc.religion.christian
0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	1    	564  	0    	0    	0    	 |  565   	q     = talk.politics.mideast
0    	0    	1    	0    	0    	0    	0    	0    	0    	0    	0    	1    	0    	0    	0    	0    	0    	546  	0    	0    	 |  548   	r     = talk.politics.guns
7    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	1    	0    	1    	369  	1    	 |  379   	s     = talk.religion.misc
0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	3    	0    	449  	 |  452   	t     = talk.politics.misc
13/01/09 13:56:52 INFO driver.MahoutDriver: Program took 14498 ms (Minutes: 0.24163333333333334)
+ echo 'Testing on holdout set'
Testing on holdout set
+ ./bin/mahout testnb -i /tmp/mahout-work-guutara/20news-test-vectors -m /tmp/mahout-work-guutara/model -l /tmp/mahout-work-guutara/labelindex -ow -o /tmp/mahout-work-guutara/20news-testing -c
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/guutara/Desktop/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/01/09 13:56:53 WARN driver.MahoutDriver: No testnb.props found on classpath, will use command-line arguments only
13/01/09 13:56:53 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-guutara/20news-test-vectors], --labelIndex=[/tmp/mahout-work-guutara/labelindex], --model=[/tmp/mahout-work-guutara/model], --output=[/tmp/mahout-work-guutara/20news-testing], --overwrite=null, --startPhase=[0], --tempDir=[temp], --testComplementary=null}
2013-01-09 13:56:53.842 java[24034:c0b] Unable to load realm info from SCDynamicStore
13/01/09 13:56:53 INFO common.HadoopUtil: Deleting /tmp/mahout-work-guutara/20news-testing
13/01/09 13:56:54 INFO input.FileInputFormat: Total input paths to process : 1
13/01/09 13:56:54 INFO filecache.TrackerDistributedCacheManager: Creating model in /tmp/hadoop-guutara/mapred/local/archive/96753755695796741_7000557_497732096/file/tmp/mahout-work-guutara-work-213440611316171756 with rwxr-xr-x
13/01/09 13:56:54 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-guutara/model as /tmp/hadoop-guutara/mapred/local/archive/96753755695796741_7000557_497732096/file/tmp/mahout-work-guutara/model
13/01/09 13:56:54 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-guutara/model as /tmp/hadoop-guutara/mapred/local/archive/96753755695796741_7000557_497732096/file/tmp/mahout-work-guutara/model
13/01/09 13:56:54 INFO mapred.JobClient: Running job: job_local_0001
13/01/09 13:56:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/09 13:56:54 INFO compress.CodecPool: Got brand-new decompressor
Setup
13/01/09 13:56:55 INFO mapred.JobClient:  map 0% reduce 0%
13/01/09 13:57:00 INFO mapred.LocalJobRunner: 
13/01/09 13:57:01 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/01/09 13:57:01 INFO mapred.LocalJobRunner: 
13/01/09 13:57:01 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
13/01/09 13:57:01 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/mahout-work-guutara/20news-testing
13/01/09 13:57:01 INFO mapred.JobClient:  map 86% reduce 0%
13/01/09 13:57:03 INFO mapred.LocalJobRunner: 
13/01/09 13:57:03 INFO mapred.LocalJobRunner: 
13/01/09 13:57:03 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/09 13:57:04 INFO mapred.JobClient:  map 100% reduce 0%
13/01/09 13:57:04 INFO mapred.JobClient: Job complete: job_local_0001
13/01/09 13:57:04 INFO mapred.JobClient: Counters: 8
13/01/09 13:57:04 INFO mapred.JobClient:   File Output Format Counters 
13/01/09 13:57:04 INFO mapred.JobClient:     Bytes Written=1442110
13/01/09 13:57:04 INFO mapred.JobClient:   File Input Format Counters 
13/01/09 13:57:04 INFO mapred.JobClient:     Bytes Read=8584811
13/01/09 13:57:04 INFO mapred.JobClient:   FileSystemCounters
13/01/09 13:57:04 INFO mapred.JobClient:     FILE_BYTES_READ=27366489
13/01/09 13:57:04 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=16634680
13/01/09 13:57:04 INFO mapred.JobClient:   Map-Reduce Framework
13/01/09 13:57:04 INFO mapred.JobClient:     Map input records=7592
13/01/09 13:57:04 INFO mapred.JobClient:     Spilled Records=0
13/01/09 13:57:04 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
13/01/09 13:57:04 INFO mapred.JobClient:     Map output records=7592
13/01/09 13:57:04 INFO test.TestNaiveBayesDriver: Complementary Results: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       6797	   89.5285%
Incorrectly Classified Instances        :        795	   10.4715%
Total Classified Instances              :       7592

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	h    	i    	j    	k    	l    	m    	n    	o    	p    	q    	r    	s    	t    	<--Classified as
289  	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	0    	1    	1    	3    	0    	0    	10   	5    	 |  309   	a     = alt.atheism
1    	327  	1    	19   	4    	18   	6    	0    	1    	0    	0    	3    	6    	0    	3    	1    	0    	0    	0    	1    	 |  391   	b     = comp.graphics
2    	37   	182  	87   	22   	28   	13   	0    	0    	0    	0    	6    	3    	1    	2    	0    	0    	0    	3    	5    	 |  391   	c     = comp.os.ms-windows.misc
1    	6    	1    	362  	19   	4    	9    	1    	0    	0    	1    	2    	8    	0    	1    	0    	0    	0    	1    	0    	 |  416   	d     = comp.sys.ibm.pc.hardware
1    	4    	1    	12   	368  	2    	6    	1    	0    	0    	0    	2    	4    	0    	0    	0    	0    	0    	0    	0    	 |  401   	e     = comp.sys.mac.hardware
0    	15   	2    	6    	1    	377  	2    	1    	0    	1    	0    	0    	1    	0    	4    	0    	0    	0    	0    	1    	 |  411   	f     = comp.windows.x
0    	1    	0    	18   	8    	0    	333  	10   	2    	2    	1    	2    	7    	1    	3    	0    	0    	3    	0    	0    	 |  391   	g     = misc.forsale
0    	0    	0    	2    	2    	1    	8    	377  	7    	0    	0    	0    	2    	1    	1    	0    	0    	3    	0    	1    	 |  405   	h     = rec.autos
0    	0    	0    	1    	1    	0    	5    	7    	385  	0    	0    	0    	1    	2    	1    	0    	0    	0    	1    	0    	 |  404   	i     = rec.motorcycles
0    	0    	0    	0    	2    	0    	1    	0    	2    	407  	3    	0    	1    	0    	1    	0    	0    	0    	0    	0    	 |  417   	j     = rec.sport.baseball
0    	0    	0    	1    	0    	0    	0    	0    	1    	5    	391  	0    	0    	0    	0    	0    	0    	0    	0    	0    	 |  398   	k     = rec.sport.hockey
0    	3    	1    	0    	0    	1    	0    	1    	0    	0    	0    	385  	3    	2    	1    	0    	0    	4    	0    	2    	 |  403   	l     = sci.crypt
0    	6    	0    	17   	8    	2    	5    	2    	1    	1    	0    	1    	345  	0    	2    	0    	0    	1    	0    	0    	 |  391   	m     = sci.electronics
2    	2    	0    	1    	3    	2    	4    	2    	1    	0    	0    	1    	1    	341  	2    	0    	0    	2    	0    	1    	 |  365   	n     = sci.med
1    	4    	0    	0    	1    	2    	0    	0    	0    	1    	0    	0    	1    	3    	365  	0    	1    	0    	3    	1    	 |  383   	o     = sci.space
7    	1    	0    	1    	0    	0    	0    	0    	0    	0    	1    	0    	0    	3    	0    	383  	1    	0    	10   	0    	 |  407   	p     = soc.religion.christian
1    	1    	0    	0    	1    	0    	0    	0    	0    	1    	1    	0    	0    	0    	1    	1    	363  	1    	0    	4    	 |  375   	q     = talk.politics.mideast
0    	0    	0    	0    	0    	1    	0    	1    	2    	2    	0    	3    	0    	1    	0    	0    	0    	341  	0    	11   	 |  362   	r     = talk.politics.guns
35   	1    	0    	1    	0    	0    	0    	2    	0    	1    	1    	0    	0    	0    	0    	9    	2    	6    	184  	7    	 |  249   	s     = talk.religion.misc
1    	0    	0    	1    	2    	0    	0    	1    	0    	1    	0    	2    	0    	2    	3    	0    	7    	10   	1    	292  	 |  323   	t     = talk.politics.misc


13/01/09 13:57:04 INFO driver.MahoutDriver: Program took 11437 ms (Minutes: 0.19061666666666666)

This might be interesting.
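
As a quick sanity check on the two Summary blocks above, the percentages can be recomputed from the raw counts. The following is only a minimal sketch in Python: the six summary lines are pasted in by hand (with whitespace simplified), and the names LOG_TEXT, summaries and RESULTS are my own, not anything provided by Mahout.

```python
import re

# Summary lines copied by hand from the two testnb runs above
# (first block: training set, second block: holdout set).
LOG_TEXT = """\
Correctly Classified Instances          :      11150    99.0759%
Incorrectly Classified Instances        :        104     0.9241%
Total Classified Instances              :      11254
Correctly Classified Instances          :       6797    89.5285%
Incorrectly Classified Instances        :        795    10.4715%
Total Classified Instances              :       7592
"""


def summaries(text):
    """Return one (correct, incorrect, total) tuple per Summary block."""
    def counts(prefix):
        # "^" with re.M anchors at line start, so "Correctly" does not
        # match inside "Incorrectly".
        pattern = r"^" + prefix + r" Classified Instances\s*:\s*(\d+)"
        return [int(n) for n in re.findall(pattern, text, re.M)]

    return list(zip(counts("Correctly"), counts("Incorrectly"), counts("Total")))


RESULTS = summaries(LOG_TEXT)

if __name__ == "__main__":
    for name, (correct, incorrect, total) in zip(["training set", "holdout set"], RESULTS):
        # Accuracy is simply correct / total.
        print(f"{name}: {correct}/{total} = {100 * correct / total:.4f}% correct")
```

The gap between the two numbers is expected: the first run tests the model on the same documents it was trained on, while the second uses documents it has not seen.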

To learn how to read the test results, I read through the following articles.

Apache Mahout: Scalable machine learning for everyone
Introducing Apache Mahout
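
As a small illustration of reading the confusion matrix, here is a rough sketch in Python that takes three rows from the holdout matrix above and works out, for each, the fraction classified correctly and the most common wrong label. It assumes each row is the true newsgroup and each column the group the classifier assigned (which is what the "<--Classified as" header and the row totals suggest). The names LABELS, ROWS, recall and top_confusion are made up for this example.

```python
# Column order a..t, copied from the legend of the confusion matrix above.
LABELS = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.mideast",
    "talk.politics.guns", "talk.religion.misc", "talk.politics.misc",
]

# Three rows copied by hand from the holdout-set matrix (rows a, c and s).
ROWS = {
    "alt.atheism":
        [289, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 3, 0, 0, 10, 5],
    "comp.os.ms-windows.misc":
        [2, 37, 182, 87, 22, 28, 13, 0, 0, 0, 0, 6, 3, 1, 2, 0, 0, 0, 3, 5],
    "talk.religion.misc":
        [35, 1, 0, 1, 0, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 9, 2, 6, 184, 7],
}


def recall(label, counts):
    """Fraction of documents of this group that were classified correctly."""
    return counts[LABELS.index(label)] / sum(counts)


def top_confusion(label, counts):
    """Return (wrong label, count) for the largest off-diagonal cell in the row."""
    diag = LABELS.index(label)
    wrong = max((k for k in range(len(counts)) if k != diag), key=lambda k: counts[k])
    return LABELS[wrong], counts[wrong]


if __name__ == "__main__":
    for label, counts in ROWS.items():
        other, n = top_confusion(label, counts)
        print(f"{label}: recall {recall(label, counts):.3f}, "
              f"most often mistaken for {other} ({n} documents)")
```

Read this way, the weakest rows are the ones with close neighbours: comp.os.ms-windows.misc leaks mostly into comp.sys.ibm.pc.hardware, and talk.religion.misc into alt.atheism.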