streaming secondary sort not working?
Paul Hubenig 2013-01-08, 02:38
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.*.jar \
    -input /export/home/phubenig/fileDataInput \
    -output /export/home/phubenig/fileDataOutput \
    -mapper /export/home/phubenig/fileDataJob/non_map.py \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    num.key.fields.for.partition=1 \
    stream.num.map.output.key.fields=7 \
    mapred.text.key.comparator.options="-k1,1 -k2,7" \
    mapred.text.key.partitioner.options="-k1,1" \
    -file /export/home/phubenig/fileDataInput/fileData.txt

~~~~~~~~~~~~

Input file (tab separated):

C k d m n h b
A w g i w t l
A w f y m y h
C u r d h c b
A y q w m g k
B w b s d q g
C q j j d f b
C l n x a g f
C o r m a g p
C v w l a t f
B c l f n t u
B x t o e x p
A q m r d q i
C e i o u g l
A x m w u o i
A j p m d k r
C s t m r m t
B s w l f k y
B a f r v f x
A s z d v s h
C o x j c w r

Sorts on first key (the capital letters) but does not perform the secondary sort on the other fields.

Does anyone see the problem? What am I missing? Seems like it should work.

Thanks for your time.

Paul

non_map.py:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    stripped = line.rstrip()
    print(stripped)
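
To make the expected result concrete, here is a minimal local sketch of the ordering that the "-k1,1 -k2,7" comparator options are meant to produce. It is plain Python, not Hadoop: the names sort_key and secondary_sort are made up for illustration, it assumes tab-separated records compared as text, and it models a single reducer only.

```python
# Local sketch (not Hadoop): the order expected from "-k1,1 -k2,7"
# on tab-separated records, with every field compared as plain text.
# sort_key and secondary_sort are hypothetical helper names.


def sort_key(line):
    """Return (field 1, fields 2-7) so ties on field 1 fall back to the rest."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0], fields[1:7])


def secondary_sort(lines):
    """Sort records by field 1, then by fields 2 through 7."""
    return sorted(lines, key=sort_key)


if __name__ == "__main__":
    # A few rows taken from the input file above.
    sample = [
        "C\tk\td\tm\tn\th\tb",
        "A\tw\tg\ti\tw\tt\tl",
        "A\tw\tf\ty\tm\ty\th",
        "B\tw\tb\ts\td\tq\tg",
        "A\tj\tp\tm\td\tk\tr",
    ]
    for record in secondary_sort(sample):
        print(record)
```

If the secondary sort were taking effect, the A rows in the job output should come out in this kind of order (for example "A j ..." before "A w f ..." before "A w g ..."), rather than in arrival order within each capital letter.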