HDFS >> mail # user >> the different size of file through hadoop streaming


the different size of file through hadoop streaming
Hello,
I processed a file using Hadoop Streaming, but I found that streaming adds
a 0x09 byte (tab) before each 0x0a (newline), so the file is changed by the
streaming job. Can someone tell me why this byte is added to the output?

[zhouhh@Hadoop48 ~]$ ls -l README.txt
-rw-r--r-- 1 zhouhh zhouhh 1399 Feb  1 10:53 README.txt

[zhouhh@Hadoop48 ~]$ wc README.txt
  34  182 1399 README.txt
[zhouhh@Hadoop48 ~]$ hadoop fs -ls
Found 3 items
-rw-r--r--   2 zhouhh supergroup       9358 2013-01-10 17:52 /user/zhouhh/fsimage
drwxr-xr-x   - zhouhh supergroup          0 2013-02-01 10:30 /user/zhouhh/gz
-rw-r--r--   2 zhouhh supergroup         65 2012-09-26 14:10 /user/zhouhh/test中文.txt
[zhouhh@Hadoop48 ~]$ hadoop fs -put README.txt .
[zhouhh@Hadoop48 ~]$ hadoop fs -ls
Found 4 items
-rw-r--r--   2 zhouhh supergroup       1399 2013-02-01 10:56 /user/zhouhh/README.txt
-rw-r--r--   2 zhouhh supergroup       9358 2013-01-10 17:52 /user/zhouhh/fsimage
drwxr-xr-x   - zhouhh supergroup          0 2013-02-01 10:30 /user/zhouhh/gz
-rw-r--r--   2 zhouhh supergroup         65 2012-09-26 14:10 /user/zhouhh/test中文.txt
[zhouhh@Hadoop48 ~]$ hadoop fs -ls README.txt
Found 1 items
-rw-r--r--   2 zhouhh supergroup       1399 2013-02-01 10:56 /user/zhouhh/README.txt

[zhouhh@Hadoop48 ~]$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input README.txt -output wordcount -mapper /bin/cat -reducer /bin/sort
[zhouhh@Hadoop48 ~]$ hadoop fs -ls wordcount/part*
Found 1 items
-rw-r--r--   2 zhouhh supergroup       *1433* 2013-02-01 11:20 /user/zhouhh/wordcount/part-00000
[zhouhh@Hadoop48 ~]$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input README.txt -output wordcount1 -mapper /bin/cat -reducer /usr/bin/wc

[zhouhh@Hadoop48 ~]$ hadoop fs -cat wordcount1/p*
     34     182    *1433*

Here is the start of the hex dump of the two files.

sort README.txt:
  0000000: 0a0a 0a0a 0a0a 0a61 6c67 6f72 6974 686d  .......algorithm
  0000010: 732e 2020 5468 6520 666f 726d 2061 6e64  s.  The form and
  0000020: 206d 616e 6e65 7220 6f66 2074 6869 7320   manner of this
  0000030: 4170 6163 6865 2053 6f66 7477 6172 6520  Apache Software
  0000040: 466f 756e 6461 7469 6f6e 0a61 6e64 206f  Foundation.and o

Streaming README.txt with /bin/sort as the reducer:
  0000000: *090a 090a 090a 090a 090a 090a 090a* 616c  ..............al
  0000010: 676f 7269 7468 6d73 2e20 2054 6865 2066  gorithms.  The f
  0000020: 6f72 6d20 616e 6420 6d61 6e6e 6572 206f  orm and manner o
  0000030: 6620 7468 6973 2041 7061 6368 6520 536f  f this Apache So
  0000040: 6674 7761 7265 2046 6f75 6e64 6174 696f  ftware Foundatio
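
To check how often this pattern occurs, a small sketch like the following could count the 0x09 bytes that sit directly before a 0x0a. It assumes the job output has first been copied to a local file and read as bytes. The function name and the sample byte strings are made up for illustration and only mirror the first bytes of the two dumps.

```python
def count_tab_before_newline(data: bytes) -> int:
    """Count occurrences of 0x09 immediately followed by 0x0a."""
    return data.count(b"\t\n")


# Hypothetical samples mirroring the start of the two hex dumps.
plain = b"\n\n\n\n\n\n\nalgorithms.  The form and"
streamed = b"\t\n\t\n\t\n\t\n\t\n\t\n\t\nalgorithms.  The form and"

print(count_tab_before_newline(plain))     # 0
print(count_tab_before_newline(streamed))  # 7
print(len(streamed) - len(plain))          # 7 extra bytes, one per newline
```

For a real file, the same function could be applied to open(path, "rb").read(), where path is whatever local copy of part-00000 is available.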

The file has 34 lines, so the size grows by 34 bytes of 0x09:
1399 + 34 = 1433. Why does this happen?
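
The arithmetic is consistent with exactly one extra byte per line. A minimal sketch of that bookkeeping follows. It rests on an assumption that is not confirmed anywhere in this message: that streaming treats each line as a key and a value separated by a tab, takes the whole line as the key with an empty value when there is no tab, and writes the pair back as key, tab, value, newline. The function name and the sample text are hypothetical.

```python
def write_as_key_value(text: bytes, sep: bytes = b"\t") -> bytes:
    """Re-emit each line as key + sep + value + newline.

    Assumed model, not confirmed behaviour: a line without `sep`
    becomes a key with an empty value, so it gains one separator byte.
    The input is assumed to end with a newline.
    """
    out = []
    for line in text.split(b"\n")[:-1]:
        key, _found, value = line.partition(sep)
        out.append(key + sep + value + b"\n")
    return b"".join(out)


sample = b"alpha\n\nbeta gamma\n"       # 3 lines, 18 bytes, no tabs
result = write_as_key_value(sample)
lines = sample.count(b"\n")

print(len(sample), lines, len(result))  # 18 3 21
print(1399 + 34)                        # 1433, the size reported for part-00000
```

Under this model a line that already contains a tab would come back unchanged, so only tab-free lines grow by one byte.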

Best regards,
Andy