File size differs after Hadoop streaming
Hello,
I processed a file with Hadoop streaming and found that streaming adds a
0x09 byte before each 0x0a, so the file is changed by the streaming job.
Can someone tell me why this byte is added to the output?

[zhouhh@Hadoop48 ~]$ ls -l README.txt
-rw-r--r-- 1 zhouhh zhouhh 1399 Feb  1 10:53 README.txt

[zhouhh@Hadoop48 ~]$ wc README.txt
  34  182 1399 README.txt
[zhouhh@Hadoop48 ~]$ hadoop fs -ls
Found 3 items
-rw-r--r--   2 zhouhh supergroup       9358 2013-01-10 17:52 /user/zhouhh/fsimage
drwxr-xr-x   - zhouhh supergroup          0 2013-02-01 10:30 /user/zhouhh/gz
-rw-r--r--   2 zhouhh supergroup         65 2012-09-26 14:10 /user/zhouhh/test中文.txt
[zhouhh@Hadoop48 ~]$ hadoop fs -put README.txt .
[zhouhh@Hadoop48 ~]$ hadoop fs -ls
Found 4 items
-rw-r--r--   2 zhouhh supergroup       1399 2013-02-01 10:56 /user/zhouhh/README.txt
-rw-r--r--   2 zhouhh supergroup       9358 2013-01-10 17:52 /user/zhouhh/fsimage
drwxr-xr-x   - zhouhh supergroup          0 2013-02-01 10:30 /user/zhouhh/gz
-rw-r--r--   2 zhouhh supergroup         65 2012-09-26 14:10 /user/zhouhh/test中文.txt
[zhouhh@Hadoop48 ~]$ hadoop fs -ls README.txt
Found 1 items
-rw-r--r--   2 zhouhh supergroup       1399 2013-02-01 10:56 /user/zhouhh/README.txt

[zhouhh@Hadoop48 ~]$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input README.txt -output wordcount -mapper /bin/cat -reducer /bin/sort
[zhouhh@Hadoop48 ~]$ hadoop fs -ls wordcount/part*
Found 1 items
-rw-r--r--   2 zhouhh supergroup       *1433* 2013-02-01 11:20 /user/zhouhh/wordcount/part-00000
[zhouhh@Hadoop48 ~]$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input README.txt -output wordcount1 -mapper /bin/cat -reducer /usr/bin/wc

[zhouhh@Hadoop48 ~]$ hadoop fs -cat wordcount1/p*
     34     182    *1433*

Here is the start of the hex dump of each file.

sort README.txt:
  0000000: 0a0a 0a0a 0a0a 0a61 6c67 6f72 6974 686d  .......algorithm
  0000010: 732e 2020 5468 6520 666f 726d 2061 6e64  s.  The form and
  0000020: 206d 616e 6e65 7220 6f66 2074 6869 7320   manner of this
  0000030: 4170 6163 6865 2053 6f66 7477 6172 6520  Apache Software
  0000040: 466f 756e 6461 7469 6f6e 0a61 6e64 206f  Foundation.and o

README.txt after streaming with the sort reducer:
  0000000: *090a 090a 090a 090a 090a 090a 090a* 616c  ..............al
  0000010: 676f 7269 7468 6d73 2e20 2054 6865 2066  gorithms.  The f
  0000020: 6f72 6d20 616e 6420 6d61 6e6e 6572 206f  orm and manner o
  0000030: 6620 7468 6973 2041 7061 6368 6520 536f  f this Apache So
  0000040: 6674 7761 7265 2046 6f75 6e64 6174 696f  ftware Foundatio

The file has 34 lines, so its size grows by 34 bytes of 0x09:
1399 + 34 = 1433. Why does this happen?
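
To double-check the arithmetic, here is a minimal Python sketch that models only the change seen in the hex dump: a 0x09 byte inserted before every 0x0a. It does not call Hadoop, and it ignores the reordering done by the sort reducer. The function names and the sample data are hypothetical stand-ins, not the real README.txt.

```python
def add_tab_before_newlines(data: bytes) -> bytes:
    """Model the observed change: insert 0x09 before every 0x0a."""
    return data.replace(b"\n", b"\t\n")


def strip_tab_before_newlines(data: bytes) -> bytes:
    """Undo the change, assuming no original line ended with a tab."""
    return data.replace(b"\t\n", b"\n")


# Hypothetical stand-in for README.txt: 34 lines, 7 of them empty.
sample = b"\n" * 7 + b"".join(b"line %d\n" % i for i in range(27))

streamed = add_tab_before_newlines(sample)
line_count = sample.count(b"\n")

print(line_count)                   # number of lines in the sample
print(len(streamed) - len(sample))  # extra bytes, one per line
```

Both printed numbers are 34 for this sample, which matches the 1399 to 1433 jump. The pattern also fits the documented streaming convention that each line is a key and a value separated by a tab: a line with no tab is read as a key with an empty value, and the output is written as key, tab, value. I have not verified that this is what happens in this particular job.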

Best regards,
Andy