Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> Problem with streaming exact binary chunks

Copy link to this message
RE: Problem with streaming exact binary chunks

The only way that I could find was to override the various InputWriter and OutputWriter classes.
as defined by the configuration settings
stream.reduce. output.reader.class
which was painful. Hopefully someone will tell you the _correct_ way to do this.
If not I will provide more details.


Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699

-----Original Message-----
From: Youssef Hatem [mailto:[EMAIL PROTECTED]]
Sent: 09 October 2013 12:14
Subject: Problem with streaming exact binary chunks


I wrote a very simple InputFormat and RecordReader to send binary data to mappers. Binary data can contain anything (including \n, \t, \r), here is what next() may actually send:

public class MyRecordReader implements
        RecordReader<BytesWritable, BytesWritable> {
    public boolean next(BytesWritable key, BytesWritable ignore)
            throws IOException {

        byte[] result = new byte[8];
        for (int i = 0; i < result.length; ++i)
            result[i] = (byte)(i+1);
        result[3] = (byte)'\n';
        result[4] = (byte)'\n';

        key.set(result, 0, result.length);
        return true;

As you can see I am using BytesWritable to send eight bytes: 01 02 03 0a 0a 06 07 08, I also use Hadoop-1722 typed bytes (by setting -D stream.map.input=typedbytes).

According to the documentation of typed bytes the mapper should receive the following byte sequence:
00 00 00 08 01 02 03 0a 0a 06 07 08

However bytes are somehow modified and I get the following sequence instead:
00 00 00 08 01 02 03 09 0a 09 0a 06 07 08

0a = '\n'
09 = '\t'

It seems that Hadoop (streaming?) parsed the new line character as a separator and put '\t' which is the key/value separator for streaming I assume.

Is there any work around to send *exactly* the same bytes sequence no matter what characters are in the sequence? Thanks in advance.

Best regards,
Youssef Hatem