MapReduce >> mail # user >> Problem with streaming exact binary chunks


RE: Problem with streaming exact binary chunks
Hi,

The only way that I could find was to override the various InputWriter and OutputWriter classes, as selected by the configuration settings

stream.map.input.writer.class
stream.map.output.reader.class
stream.reduce.input.writer.class
stream.reduce.output.reader.class

which was painful. Hopefully someone will tell you the _correct_ way to do this; if not, I will provide more details.
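For reference, those overrides would be wired in on the streaming command line roughly as below. The class names are hypothetical placeholders for your own implementations, and the jar path varies by installation; this is a config sketch, not a tested invocation:

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D stream.map.input.writer.class=com.example.RawBytesInputWriter \
  -D stream.map.output.reader.class=com.example.RawBytesOutputReader \
  -D stream.reduce.input.writer.class=com.example.RawBytesInputWriter \
  -D stream.reduce.output.reader.class=com.example.RawBytesOutputReader \
  -input in -output out -mapper mymapper -reducer myreducer
```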

Regards,

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: [EMAIL PROTECTED]

-----Original Message-----
From: Youssef Hatem [mailto:[EMAIL PROTECTED]]
Sent: 09 October 2013 12:14
To: [EMAIL PROTECTED]
Subject: Problem with streaming exact binary chunks

Hello,

I wrote a very simple InputFormat and RecordReader to send binary data to mappers. The binary data can contain anything (including \n, \t, and \r); here is what next() might actually emit:

public class MyRecordReader implements
        RecordReader<BytesWritable, BytesWritable> {
    ...
    public boolean next(BytesWritable key, BytesWritable ignore)
            throws IOException {
        ...

        // Build the eight-byte payload 01 02 03 0a 0a 06 07 08.
        byte[] result = new byte[8];
        for (int i = 0; i < result.length; ++i)
            result[i] = (byte) (i + 1);
        result[3] = (byte) '\n';
        result[4] = (byte) '\n';

        key.set(result, 0, result.length);
        return true;
    }
}

As you can see, I am using BytesWritable to send eight bytes: 01 02 03 0a 0a 06 07 08. I also use HADOOP-1722 typed bytes (by setting -D stream.map.input=typedbytes).

According to the documentation of typed bytes the mapper should receive the following byte sequence:
00 00 00 08 01 02 03 0a 0a 06 07 08
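For concreteness, that length-prefixed framing can be reproduced with plain java.io. This is only a sketch of the wire format quoted above, not Hadoop's actual writer (the class and method names are made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TypedBytesFraming {
    // Frame a payload as a 4-byte big-endian length followed by the raw
    // bytes, matching the sequence quoted above.
    static byte[] frame(byte[] payload) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(payload.length); // 00 00 00 08 for eight bytes
            out.write(payload);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new AssertionError(e); // cannot happen on an in-memory stream
        }
    }

    // Hex-dump helper for comparing against the sequences in this thread.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, '\n', '\n', 6, 7, 8};
        System.out.println(hex(frame(payload)));
        // 00 00 00 08 01 02 03 0a 0a 06 07 08
    }
}
```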

However bytes are somehow modified and I get the following sequence instead:
00 00 00 08 01 02 03 09 0a 09 0a 06 07 08

0a = '\n'
09 = '\t'

It seems that Hadoop (streaming?) parses the newline character as a record separator and inserts '\t', which I assume is the key/value separator used by streaming.
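That hypothesis fits the observed bytes exactly: if every '\n' in the stream comes out as the '\t' separator followed by '\n', the two 0a bytes become 09 0a 09 0a. A rough simulation of the suspected behaviour (not Hadoop's actual code; the class name is made up):

```java
import java.io.ByteArrayOutputStream;

public class SeparatorSimulation {
    // Simulate the suspected text-mode handling: each '\n' is treated as a
    // record boundary and is emitted as '\t' (key/value separator) + '\n'.
    static byte[] mangle(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : input) {
            if (b == '\n') out.write('\t');
            out.write(b);
        }
        return out.toByteArray();
    }

    // Hex-dump helper for comparing against the sequences in this thread.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The framed sequence the mapper should have received.
        byte[] expected = {0, 0, 0, 8, 1, 2, 3, '\n', '\n', 6, 7, 8};
        System.out.println(hex(mangle(expected)));
        // 00 00 00 08 01 02 03 09 0a 09 0a 06 07 08
    }
}
```

The output matches the corrupted sequence observed above byte for byte.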

Is there any workaround to send *exactly* the same byte sequence, no matter which characters it contains? Thanks in advance.

Best regards,
Youssef Hatem