Re: Sequence file format in python and serialization
Jeremy Lewi 2011-06-02, 13:12
If you want to use complex types in a streaming job, I think you need to
encode the values using the typedbytes format within the sequence file;
i.e., the key and value in the sequence file are both typedbytes writables.
This is independent of the language the mapper and reducer are written in,
because the values need to be encoded as a byte stream in such a way
that the binary stream doesn't contain any characters that would cause
problems when passed via stdin/stdout.
In Python, your mapper/reducer reads strings from stdin, which can
be decoded from typedbytes into Python types.
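
To make the decoding step concrete, here is a minimal sketch of a typedbytes round trip in plain Python. It covers only a handful of the format's type codes (int, long, double, string, list) and assumes the stream is simply a series of typed values written back to back. The function names `encode` and `decode` are my own for illustration, not part of any library, so treat it as a rough outline and not a replacement for a real typedbytes implementation.

```python
import io
import struct

# Subset of the typedbytes type codes: each value is one type byte
# followed by a big-endian payload.
INT, LONG, DOUBLE, STRING, LIST = 3, 4, 6, 7, 9
LIST_END = 255  # marker byte that terminates a list


def encode(value):
    """Encode a Python value as typedbytes (illustrative subset only)."""
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return struct.pack(">Bi", INT, value)
        return struct.pack(">Bq", LONG, value)
    if isinstance(value, float):
        return struct.pack(">Bd", DOUBLE, value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return struct.pack(">Bi", STRING, len(raw)) + raw
    if isinstance(value, list):
        body = b"".join(encode(item) for item in value)
        return bytes([LIST]) + body + bytes([LIST_END])
    raise TypeError("unsupported type: %r" % type(value))


def _read_exact(stream, n):
    """Read exactly n bytes or raise EOFError on a truncated stream."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated typedbytes stream")
    return data


def _decode_body(code, stream):
    """Decode the payload that follows an already-read type code."""
    if code == INT:
        return struct.unpack(">i", _read_exact(stream, 4))[0]
    if code == LONG:
        return struct.unpack(">q", _read_exact(stream, 8))[0]
    if code == DOUBLE:
        return struct.unpack(">d", _read_exact(stream, 8))[0]
    if code == STRING:
        (length,) = struct.unpack(">i", _read_exact(stream, 4))
        return _read_exact(stream, length).decode("utf-8")
    if code == LIST:
        items = []
        while True:
            next_code = _read_exact(stream, 1)[0]
            if next_code == LIST_END:
                return items
            items.append(_decode_body(next_code, stream))
    raise ValueError("type code %d not handled in this sketch" % code)


def decode(stream):
    """Decode one typed value from a binary file-like object."""
    return _decode_body(_read_exact(stream, 1)[0], stream)
```

In a real mapper you would call `decode` repeatedly on a binary stream such as `sys.stdin.buffer`, reading a key and then a value for each record; exactly how streaming frames the records is something I am assuming here, so check it against your Hadoop version.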
The easiest way to do this is to use dumbo
(https://github.com/klbostee/dumbo/wiki) to write your Python
mapper/reducer. The dumbo module handles the
serialization/deserialization between typedbytes and native Python types.
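
For comparison, a dumbo job looks roughly like the sketch below. The mapper and reducer are ordinary generator functions that receive values already decoded into Python objects. I am assuming, purely for illustration, that each value in the sequence file is a list of strings; your actual value type will differ.

```python
def mapper(key, value):
    # Assumption for this sketch: each value is a list of strings that
    # dumbo has already decoded from typedbytes.
    for item in value:
        yield item, 1


def reducer(key, values):
    # values is an iterator over the counts emitted for this key.
    yield key, sum(values)


def main():
    # dumbo is a third-party module; it is imported here so the two
    # functions above can be exercised without it installed.
    import dumbo
    dumbo.run(mapper, reducer)
```

A script meant to be launched with `dumbo start` would end with the usual `if __name__ == "__main__": main()` guard; it is left out here so the functions can be imported and tested on their own.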
On Thu, 2011-06-02 at 00:06 -0700, Mapred Learn wrote:
> I have a question regarding using sequence file input format in hadoop
> streaming jar with mappers and reducers written in python.
> If I use sequence file as input format for streaming jar and use
> mappers written in python, can I take care of serialization and
> de-serialization in mapper/reducer code? For example, if I have complex
> data-types in sequence file's values, can I de-serialize them in
> python and run map-red job using streaming jar?
> Thanks in advance,