Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce, mail # user - mapreduce and python


Copy link to this message
-
Re: mapreduce and python
Jeremy Lewi 2011-06-21, 05:13
Hassen,

I've been very succesful using Hadoop Streaming, Dumbo, and TypedBytes
as a solution for using python to implement mappers and reducers.

TypedBytes is a hadoop encoding format that allows binary data
(including lists and maps) to be encoded in a format that permits the
serialized data to safely be passed to mappers/reducers via the command
line through hadoop streaming.

Dumbo is a python library which makes it easy to implement your mappers
and reducers in python. In particular, it handles decoding the data
encoded as typedbytes to native python types.

J
On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
> Hassen,
>
>
> I have lots of binary data that I parse using Python streaming.
>
>
> The way I do this is stream the binary data into sequence files (the
> binary data object I save in the key and (null) as the value).
>
>
> Each key then gets written back to me line by line, key by key for an
> entire block when streaming.
>
>
> To have this work in streaming on the command line you need to
> use -inputformat SequenceFileAsTextInputFormat
>
>
> To create the sequence files I have a jar file that goes from
> BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer
>
>
> I am not sure if you can do this for your data but if not then make
> your own InputFormat.
>
>
> good luck!
>
>
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> Twitter: @allthingshadoop
> */
>
> On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <[EMAIL PROTECTED]>
> wrote:
>         Dear all,
>        
>         Is it possible to have a binary input to a map code written in
>         python?
>        
>         Thank you
>         Hassen
>
>
>