MapReduce >> mail # user >> mapreduce and python


Re: mapreduce and python
Hassen,

I've been very successful using Hadoop Streaming, Dumbo, and TypedBytes
to implement mappers and reducers in Python.

TypedBytes is a Hadoop encoding format that allows binary data
(including lists and maps) to be serialized in a form that can safely
be passed to mappers and reducers via the command line through Hadoop
Streaming.
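
To give a feel for the encoding, here is a minimal sketch of a TypedBytes
encoder for a few types. The type codes and layout are as I recall them
from the Hadoop typedbytes package documentation (one type-code byte,
then a big-endian payload), so treat them as an assumption and check
them against your Hadoop version. The function name encode is my own.

```python
import struct

# Type codes as recalled from the Hadoop typedbytes package docs
# (assumption: verify against your Hadoop version).
BYTES, BOOL, INT, STRING, LIST, LIST_END = 0, 2, 3, 7, 9, 255


def encode(obj):
    """Return a typed-bytes encoding of a bytes, bool, int, str, or list."""
    if isinstance(obj, bytes):
        # type code, 4-byte big-endian length, then the raw bytes
        return struct.pack(">Bi", BYTES, len(obj)) + obj
    if isinstance(obj, bool):
        # checked before int because bool is a subclass of int
        return struct.pack(">BB", BOOL, int(obj))
    if isinstance(obj, int):
        # 4-byte signed big-endian integer
        return struct.pack(">Bi", INT, obj)
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return struct.pack(">Bi", STRING, len(raw)) + raw
    if isinstance(obj, list):
        # list items back to back, closed by an end marker byte
        body = b"".join(encode(item) for item in obj)
        return bytes([LIST]) + body + bytes([LIST_END])
    raise TypeError("unsupported type: %r" % type(obj))
```

Because every value carries its own type and length, arbitrary binary
payloads survive the trip without any newline or tab escaping.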

Dumbo is a Python library that makes it easy to implement your mappers
and reducers in Python. In particular, it handles decoding the
TypedBytes-encoded data into native Python types.
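
A Dumbo job is roughly a pair of generator functions handed to
dumbo.run. The sketch below assumes each input value arrives as one
binary record already decoded to a Python bytes object; counting
records by length is only a placeholder for real processing.

```python
try:
    import dumbo  # third-party; only needed when the job actually runs
except ImportError:
    dumbo = None  # lets the functions be tried locally without Dumbo


def mapper(key, value):
    # Assumption: value is one binary record, decoded to bytes by Dumbo.
    yield len(value), 1


def reducer(key, values):
    # values iterates over all the counts emitted for this record length
    yield key, sum(values)


if __name__ == "__main__" and dumbo is not None:
    dumbo.run(mapper, reducer)
```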

J
On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
> Hassen,
>
>
> I have lots of binary data that I parse using Python streaming.
>
>
> The way I do this is to stream the binary data into sequence files (I
> save the binary data object as the key and (null) as the value).
>
>
> Each key then gets written back to me line by line, key by key for an
> entire block when streaming.
>
>
> To have this work in streaming on the command line you need to
> use -inputformat SequenceFileAsTextInputFormat
>
>
> To create the sequence files I have a jar file that reads from a
> BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer
>
>
> I am not sure if you can do this for your data, but if not, you can
> write your own InputFormat.
>
>
> good luck!
>
>
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> Twitter: @allthingshadoop
> */
>
> On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <[EMAIL PROTECTED]>
> wrote:
>         Dear all,
>        
>         Is it possible to have binary input to a mapper written in
>         Python?
>        
>         Thank you
>         Hassen
>
>
>
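
For the sequence file approach Joe describes above, the Python side
could look something like the sketch below. It assumes that the keys are
BytesWritable (which, as far as I know, SequenceFileAsTextInputFormat
renders as space-separated hex pairs via toString()) and that each line
reaches the mapper as key, tab, value. The names parse_line and
map_lines are hypothetical, and emitting the record length is just a
placeholder.

```python
def parse_line(line):
    """Split one streaming input line into (key_bytes, value_text)."""
    key_text, _, value_text = line.rstrip("\n").partition("\t")
    # Assumption: the key is a BytesWritable rendered as "de ad be ef".
    key_bytes = bytes.fromhex(key_text)
    return key_bytes, value_text


def map_lines(lines):
    """Yield one tab-separated output line per non-empty input line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record, _ = parse_line(line)
        # Placeholder processing: emit the record length with a count of 1.
        yield "%d\t1\n" % len(record)


# As a streaming mapper, the script would import sys and end with:
#     sys.stdout.writelines(map_lines(sys.stdin))
```

If your keys are some other Writable type, the text form will differ
and parse_line has to change accordingly.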