Re: how to specify MultipleOutputs and MultipleInputs when using the Avro mapred API
If you're asking about the advantages of using Avro for the intermediate
data, then this is what I've noticed so far:

Smaller intermediate outputs (Avro's binary serialization is very
compact), and compression with its deflate provision isn't difficult to
enable either.

That and the raw comparators help speed up the intermediate stages.
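
For concreteness, a minimal sketch of that setup with the old
org.apache.avro.mapred API.  The schema, MyAvroMapper and MyAvroReducer
are placeholders, and the deflate call assumes your Avro build has
AvroOutputFormat.setDeflateLevel:

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroOutputFormat;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class AvroIntermediateJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(AvroIntermediateJob.class);
    job.setJobName("avro-intermediate-example");

    // Placeholder record schema standing in for the real log schema.
    Schema logSchema = Schema.parse(
        "{\"type\":\"record\",\"name\":\"LogRecord\",\"fields\":"
        + "[{\"name\":\"line\",\"type\":\"string\"}]}");
    Schema stringSchema = Schema.create(Schema.Type.STRING);

    // Job input and final output are Avro data files.
    AvroJob.setInputSchema(job, logSchema);
    AvroJob.setOutputSchema(job, logSchema);

    // The intermediate (map output) data is Avro too: a Pair schema of
    // key and value.  With this set, the shuffle uses Avro's binary
    // serialization and its raw key comparator.
    AvroJob.setMapOutputSchema(job,
        Pair.getPairSchema(stringSchema, logSchema));

    AvroJob.setMapperClass(job, MyAvroMapper.class);    // extends AvroMapper
    AvroJob.setReducerClass(job, MyAvroReducer.class);  // extends AvroReducer

    // Deflate-compress the final Avro output files (levels 1-9).
    AvroOutputFormat.setDeflateLevel(job, 6);
    // Compress the intermediate map output as well (plain Hadoop setting).
    job.setCompressMapOutput(true);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}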

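As for the MultipleInputs question itself, a rough sketch of the approach
described in the quoted reply below: a plain mapper that reads Avro
records through AvroInputFormat and emits ordinary Hadoop types.  The
field names, the path and the Text/Text output types are made up; the
only real requirement is that this mapper and the HBase-side mapper emit
the same key/value classes.

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroInputFormat;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleInputs;

/**
 * Reads Avro log records and emits plain Text/Text pairs, so its output
 * "type" matches whatever the HBase-side mapper emits.
 */
public class AvroLogMapper extends MapReduceBase
    implements Mapper<AvroWrapper<GenericRecord>, NullWritable, Text, Text> {

  public void map(AvroWrapper<GenericRecord> wrapper, NullWritable ignore,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    GenericRecord record = wrapper.datum();
    // "user_id" and "event" are placeholder field names from the log schema.
    output.collect(new Text(String.valueOf(record.get("user_id"))),
                   new Text(String.valueOf(record.get("event"))));
  }

  /**
   * Wiring: the Avro log path is read by this mapper; the HBase input
   * would be registered the same way with its own InputFormat and a
   * mapper that also emits Text/Text.
   */
  public static void configureInputs(JobConf job) {
    MultipleInputs.addInputPath(job, new Path("/logs/avro"),
        AvroInputFormat.class, AvroLogMapper.class);
    // MultipleInputs.addInputPath(job, <hbase input>, <hbase input
    //     format>, <hbase mapper emitting Text/Text>);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
  }
}

The reducer then deals only with Text keys and values; for the
reduce-side join it is common to tag each value with its source (for
example a small prefix) in the mappers.
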
On Wed, Aug 18, 2010 at 10:48 PM, ey-chih chow <[EMAIL PROTECTED]> wrote:
> Thanks.  But by doing it this way, what kind of advantage can we get from Avro?
> Ey-Chih
>
>> From: [EMAIL PROTECTED]
>> Date: Wed, 18 Aug 2010 19:39:17 +0530
>> Subject: Re: how to specify MultipleOutputs and MultipleInputs when using
>> the Avro mapred API
>> To: [EMAIL PROTECTED]
>>
>> If I got your issue right, all you need to ensure is that both your
>> mappers emit the same "type" of keys and values. This can easily be
>> done by implementing a custom Avro Mapper [which reads records from
>> Avro files, processes them, and emits ordinary Hadoop K/V types
>> instead of Avro datums, so that they match your HBase mapper's
>> collected outputs].
>>
>> Your reducer then shouldn't need to care about Avro at all.
>>
>> * Note: You may also use Avro as the intermediate K/V format, but it
>> might require some extra work to do so :)
>>
>> On Wed, Aug 18, 2010 at 6:45 PM, ey-chih chow <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > Let me rephrase my question to see if anybody is interested in
>> > answering it.
>> >
>> > For the new version of Avro, 1.4.0, the class hierarchies of
>> > AvroMapper and AvroReducer have changed: they now subclass Configured
>> > rather than subclassing MapReduceBase and implementing the Mapper and
>> > Reducer interfaces respectively.  The configuration of Avro mapred
>> > jobs is also different from that of other mapred jobs.  Furthermore,
>> > text log files have to be converted to the Avro format before Avro
>> > mapred jobs can process them.  If I have a mapred job that requires a
>> > reduce-side join of two inputs, one from HBase and the other from an
>> > imported log file in the Avro format, how can I configure the two
>> > mappers to process the inputs from HBase and the log file
>> > respectively?  Also, how can I configure an Avro reducer to generate
>> > multiple outputs?  For multiple inputs and outputs, I got some
>> > example programs from Tom White's Hadoop book, but I simply don't
>> > know what kind of changes I should make for the Avro case.
>> > Ey-Chih
>> >
>> > ________________________________
>> > From: [EMAIL PROTECTED]
>> > To: [EMAIL PROTECTED]
>> > Subject: how to specify MultipleOutputs and MultipleInputs when using the
>> > Avro mapred API
>> > Date: Mon, 16 Aug 2010 18:22:24 -0700
>> >
>> > Hi,
>> > I have a Map/Reduce job that requires multiple inputs and outputs.
>> > One of the inputs will be processed by a mapper and a reducer that
>> > are subclasses of AvroMapper and AvroReducer respectively, and the
>> > reducer has multiple outputs.  I would appreciate it if anybody could
>> > let me know how to configure the job to do this.
>> > Ey-Chih
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>
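
On the multiple-outputs half of the question quoted above: if the reducer
stays a plain (non-Avro) Hadoop reducer, as suggested earlier, the
old-API MultipleOutputs works as usual.  A rough sketch; JoinReducer, the
"errors" name and the ERROR prefix test are only illustrative:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

/** Plain reducer that writes some records to an extra named output. */
public class JoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Text value = values.next();
      if (value.toString().startsWith("ERROR")) {
        // Flagged records go to the extra "errors" output files.
        mos.getCollector("errors", reporter).collect(key, value);
      } else {
        output.collect(key, value);  // the job's default output
      }
    }
  }

  @Override
  public void close() throws IOException {
    mos.close();
  }

  /** Job setup: declare the named output (names must be alphanumeric). */
  public static void configureOutputs(JobConf job) {
    MultipleOutputs.addNamedOutput(job, "errors",
        TextOutputFormat.class, Text.class, Text.class);
  }
}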

--
Harsh J
www.harshj.com