Avro >> mail # user >> Avro MapReduce (MR1): Prevent Key from being output by reducer when using Pair schema


Re: Avro MapReduce (MR1): Prevent Key from being output by reducer when using Pair schema
Thanks Ed! Can you also file an improvement JIRA under
https://issues.apache.org/jira/browse/AVRO with a patch that changes
it to make more sense?

On Thu, Jan 16, 2014 at 5:14 PM, ed <[EMAIL PROTECTED]> wrote:
> Hi Harsh,
>
> Thank you for your response, which was invaluable in helping me figure out
> my issue.  The Java-Doc is in fact incorrect when it states that
> AvroJob.setOutputSchema cannot accept non-Pair configs; as it turns out, it
> can.  What was throwing me off is that if you use AvroJob.setOutputSchema to
> set a non-Pair config, then you also need to call AvroJob.setMapOutputSchema
> (which does require the use of Pair).  Otherwise, by default, the map output
> schema gets set to whatever you set in setOutputSchema, and if that is
> non-Pair you'll get an error at runtime.
>
> Maybe the JavaDoc should say something along the lines of:
>
>> Configure a job's output schema. If this is not a Pair schema, then you
>> must explicitly set the job's map output schema using setMapOutputSchema.
>
>
> Thank you!
>
> Best Regards,
>
> Ed
>
>
>
>
> On Thu, Jan 16, 2014 at 6:47 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>> Hello Ed,
>>
>> The AvroReducer per
>>
>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroReducer.html
>> has a simple spec of <K,V,OUT>, where OUT can be any record type and
>> not necessarily a Pair<KO,VO> type.
>>
>> AvroJob.setOutputSchema(…) should accept non-pair configs. I think its
>> java-doc is incorrect though. I wrote a test case yesterday at
>> http://issues.apache.org/jira/browse/AVRO-1439, in which I set a
>> non-Pair schema via the same call without any trouble. We could get
>> the java-doc fixed, if it is indeed wrong.
>>
>> On Thu, Jan 16, 2014 at 2:14 PM, ed <[EMAIL PROTECTED]> wrote:
>> > Hello,
>> >
>> > I am currently reading in lots of small avro files and then writing them out
>> > into one large avro file using Map Reduce MR1.  I'm trying to do this using
>> > the AvroMapper and AvroReducer and it's almost working how I want.
>> >
>> > The problem right now is that it looks like I have to use
>> > "org.apache.avro.mapred.Pair" if I use "AvroJob.setOutputSchema".  Is there
>> > a way to output a Pair schema from AvroReducer and have the "key" in that
>> > schema be ignored (i.e., not included in the output from the reducer)?
>> > Right now when I check the Reducer output there is an added field in each
>> > record called "key" which I'd like to not have there.
>> >
>> > Essentially I'm looking for something like NullWritable where the key will
>> > just be ignored in the final output.
>> >
>> > Thank you for any assistance or guidance you can provide!
>> >
>> > Best Regards,
>> >
>> > Ed
>>
>>
>>
>> --
>> Harsh J
>
>

--
Harsh J
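
For anyone landing on this thread later, the configuration Ed describes can be sketched roughly as follows. This is only an illustration, not code from the thread or from AVRO-1439: the class name NonPairOutputConfig, the method configureSchemas, and the choice of a string shuffle key are assumptions made up for the example. It needs the Avro mapred and Hadoop MR1 jars on the classpath.

```java
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.JobConf;

public class NonPairOutputConfig {

  /**
   * Hypothetical helper: configures an MR1 Avro job whose reducer writes
   * plain records (no "key" field) while the shuffle still uses a Pair.
   */
  public static void configureSchemas(JobConf job, Schema recordSchema) {
    // Assumption for this sketch: the mapper emits a string key.
    Schema keySchema = Schema.create(Schema.Type.STRING);

    AvroJob.setInputSchema(job, recordSchema);

    // The map output must be a Pair schema. Set it explicitly; otherwise it
    // defaults to the job output schema, which is not a Pair here.
    AvroJob.setMapOutputSchema(job, Pair.getPairSchema(keySchema, recordSchema));

    // Final output schema: the bare record, so no "key" field is written.
    AvroJob.setOutputSchema(job, recordSchema);
  }
}
```

With schemas set this way, the reducer would be declared as an AvroReducer<K, V, OUT> whose OUT type is the record itself rather than a Pair, and it would collect only the values.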