Avro >> mail # user >> Collecting union-ed Records in AvroReducer

Andrew Kenworthy 2011-12-08, 12:10
Gaurav Nanda 2011-12-08, 14:32
Doug Cutting 2011-12-08, 17:05
Re: Collecting union-ed Records in AvroReducer

On 12/8/11 4:10 AM, "Andrew Kenworthy" <[EMAIL PROTECTED]> wrote:
>is it possible to write/collect a union-ed record from an avro reducer?
>I have a reduce class (extending AvroReducer), and the output schema is a
>union schema of record type A and record type B. In the reduce logic I
>want to combine instances of A and B in the same datum, passing it to my
>AvroCollector. My code looks a bit like this:

If both records were created in the reducer, you can call collect twice,
once with each record.  Collect in general can be called as many times as
you wish.

If you want to combine two records into a single datum rather than emit
multiple datums, you do not want a union; you need a Record.  A union
datum holds exactly one of its branches at a time.

In short, do you want to emit both records individually or as a pair?  If
it is a pair, you need a Record; if it is multiple outputs or either/or,
you need a Union.
>Record unionRecord = new GenericData.Record(myUnionSchema); // not legal!
>unionRecord.put("type A", recordA);
>unionRecord.put("type B", recordB);
>but GenericData.Record constructor expects a Record Schema. How can I
>write both records such that they appear in the same output
> datum?

If your output is either one type or another, see Doug's answer.

For multiple datums, the output schema is a union of the two records (each
datum is either one or the other):

["RecordA", "RecordB"]

and the reducer calls collect once for each record.
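A minimal sketch of a reducer that emits two datums under a union output schema might look like the following. The class name, the field names ("key", "total"), and the Long input values are assumptions made for illustration; only AvroReducer, AvroCollector and GenericData.Record come from the thread.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroReducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: names and record fields are placeholders, not the poster's code.
class UnionReducer extends AvroReducer<CharSequence, Long, GenericRecord> {

  static final Schema RECORD_A = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordA\",\"fields\":"
    + "[{\"name\":\"key\",\"type\":\"string\"}]}");

  static final Schema RECORD_B = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordB\",\"fields\":"
    + "[{\"name\":\"total\",\"type\":\"long\"}]}");

  // The job's output schema: the union ["RecordA", "RecordB"].
  static final Schema OUTPUT = Schema.createUnion(Arrays.asList(RECORD_A, RECORD_B));

  @Override
  public void reduce(CharSequence key, Iterable<Long> values,
                     AvroCollector<GenericRecord> collector, Reporter reporter)
      throws IOException {
    long total = 0;
    for (Long value : values) {
      total += value;
    }

    // Each record is built against its own record schema, never the union schema.
    GenericRecord a = new GenericData.Record(RECORD_A);
    a.put("key", key.toString());

    GenericRecord b = new GenericData.Record(RECORD_B);
    b.put("total", total);

    // Two separate datums; each one matches exactly one branch of the union.
    collector.collect(a);
    collector.collect(b);
  }
}
```

In the job setup, the union would then be registered as the output schema, for example with AvroJob.setOutputSchema(conf, UnionReducer.OUTPUT).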

If you want a single datum that contains both a RecordA and a RecordB you
need to have your output schema be a Record with two fields:

{"type":"record", "fields":[
  {"name":"recordA", "type":"RecordA"},
  {"name":"recordB", "type":"RecordB"}
]}

And you would use this record schema to create the GenericRecord, and then
populate the fields with the inner records, then call collect once with
the outer record.
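As a rough sketch of the steps just described, the outer record could be built like this. Avro requires every record schema to have a name, so the sketch assumes the outer record is called "Pair"; that name and the single "id" field on each inner record are placeholders, not taken from the thread.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper: "Pair" and the "id" fields are illustrative assumptions.
final class PairExample {

  static final Schema PAIR = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
    + "{\"name\":\"recordA\",\"type\":{\"type\":\"record\",\"name\":\"RecordA\","
    + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}},"
    + "{\"name\":\"recordB\",\"type\":{\"type\":\"record\",\"name\":\"RecordB\","
    + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}}"
    + "]}");

  static GenericRecord pair(String idA, String idB) {
    // Inner records use the schemas of the outer record's fields.
    GenericRecord a = new GenericData.Record(PAIR.getField("recordA").schema());
    a.put("id", idA);

    GenericRecord b = new GenericData.Record(PAIR.getField("recordB").schema());
    b.put("id", idB);

    // The outer record carries both inner records in one datum.
    GenericRecord outer = new GenericData.Record(PAIR);
    outer.put("recordA", a);
    outer.put("recordB", b);
    return outer;
  }
}
```

Inside the reducer this would amount to a single call such as collector.collect(PairExample.pair(...)), with PAIR set as the job's output schema.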

Another choice is to have the output be an Avro array of the union type,
which may hold any number of RecordA and RecordB instances in a single datum.
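The array-of-union alternative could be sketched as below. The record definitions (one "id" field each) and the class and method names are assumptions for illustration only.

```java
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper: record fields and names are placeholders.
final class UnionArrayExample {

  static final Schema RECORD_A = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordA\",\"fields\":"
    + "[{\"name\":\"id\",\"type\":\"string\"}]}");

  static final Schema RECORD_B = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"RecordB\",\"fields\":"
    + "[{\"name\":\"id\",\"type\":\"string\"}]}");

  // {"type":"array", "items":["RecordA", "RecordB"]}
  static final Schema ARRAY_OF_UNION =
      Schema.createArray(Schema.createUnion(Arrays.asList(RECORD_A, RECORD_B)));

  static GenericData.Array<GenericRecord> datum(GenericRecord... records) {
    // One array datum; every element matches one branch of the item union.
    GenericData.Array<GenericRecord> out =
        new GenericData.Array<>(records.length, ARRAY_OF_UNION);
    for (GenericRecord record : records) {
      out.add(record);
    }
    return out;
  }
}
```

The reducer would then call collect once with the array, and the job's output schema would be ARRAY_OF_UNION rather than the bare union.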

Andrew Kenworthy 2011-12-13, 10:27
Andrew Kenworthy 2011-12-08, 15:03