Pig >> mail # user >> Approaches to storing arbitrary schema in a sequencefile

Re: Approaches to storing arbitrary schema in a sequencefile
We tend to write Protobuf or Thrift definitions for complex objects,
but that introduces severe latency into the development process.
I suppose you could try something like Kryo (and create a
corresponding deserializer for Elephant Bird). The core of the problem
is that you need to carry around the schema, and you probably don't
want to write it into every tuple.


On Sat, Sep 15, 2012 at 5:15 PM, Mat Kelcey <[EMAIL PROTECTED]> wrote:
> Hey all,
> I've started using SequenceFiles more and more (in particular the
> elephant bird load and storage functions) and am wondering what the
> best approach is for marshaling between a schema from pig (which can
> have some arbitrary number of fields) and a sequence file (which must
> have two fields: key and value).
> When I've got two fields it's trivial...
>  %declare SEQFILE_STORAGE
> 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
>  %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
>  %declare LONG_CONVERTER
> 'com.twitter.elephantbird.pig.util.LongWritableConverter';
>  a = load 'x' as (f1:chararray, f2:chararray);
>  store a into 'y' using $SEQFILE_STORAGE( '-c $TEXT_CONVERTER', '-c
> but what's the best way to handle something with 3+ fields?
>  a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);
> I can see two options...
> 1) A simple Writable converter to convert to something like f1 and a
> composite f2, f3 field
> 2) Packing the fields myself using something like "a = foreach a
> generate f1, TOTUPLE(f2, f3)"
> But both are super clumsy and require unpacking when I reread things.
> Am I missing something obvious here?
> Cheers,
> Mat
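
For reference, option 2 from the quoted message might look roughly like the sketch below. It only shows the packing and unpacking steps in plain Pig Latin; the relation and alias names (packed, restored, k, v) are made up for illustration, and the store/load calls are left out because the thread does not say which Elephant Bird converter would handle a tuple value.

```pig
-- Sketch only: packed, restored, k and v are hypothetical names.
a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);

-- Pack: f1 becomes the key; f2 and f3 travel together as one tuple value.
packed = foreach a generate f1 as k, TOTUPLE(f2, f3) as v;

-- Unpack after re-reading: FLATTEN turns the tuple back into top-level
-- fields, but the field names and types have to be restated by hand.
restored = foreach packed generate k as f1,
           FLATTEN(v) as (f2:chararray, f3:chararray);
```

Having to restate the field names and types in the unpacking step is the "carry around the schema" problem mentioned in the reply: nothing in the key/value pair itself records them.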