Pig user mailing list: Approaches to storing arbitrary schema in a sequencefile


Mat Kelcey 2012-09-16, 00:15
Re: Approaches to storing arbitrary schema in a sequencefile
We tend to write protobuf or thrift definitions for complex objects,
but that introduces severe latency into the development process.
I suppose you could try something like kryo (and create a
corresponding deserializer for EB)... The core of the problem is that
you need to carry around the schema, and you probably don't want to
write it into every tuple.

D
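
To make the protobuf/thrift route above concrete, here is a minimal sketch in
the same Pig idiom as the two-field example quoted below. It assumes a
generated Thrift class (com.example.MyRecord, hypothetical) that covers the
non-key fields, and that elephant-bird's ThriftWritableConverter can serve as
the value-side converter; the converter class name and the way the Thrift
class is handed to it are assumptions to check against the elephant-bird
version in use.

  -- Sketch only, under the assumptions above: the schema lives in the generated
  -- Thrift class (com.example.MyRecord, hypothetical), not in every tuple.
  %declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
  %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
  %declare THRIFT_CONVERTER 'com.twitter.elephantbird.pig.util.ThriftWritableConverter';

  a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);
  -- keep f1 as the key; bundle the remaining fields into one tuple shaped like MyRecord
  b = foreach a generate f1, TOTUPLE(f2, f3) as rec;
  -- how the Thrift class is supplied to the converter varies; it is shown here
  -- appended as a converter argument, which is an assumption to verify
  store b into 'y' using $SEQFILE_STORAGE(
      '-c $TEXT_CONVERTER',
      '-c $THRIFT_CONVERTER com.example.MyRecord');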

On Sat, Sep 15, 2012 at 5:15 PM, Mat Kelcey <[EMAIL PROTECTED]> wrote:
> Hey all,
>
> I've started using SequenceFiles more and more (in particular the
> elephant-bird load and storage functions) and am wondering what the
> best approach is for marshaling between a schema from Pig (which can
> have an arbitrary number of fields) and a sequence file (which must
> have exactly two fields: key and value).
>
> When I've got two fields it's trivial...
>
>  %declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
>  %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
>  %declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
>  a = load 'x' as (f1:chararray, f2:chararray);
>  store a into 'y' using $SEQFILE_STORAGE('-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER');
>
> but what's the best way to handle something with 3+ fields?
>
>  a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);
>
> I can see two options...
> 1) A simple writable converter to convert to something like f1 and a
> composite f2, f3 field
> 2) Packing the fields myself using something like "a = foreach a
> generate f1, TOTUPLE(f2, f3)" (see the sketch after this message)
>
> But both are super clumsy and require unpacking when I reread things.
>
> Am I missing something obvious here?
>
> Cheers,
> Mat
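
To make the "super clumsy" point concrete, here is a minimal sketch of option
2 from the message above, showing both the packing before the store and the
unpacking after the reload. The store/load step itself is left as comments,
because a value-side converter that can handle the composite tuple is exactly
the open question of this thread.

  -- Option 2, end to end: pack before storing, unpack after rereading.
  a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);

  -- pack: keep f1 as the key, bundle the rest into one composite value
  packed = foreach a generate f1, TOTUPLE(f2, f3) as rest;

  -- storing into / loading back from the sequence file would go here (via
  -- $SEQFILE_STORAGE and the elephant-bird loader); the converter for the
  -- composite value is the unresolved part, so it is omitted from the sketch

  -- unpack: the composite value has to be flattened back out after rereading
  unpacked = foreach packed generate f1, FLATTEN(rest);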