-
Approaches to storing arbitrary schema in a sequencefile
Mat Kelcey 2012-09-16, 00:15
Hey all,
I've started using SequenceFiles more and more (in particular the Elephant Bird load and storage functions) and am wondering what the best approach is for marshaling between a schema from Pig (which can have an arbitrary number of fields) and a sequence file (which must have exactly two fields: key and value).
When I've got two fields it's trivial...
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
a = load 'x' as (f1:chararray, f2:chararray);
store a into 'y' using $SEQFILE_STORAGE('-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER');
but what's the best way to handle something with 3+ fields?
a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);
I can see two options...
1) A simple Writable converter that maps the tuple to something like f1 plus a composite (f2, f3) field
2) Packing the fields myself using something like "a = foreach a generate f1, TOTUPLE(f2, f3)"
But both are super clumsy and require unpacking when I reread things.
Am I missing something obvious here?
Cheers, Mat
-
Re: Approaches to storing arbitrary schema in a sequencefile
Dmitriy Ryaboy 2012-09-16, 02:44
We tend to write protobuf or thrift definitions for complex objects, but that introduces severe latency into the development process. I suppose you could try something like Kryo (and create a corresponding deserializer for EB). The core of the problem is that you need to carry around the schema, and you probably don't want to write it into every tuple.
D
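To make the schema-per-tuple cost concrete, here is a minimal Python sketch. It is not Elephant Bird or Kryo code; the `SCHEMA` and `ROWS` values and both helper functions are hypothetical, chosen to mirror the (f1, f2, f3) example from the original question. It compares serialising each row together with its field names against serialising bare positional values and keeping the schema somewhere else.

```python
import json

# Hypothetical flat schema and rows, mirroring the (f1, f2, f3) example.
SCHEMA = ["f1", "f2", "f3"]
ROWS = [("a", "b", "c"), ("d", "e", "f")]


def self_describing(rows, schema):
    """Serialise each row with its field names: the schema repeats per tuple."""
    return [json.dumps(dict(zip(schema, row)), sort_keys=True) for row in rows]


def positional(rows):
    """Serialise each row as a bare list: the schema must be kept elsewhere."""
    return [json.dumps(list(row)) for row in rows]


with_names = self_describing(ROWS, SCHEMA)
without_names = positional(ROWS)

# Total serialised size in characters for each approach.
named_size = sum(len(s) for s in with_names)
positional_size = sum(len(s) for s in without_names)
print(named_size, positional_size)
```

The positional form is smaller, but a reader then needs the schema from some other source (a side file, a naming convention, or the load statement) to make sense of the values.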
-
Re: Approaches to storing arbitrary schema in a sequencefile
Mat Kelcey 2012-09-16, 03:26
I guess I was looking for a quick win for a simple flat schema; a serialisation format feels like a bit of overkill for what I'm doing. I might be able to just JSON my way out of this specific problem...
Cheers! Mat
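A rough sketch of what the JSON route could look like, in plain Python and outside of Pig or Elephant Bird. The `pack` and `unpack` helpers are hypothetical names, and treating the first field as the key is an assumption. Everything after the first field goes into a single JSON string, so each record fits the key/value shape a sequence file needs and two text converters would be enough.

```python
import json


def pack(row):
    """Split a flat row into (key, value): first field is the key,
    the remaining fields become one JSON-encoded string."""
    key, rest = row[0], list(row[1:])
    return key, json.dumps(rest)


def unpack(key, value):
    """Rebuild the flat row from a (key, value) pair produced by pack()."""
    return (key, *json.loads(value))


key, value = pack(("a", "b", "c"))
row = unpack(key, value)
print(key, value, row)
```

The field order is the only schema here, so the reader still has to know it when loading the data back.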