Pig, mail # user - how to load custom Writable class from sequence file?


Re: how to load custom Writable class from sequence file?
Yang 2013-09-24, 17:51
Thanks for bringing up Scalding. I actually didn't know about it, and meant
to use Scala as an "easier, quick-and-dirty Java". But yes, Scalding
seems better suited for this.
On Tue, Sep 24, 2013 at 2:22 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> I assume by scala you mean scalding?
> If so, yeah, scalding should be much easier for working with custom data
> types.
>
> Pig doesn't handle generic "objects" well. You have to write converters to
> and from, like the ones we created in ElephantBird for Protocol Buffers and
> Thrift (and a bunch of writables, as Pradeep pointed out).
>
> D
>
>
> On Tue, Sep 17, 2013 at 9:20 AM, Yang <[EMAIL PROTECTED]> wrote:
>
> > Thanks Pradeep.
> >
> > it seems in this case just using scala/cascalog is easier for my
> > purposes. I tried out scala yesterday, works fine for me in local mode
> >
> >
> > On Mon, Sep 16, 2013 at 7:47 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
> >
> > > It doesn't look like the SequenceFileLoader from the piggybank has much
> > > support. The elephant bird version looks like it does what you need it
> > > to do.
> > >
> > > https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java
> > >
> > > You'll have to write the converters from your types to Pig data types
> > > and pass them into the constructor of the SequenceFileLoader.
> > >
> > > Hope this helps!
> > >
> > >
> > > On Mon, Sep 16, 2013 at 6:56 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
> > >
> > > > That's correct...
> > > >
> > > > The "load ... AS (k:chararray, v:chararray);" doesn't actually do
> > > > what you think it does. The AS clause tells Pig what the schema
> > > > types are, so it will call the appropriate LoadCaster method to get
> > > > each field into the right type. A LoadCaster object defines how to
> > > > map byte[] into appropriate Pig datatypes. If the LoadFunc is not
> > > > schema aware and you don't have the schema defined when you load,
> > > > everything will be loaded as a bytearray.
> > > >
> > > > The problem you have is that the custom writable isn't a Pig
> > > > datatype. I don't think you'll be able to do this without writing
> > > > some custom code. I'll take a look at the source code for the
> > > > SequenceFileLoader and see if there's a way to specify your own
> > > > LoadCaster. If there is, then you'll just have to write a custom
> > > > LoadCaster and specify it in the configuration. If not, you'll have
> > > > to extend and roll out your own SequenceFileLoader.
> > > >
> > > > On Mon, Sep 16, 2013 at 6:43 PM, Yang <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> I think my custom type has toString(), well at least writable()
> > > >> says it's writable to bytes, so supposedly if I force it to bytes
> > > >> or string, pig should be able to cast
> > > >> like
> > > >>
> > > >> load ... AS ( k:chararray, v:chararray);
> > > >>
> > > >> but this actually fails
> > > >>
> > > >>
> > > >> On Mon, Sep 16, 2013 at 6:22 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> > The problem is that pig only speaks its data types. So you need
> > > >> > to tell it how to translate from your custom writable to a pig
> > > >> > datatype.
> > > >> >
> > > >> > Apparently elephant-bird has some support for doing this type of
> > > >> > thing... take a look at this SO post
> > > >> >
> > > >> > http://stackoverflow.com/questions/16540651/apache-pig-can-we-convert-a-custom-writable-object-to-pig-format
> > > >> >
> > > >> >
> > > >> > On Mon, Sep 16, 2013 at 5:37 PM, Yang <[EMAIL PROTECTED]> wrote:
> > > >> >
> > > >> > > I tried to do a quick and dirty inspection of some of our data
> > > >> > > feeds, which are encoded in gzipped SequenceFile.
> > > >> > >
> > > >> > > basically I did
> > > >> > >
> > > >> > > a = load 'myfile' using ......SequenceFileLoader() AS ( mykey,