Pig, mail # user - how to load custom Writable class from sequence file?


Re: how to load custom Writable class from sequence file?
Pradeep Gollakota 2013-09-17, 01:56
That's correct...

The "load ... AS (k:chararray, v:chararray);" doesn't actually do what you
think it does. The AS clause tells Pig what the schema types are, so it
will call the appropriate LoadCaster method to get each field into the right
type. A LoadCaster object defines how to map a byte[] into the appropriate
Pig datatype. If the LoadFunc is not schema-aware and you don't define a
schema when you load, everything will be loaded as a bytearray.
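To make that concrete, here's a sketch (alias and file names are placeholders) of the two cases. Note the AS version only works if the loader's LoadCaster actually knows how to decode the raw bytes:

```pig
-- No schema + non-schema-aware loader: every field defaults to bytearray.
a = load 'myfile' using org.apache.pig.piggybank.storage.SequenceFileLoader();

-- With AS, Pig calls the loader's LoadCaster (bytesToCharArray, etc.)
-- to convert each byte[] field into the declared type.
b = load 'myfile' using org.apache.pig.piggybank.storage.SequenceFileLoader()
        AS (k:chararray, v:chararray);
```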

The problem you have is that the custom writable isn't a Pig datatype. I
don't think you'll be able to do this without writing some custom code.
I'll take a look at the source code for the SequenceFileLoader and see if
there's a way to specify your own LoadCaster. If there is, then you'll just
have to write a custom LoadCaster and specify it in the configuration. If
not, you'll have to extend SequenceFileLoader and roll your own.
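As a rough sketch, a custom LoadCaster might look like the following. This assumes Pig's Utf8StorageConverter base class (which implements LoadCaster); the VisitKey deserialization helper is entirely hypothetical and would have to match your writable's actual byte layout:

```java
import java.io.IOException;
import org.apache.pig.builtin.Utf8StorageConverter;

// Sketch only: override just the conversions you need; the base class
// supplies UTF-8 defaults for everything else.
public class VisitKeyCaster extends Utf8StorageConverter {
    @Override
    public String bytesToCharArray(byte[] b) throws IOException {
        // Deserialize the raw bytes back into a VisitKey and render it
        // as a string. VisitKey.deserialize() is a hypothetical helper
        // standing in for your writable's readFields-based decoding.
        // VisitKey key = VisitKey.deserialize(b);
        // return key.toString();
        return new String(b, "UTF-8"); // placeholder fallback
    }
}
```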
On Mon, Sep 16, 2013 at 6:43 PM, Yang <[EMAIL PROTECTED]> wrote:

> I think my custom type has toString(), well at least writable() says it's
> writable to bytes, so supposedly if I force it to bytes or string, pig
> should be able to cast
> like
>
> load ... AS ( k:chararray, v:chararray);
>
> but this actually fails
>
>
> On Mon, Sep 16, 2013 at 6:22 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
>
> > The problem is that pig only speaks its data types. So you need to tell
> > it how to translate from your custom writable to a pig datatype.
> >
> > Apparently elephant-bird has some support for doing this type of thing...
> > take a look at this SO post:
> >
> > http://stackoverflow.com/questions/16540651/apache-pig-can-we-convert-a-custom-writable-object-to-pig-format
> >
> > On Mon, Sep 16, 2013 at 5:37 PM, Yang <[EMAIL PROTECTED]> wrote:
> >
> > > I tried to do a quick and dirty inspection of some of our data feeds,
> > > which are encoded in gzipped SequenceFiles.
> > >
> > > basically I did
> > >
> > > a = load 'myfile' using ......SequenceFileLoader() AS (mykey, myvalue);
> > >
> > > but it gave me some error:
> > >
> > > 2013-09-16 17:34:28,915 [Thread-5] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > > 2013-09-16 17:34:28,915 [Thread-5] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > > 2013-09-16 17:34:28,915 [Thread-5] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > > 2013-09-16 17:34:28,961 [Thread-5] WARN  org.apache.pig.piggybank.storage.SequenceFileLoader - Unable to translate key class com.mycompany.model.VisitKey to a Pig datatype
> > > 2013-09-16 17:34:28,962 [Thread-5] WARN  org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in cleanup
> > > 2013-09-16 17:34:28,963 [Thread-5] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
> > > org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class com.mycompany.model.VisitKey to a Pig datatype
> > >     at org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
> > >     at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:133)
> > >
> > > in the pig file, I have already REGISTERed the jar that contains the
> > > class com.mycompany.model.VisitKey
> > >
> > > if PIG doesn't work, the only other approach is probably to use some of
> > > the newer "pseudo-scripting" languages like cascalog or scala
> > >
> > > thanks
> > > Yang
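For reference, the elephant-bird approach mentioned in the quoted reply plugs a WritableConverter into its own SequenceFileLoader via constructor arguments. A hedged sketch, based on the pattern described in that SO post (the jar names and the VisitKeyConverter class are hypothetical; you would have to write a converter matching your writable):

```pig
REGISTER 'elephant-bird.jar';
REGISTER 'my-model.jar';  -- would contain VisitKey and its converter

a = LOAD 'myfile' USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
        '-c com.mycompany.pig.VisitKeyConverter',               -- hypothetical key converter
        '-c com.twitter.elephantbird.pig.util.TextConverter')   -- stock value converter
    AS (k:chararray, v:chararray);
```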