

Jonathan Coveney 2010-12-28, 22:08
Dmitriy Ryaboy 2010-12-28, 23:08
Jonathan Coveney 2010-12-28, 23:24
Dmitriy Ryaboy 2010-12-28, 23:41
Re: Possible deficiency in describe?
The BinStorage format should not change between Pig versions. It is like an interface: it should not change unless there is a very strong reason.
It used to be the format used to (de)serialize data between Pig stages, but when changes were made to optimize that format as part of JIRA PIG-1472, a new format/loader was introduced instead of changing BinStorage.

-Thejas
On 12/28/10 3:41 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

BinStorage is more efficient and doesn't have the trouble with nested data
representations you encountered in PigStorage. The downside is only that
it's not human-readable, and that it might change between versions of Pig
(though so far we have resisted the urge, iirc)

D

On Tue, Dec 28, 2010 at 3:24 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Thanks. Is there any particular downside to this if you get to the millions
> and hundreds of millions of rows, or is it just the lack of simple use with
> nonpig systems?
>
> Sent via BlackBerry
>
> -----Original Message-----
> From: Dmitriy Ryaboy <[EMAIL PROTECTED]>
> Date: Tue, 28 Dec 2010 15:08:15
> To: <[EMAIL PROTECTED]>
> Reply-To: [EMAIL PROTECTED]
> Subject: Re: Possible deficiency in describe?
>
> Try using BinStorage instead of the text-based PigStorage
>
> D
>
> On Tue, Dec 28, 2010 at 2:08 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>
> > So, I made a dumb little Python script that parses a Pig script, sees
> > what stores there are, uses Pig's DESCRIBE to get the schema of each
> > object being stored, and then uses that info to make a new file that has
> > the proper loader/schema. I felt this was useful because I found myself
> > making intermediate stores and then finding it pretty difficult to write
> > the proper loader if there were a lot of columns (especially remembering
> > the types).
> >
> > However, it seems that the result from DESCRIBE is not adequate to do a
> > load. For example, I have test.txt, which is literally just random pairs
> > of numbers
> >
> > i.e.
> >
> > 1 2
> > 1 3
> > 1 4
> > 2 5
> > 2 6
> > 3 7
> > 3 8
> > 4 9
> > 5 10
> > 6 11
> > 7 12
> > 8 13
> > 8 14
> > 8 15
> >
> > and so on.
> >
> > I do this:
> >
> > t1 = LOAD 'test.txt' AS (n1:int, n2:int);
> > t2 = GROUP t1 BY n1;
> > t3 = GROUP t2 BY group;
> >
> > DESCRIBE t3;
> > STORE t3 INTO 'output.txt';
> >
> > The query runs without a hitch; however, there is an issue.
> >
> > This is what describe gives:
> >
> > t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}
> >
> > However, this won't let you load the file...
> >
> > The output has the form
> > x {(y,{(a,b)})}
> >
> > And I'm not really sure how to go about even creating a loader that would
> > properly load it. Suffice it to say, it seems pretty complicated to store
> > and then load anything that isn't a flat file... is this by design? Is
> > there an easier way to go from the schema, as per DESCRIBE, to the schema
> > you'd use to load it?
> >
> > I'm curious what people do in practice. I could probably extend the
> > script I made to go from describe schema -> loading schema (if the Pig
> > loader can load things that have brackets and all that?), but I want to
> > know what the limitations are.
> >
> > As always, I apologize if there is an easy answer to this. Thanks.
> >
>
>
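
For anyone who wants to try the describe schema -> loading schema step discussed in this thread, here is a minimal Python sketch, not the original poster's script. It assumes DESCRIBE prints the older format shown above (bags as `{...}` with the inner tuple parentheses omitted, tuples as `(...)`), and that field names are plain identifiers (no `::` prefixes from joins). The function names (`parse_describe`, `render`, `load_statement`, `store_statement`) are hypothetical. The generated statements follow the BinStorage suggestion from the replies and spell complex types in the long `bag{t:tuple(...)}` form; the tuple alias `t` is an arbitrary placeholder. The output is untested against any particular Pig release, and a field named `group` may need renaming if your Pig version rejects that keyword in an AS clause.

```python
import re


def tokenize(text):
    """Split DESCRIBE output into punctuation and word tokens."""
    return re.findall(r"[{}(),:]|\w+", text)


def parse_fields(tokens, pos, closer):
    """Parse 'name: type, ...' up to `closer`; return (fields, next position)."""
    fields = []
    while tokens[pos] != closer:
        name = tokens[pos]
        if tokens[pos + 1] != ":":
            raise ValueError("expected ':' after field name %r" % name)
        pos += 2
        if tokens[pos] == "{":    # nested bag
            inner, pos = parse_fields(tokens, pos + 1, "}")
            fields.append((name, "bag", inner))
        elif tokens[pos] == "(":  # nested tuple
            inner, pos = parse_fields(tokens, pos + 1, ")")
            fields.append((name, "tuple", inner))
        else:                     # scalar type such as int or chararray
            fields.append((name, tokens[pos], None))
            pos += 1
        if tokens[pos] == ",":
            pos += 1
    return fields, pos + 1        # step past the closing token


def parse_describe(line):
    """Parse one line of DESCRIBE output into (alias, fields)."""
    tokens = tokenize(line)
    if tokens[1:3] != [":", "{"]:
        raise ValueError("expected 'alias: {...}'")
    fields, _ = parse_fields(tokens, 3, "}")
    return tokens[0], fields


def render(fields):
    """Render parsed fields as the body of a LOAD ... AS (...) clause."""
    parts = []
    for name, kind, inner in fields:
        if kind == "bag":
            parts.append(f"{name}:bag{{t:tuple({render(inner)})}}")
        elif kind == "tuple":
            parts.append(f"{name}:tuple({render(inner)})")
        else:
            parts.append(f"{name}:{kind}")
    return ", ".join(parts)


def store_statement(alias, path):
    """Build a STORE statement that writes the relation with BinStorage."""
    return f"STORE {alias} INTO '{path}' USING BinStorage();"


def load_statement(describe_line, path):
    """Build the matching LOAD statement from a line of DESCRIBE output."""
    alias, fields = parse_describe(describe_line)
    return f"{alias} = LOAD '{path}' USING BinStorage() AS ({render(fields)});"


if __name__ == "__main__":
    line = "t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}"
    print(store_statement("t3", "output"))
    print(load_statement(line, "output"))
```

Run against the DESCRIBE line from the original message, the sketch prints:

    STORE t3 INTO 'output' USING BinStorage();
    t3 = LOAD 'output' USING BinStorage() AS (group:int, t2:bag{t:tuple(group:int, t1:bag{t:tuple(n1:int, n2:int)})});

The recursive parser keeps the nesting explicit, so the same approach extends to deeper schemas without special cases.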