Pig >> mail # user >> Possible deficiency in describe?


Re: Possible deficiency in describe?
The BinStorage format should not change between Pig versions. It is like an interface: it should not change unless there is a very strong reason.
It used to be the format used to (de)serialize data between Pig stages, but when that format was optimized as part of JIRA PIG-1472, a new format/loader was introduced instead of changing BinStorage.

-Thejas
On 12/28/10 3:41 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

BinStorage is more efficient and doesn't have the trouble with nested data
representations you encountered in PigStorage. The downside is only that
it's not human-readable, and that it might change between versions of Pig
(though so far we have resisted the urge, iirc)

D

On Tue, Dec 28, 2010 at 3:24 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Thanks. Is there any particular downside to this if you get to the millions
> and hundreds of millions of rows, or is it just the lack of simple use with
> non-Pig systems?
>
> Sent via BlackBerry
>
> -----Original Message-----
> From: Dmitriy Ryaboy <[EMAIL PROTECTED]>
> Date: Tue, 28 Dec 2010 15:08:15
> To: <[EMAIL PROTECTED]>
> Reply-To: [EMAIL PROTECTED]
> Subject: Re: Possible deficiency in describe?
>
> Try using BinStorage instead of the text-based PigStorage
>
> D
>
> On Tue, Dec 28, 2010 at 2:08 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>
> > So, I made a dumb little Python script that parses a Pig script, sees what
> > stores there are, uses Pig's DESCRIBE to get the schema of each object
> > being stored, and then uses that info to make a new file that has the
> > proper loader/schema. I felt this was useful because I found myself making
> > intermediate stores and then finding it pretty difficult to write the
> > proper loader if there were a lot of columns (especially remembering the
> > types).
> >
> > However, it seems that the result from DESCRIBE is not adequate to do a
> > load. For example, I have test.txt, which is literally just random pairs
> > of numbers, e.g.:
> >
> > 1 2
> > 1 3
> > 1 4
> > 2 5
> > 2 6
> > 3 7
> > 3 8
> > 4 9
> > 5 10
> > 6 11
> > 7 12
> > 8 13
> > 8 14
> > 8 15
> >
> > and so on.
> >
> > I do this:
> >
> > t1 = LOAD 'test.txt' AS (n1:int, n2:int);
> > t2 = GROUP t1 BY n1;
> > t3 = GROUP t2 BY group;
> >
> > DESCRIBE t3;
> > STORE t3 INTO 'output.txt';
> >
> > The query runs without a hitch; however, there is an issue.
> >
> > This is what describe gives:
> >
> > t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}
> >
> > However, this won't let you load the file...
> >
> > The output has the form
> > x{(y,{(a,b)}
> >
> > And I'm not really sure how to go about even creating a loader that would
> > properly load it. Suffice it to say, it seems pretty complicated to store
> > and then load anything that isn't a flat file... is this by design? Is
> > there an easier way to go from the schema, as per DESCRIBE, to the schema
> > you'd use to load it?
> >
> > I'm curious what people do in practice. I could probably extend the script
> > I made to go from describe schema -> loading schema (if the Pig loader can
> > load things that have brackets and all that?), but I want to know what the
> > limitations are.
> >
> > As always, I apologize if there is an easy answer to this. Thanks.
> >
>
>
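
The "describe schema -> loading schema" step mentioned in the original message could be sketched in Python roughly as follows. This is only an illustration under assumptions, not the poster's actual script: the function name `describe_to_load_schema` and the tuple alias `t` are made up here, the input is assumed to be a single line of DESCRIBE output like the one quoted above, and field names containing `::` are not handled. Whether a given Pig version accepts every generated name in an AS clause (for example `group`, which is also a keyword) is not something this sketch checks.

```python
import re


def describe_to_load_schema(describe_line):
    """Turn one line of DESCRIBE output into (alias, AS-clause schema text).

    Hypothetical helper: braces are treated as bags of tuples, parentheses
    as tuples, and anything else as a simple type name passed through as-is.
    """
    alias, _, body = describe_line.partition(":")
    # Split into punctuation tokens and bare words (names and type names).
    tokens = re.findall(r"[{}(),:]|[^\s{}(),:]+", body)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"unexpected token {tok!r} at position {pos}")
        pos += 1
        return tok

    def fields(closer):
        # Parse "name: type, name: type, ..." up to (not including) closer.
        parts = []
        while peek() != closer:
            name = take()
            take(":")
            parts.append(f"{name}:{field_type()}")
            if peek() == ",":
                take(",")
        return ", ".join(parts)

    def field_type():
        tok = take()
        if tok == "{":
            # A bag. Some Pig versions print the inner tuple's parentheses,
            # others (as in the message above) omit them; accept both.
            if peek() == "(":
                take("(")
                inner = fields(")")
                take(")")
            else:
                inner = fields("}")
            take("}")
            return "bag{t:tuple(" + inner + ")}"
        if tok == "(":
            inner = fields(")")
            take(")")
            return "tuple(" + inner + ")"
        return tok  # simple type such as int or chararray

    # The outermost braces describe the relation itself, so they become
    # the parenthesized field list of an AS clause.
    take("{")
    top = fields("}")
    take("}")
    return alias.strip(), "(" + top + ")"


if __name__ == "__main__":
    line = "t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}"
    alias, schema = describe_to_load_schema(line)
    print(f"{alias} = LOAD 'output' USING BinStorage() AS {schema};")
```

For the `t3` schema quoted above, the helper returns `(group:int, t2:bag{t:tuple(group:int, t1:bag{t:tuple(n1:int, n2:int)})})`. The printed LOAD statement pairs that schema with BinStorage, as suggested in the replies, and assumes the data was also stored with BinStorage.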