Pig >> mail # user >> Using 'collected' group


Re: Using 'collected' group
One possibility is to introduce a 'mode' setting in Pig with a default value of
'strict' and other values such as 'non-strict', or potentially more. Another use
case for a 'non-strict' mode is using PigStorage in a Merge Join. PigStorage
inherently cannot guarantee all the requirements imposed by Merge Join, but
you can still use it in most cases. I don't recall all the details, but the
discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
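
For context, here is a minimal sketch of the Merge Join case mentioned above. The paths and field names are hypothetical, and it assumes both inputs were already written out sorted on the join key, which PigStorage itself has no way to verify:

```pig
-- Hypothetical inputs, assumed to be pre-sorted on uid by whatever wrote them.
users  = load 'users_sorted'  using PigStorage() as (uid:long, name:chararray);
tweets = load 'tweets_sorted' using PigStorage() as (uid:long, text:chararray);

-- Merge join relies on that sort order; nothing in PigStorage enforces it.
joined = join tweets by uid, users by uid using 'merge';
```

Whether the result is correct depends entirely on what the user knows about the data, which is the same trade-off a 'non-strict' mode would expose for 'collected' group.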

Ashutosh
On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> Hi guys,
> It seems like our 'collected' option for group is pretty limited.
> Imagine I have the following (silly example) script:
>
> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
> text:chararray, ts:long);
> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>
> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
> (ngram:chararray);
>
> -- get only happy ngrams, using replicated to avoid MR step
> happy_ngrams = join ngrams by ngram, happy_words by word using
> 'replicated';
>
> -- find only happy tweets. We know ngrams that were exploded from a single
> -- tweet must be in the same mapper still, so in theory this should work
> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>
>
> But this doesn't work, of course, because there's a whole mess of operators
> between the load and the group, including a join, and nothing makes any
> guarantees about (id, uid) being on the same mapper except for what the
> user knows about the data.
>
> What's the right approach to let the user force this through?
> a) this is an edge-case optimization that's more trouble than it is worth
> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
> disable sanity checks
> c) using 'collected-its-cool-dmitriy-said-its-ok'
> d) drop the checks altogether
> e) something else?
>
> D
>
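
As a point of comparison, here is a rough sketch of the shape that 'collected' group is meant for, with the group applied directly to a loaded relation. CollectedNgramLoader and the 'tweet_ngrams' path are hypothetical; the loader is assumed to implement Pig's CollectableLoadFunc interface and to keep every row for a given (id, uid) inside a single split:

```pig
-- CollectedNgramLoader is hypothetical: assumed to implement CollectableLoadFunc
-- and to keep all rows for one (id, uid) contiguous within a single split.
ngrams = load 'tweet_ngrams' using CollectedNgramLoader() as (id:long, uid:long,
    ngram:chararray);

-- No join or other operators sit between the load and the group here.
by_tweet = group ngrams by (id, uid) using 'collected';
```

The script in the quoted message differs in that a replicated join sits between the load and the group, so the guarantee comes only from the user's knowledge of the data rather than from the loader.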