Dmitriy Ryaboy 2011-10-06, 15:50
One possibility is to introduce a 'mode' setting in Pig with a default value of
'strict'; other values could be 'non-strict' or potentially others. Another use
case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently
PigStorage cannot guarantee all the requirements imposed by Merge Join, but
you can still use it in most cases. I don't recall all the details, but the
discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Hi guys,
> It seems like our 'collected' option for group is pretty limited.
> Imagine I have the following (silly example) script:
> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
> text:chararray, ts:long);
> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as ngram;
> -- get only happy ngrams, using replicated to avoid MR step
> happy_ngrams = join ngrams by ngram, happy_words by word using 'replicated';
> -- find only happy tweets. We know ngrams that were exploded from a single tweet
> -- must be in the same mapper still, so in theory this should work
> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
> But this doesn't work, of course, because there's a whole mess of operators
> between the load and the group, including a join, and nothing makes any
> guarantees about (id, uid) being on the same mapper except for what the user
> knows about the data.
> What's the right approach to let the user force this through?
> a) this is an edge case optimization that's more trouble than it is worth
> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
> disable sanity checks
> c) using 'collected-its-cool-dmitriy-said-its-ok'
> d) drop the checks altogether
> e) something else?
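
To make the concern in the quoted message concrete, here is a minimal Python sketch (not Pig code) of what a map-side 'collected' group relies on. It assumes that such a group only merges adjacent records with the same key inside a single mapper's split; the function names and sample records are hypothetical and exist only for illustration.

```python
from itertools import groupby


def collected_group(split, key):
    """Group one mapper's records map-side.

    Only adjacent records with equal keys are merged; nothing is
    shuffled between mappers.
    """
    return [(k, list(records)) for k, records in groupby(split, key=key)]


def run(splits, key):
    """Apply the map-side group to every split independently."""
    out = []
    for split in splits:
        out.extend(collected_group(split, key))
    return out


def by_id_uid(record):
    # Records are hypothetical (id, uid, ngram) tuples.
    return (record[0], record[1])


# All ngrams of a tweet stay in the same split: grouping is correct.
good_splits = [
    [(1, 10, "so happy"), (1, 10, "happy day")],
    [(2, 20, "feeling glad")],
]

# The ngrams of tweet 1 are spread over two splits: the key (1, 10)
# comes out twice, which is what the sanity checks guard against.
bad_splits = [
    [(1, 10, "so happy")],
    [(1, 10, "happy day"), (2, 20, "feeling glad")],
]

good = run(good_splits, by_id_uid)
bad = run(bad_splits, by_id_uid)
```

With good_splits each (id, uid) key appears once in the output; with bad_splits the key (1, 10) appears twice. A 'non-strict' mode or an override flag would amount to the user vouching that the data looks like the first case.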