Re: Using 'collected' group
I would vote for option C - I would like the user to sign off in each
place the feature is used.

Pig scripts will be modified over time, and the person making the edit might
not notice that the checks are turned off elsewhere in the script. If the
switch is set in a properties file, it could get used inadvertently. I think
dealing with incorrect results is too expensive, and that justifies the
per-use sign-off.
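
For example, the sign-off could sit on the statement itself (hypothetical
syntax in the spirit of option C from Dmitriy's list below; no such operator
variant exists today):

happy_tweets = group happy_ngrams by (id, uid) using 'collected-unchecked';
-- 'collected-unchecked' is illustrative only: an explicit marker that the
-- author of this statement accepts the mapper-locality assumption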

-Thejas
On 10/7/11 8:23 AM, Alan Gates wrote:
> I would vote for Dmitriy's original option b, on a per-feature basis.  I know per-feature switches are more cumbersome, but a "turn off all sanity checks" option is dangerous.  When removing safeties it seems better to do it one at a time.
>
> Alan.
>
> On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:
>
>> Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
>> totally backwards, since if you are a dummy, the last thing you will do is
>> use a little-known parameter to protect yourself... but I digress.
>>
>> Being able to set safety valves per-script seems like a good idea. Make it
>> global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
>> etc?)
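>>
>> A sketch of the per-feature flavor, reusing the property names above (these
>> switches are hypothetical and don't exist in Pig today):
>>
>> -- relax only the check this particular script relies on
>> set pig.strict.collectedgroup false;
>> -- leave every other sanity check in place
>> set pig.strict.mergejoin true;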
>>
>> D
>>
>> On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:
>>
>>> One possibility is to introduce a 'mode' in Pig with a default value of
>>> 'strict'; other values would be 'non-strict', or potentially others. Another
>>> use case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently,
>>> PigStorage cannot guarantee all the requirements imposed by Merge Join, but
>>> you can still use it in most cases. I don't recall all the details, but the
>>> discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
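>>>
>>> For illustration, a script-level override under that proposal might look
>>> like this (the 'pig.mode' property is a hypothetical sketch, not an
>>> existing Pig setting):
>>>
>>> -- hypothetical: ask the planner to skip sanity checks such as the
>>> -- 'collected' group and Merge Join loader requirements
>>> set pig.mode 'non-strict';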
>>>
>>> Ashutosh
>>> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi guys,
>>>> It seems like our 'collected' option for group is pretty limited.
>>>> Imagine I have the following (silly example) script:
>>>>
>>>> tweets = load 'tweets' using TweetLoader()
>>>>     as (id:long, uid:long, text:chararray, ts:long);
>>>> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>>>>
>>>> ngrams = foreach tweets generate id, uid, ts,
>>>>     FLATTEN(NGRAM(text)) as (ngram:chararray);
>>>>
>>>> -- get only happy ngrams, using a replicated join to avoid an MR step
>>>> happy_ngrams = join ngrams by ngram, happy_words by word using 'replicated';
>>>>
>>>> -- find only happy tweets. We know ngrams that were exploded from a single
>>>> -- tweet must still be in the same mapper, so in theory this should work
>>>> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>>>>
>>>>
>>>> But this doesn't work, of course, because there's a whole mess of operators
>>>> between the load and the group, including a join, and nothing makes any
>>>> guarantees about (id, uid) being on the same mapper except for what the user
>>>> knows about the data.
>>>>
>>>> What's the right approach to let the user force this through?
>>>> a) this is an edge-case optimization that's more trouble than it is worth
>>>> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
>>>> disable sanity checks
>>>> c) using 'collected-its-cool-dmitriy-said-its-ok'
>>>> d) drop the checks altogether
>>>> e) something else?
>>>>
>>>> D
>>>>
>>>
>