Pig >> mail # user >> Issues with SAMPLE in PIG v0.8.1


Brian Choi 2012-09-11, 22:37
Cheolsoo Park 2012-09-13, 06:12
Thejas Nair 2012-09-14, 01:15
Brian Choi 2012-09-17, 03:15
Prasanth J 2012-09-17, 03:20
Dmitriy Ryaboy 2012-09-17, 05:02
Brian Choi 2012-09-17, 05:13

Re: Issues with SAMPLE in PIG v0.8.1
I just ran this very script three times using Pig 0.8 (svn revision
1148107) on a set of 2.5 million rows and got (2509), (2552), and
(2473) as the output.

I don't know what to tell you; I can't reproduce this. Are you sure you are
running on the input you think you are running on?

Is this literally the script you ran?
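
As a rough sanity check on the three counts above, here is a minimal Python sketch (not Pig code) that assumes SAMPLE keeps each row independently with probability equal to the sample fraction, as described elsewhere in this thread. Under that assumption the kept count is binomial, so for 2.5 million rows at 0.001 the mean is roughly 2,500 and the standard deviation is about 50. The function name and the exact row count of 2,500,000 are assumptions made for illustration.

```python
import math

def expected_sample_stats(n_rows, fraction):
    """Mean and standard deviation of the number of rows kept when each
    row passes an independent filter with probability `fraction`."""
    mean = n_rows * fraction
    std = math.sqrt(n_rows * fraction * (1.0 - fraction))
    return mean, std

# Assumed input size: exactly 2.5 million rows (the thread says "~2.5MM").
mean, std = expected_sample_stats(2_500_000, 0.001)

# Counts reported from the three runs in this message.
observed = [2509, 2552, 2473]
z_scores = [(count - mean) / std for count in observed]

# Count reported by the original poster (~250,000), for comparison.
reported_z = (250_000 - mean) / std

print(f"mean={mean:.1f} std={std:.1f}")
print("z-scores of the three runs:", [round(z, 2) for z in z_scores])
print(f"z-score of the reported 250,000: {reported_z:.0f}")
```

If the per-row assumption holds, the three runs sit well within two standard deviations of the mean, while a count near 250,000 is far outside anything random variation could explain, which points to the input or the script rather than to sampling noise.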

On Sun, Sep 16, 2012 at 10:13 PM, Brian Choi <[EMAIL PROTECTED]> wrote:
> The PIG script would be simply as follows:
>
> UIDs = FOREACH xRelation GENERATE $0 as user_id;
> UIDsample = SAMPLE UIDs 0.001;
> UIDsampleCount = FOREACH (GROUP UIDsample ALL) GENERATE COUNT($1);
>
> where the number of UIDs is ~2.5MM user IDs. In this case
> UIDsampleCount is ~250,000 records, but it should be ~2,500.
>
> The version I am using is pig-0.8.1.
>
> Please let me know if there is any other information that you would like me
> to provide.
>
>         brian
>
>
> On Sun, Sep 16, 2012 at 10:02 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>
>> Brian, could you provide a complete script that reproduces the issue?
>> What version of pig are you on?
>>
>> Thanks,
>> -D
>>
>> On Sun, Sep 16, 2012 at 8:15 PM, Brian Choi <[EMAIL PROTECTED]> wrote:
>> > Yes, I saw this issue with SAMPLE in multiple runs. The strangest thing
>> > about this is that it approaches the correct values for SAMPLE as you
>> > approach a sample fraction of 100% (or 0.99), but gets worse as you start
>> > getting to lower sample fractions.
>> >
>> >        Brian
>> >
>> >
>> > On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>> >
>> >> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
>> >>
>> >>> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
>> >>> nothing but "Math.random() < x" where "x" is a double.
>> >>>
>> >> You are right. The SAMPLE operator translates into a filter operator with
>> >> the condition "Math.random() < x".
>> >>
>> >>> In my test, SAMPLE A 0.00001 returns about 10 records with a million
>> >>> records when running in local mode. I am curious if something can go wrong
>> >>> when running it in MR mode.
>> >>>
>> >>
>> >> I wouldn't expect different behavior in case of MR mode.
>> >>
>> >> Brian,
>> >> Do you see this behavior across multiple runs?
>> >>
>> >> -Thejas
>> >>
>> >>
>> >>
>> >>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>>> Hello Everyone,
>> >>>>
>> >>>> I am wondering if anyone has run into an issue that I am having
>> >>>> using SAMPLE in a Pig script to create a subsample of 0.001% from the
>> >>>> original relation.
>> >>>>
>> >>>> Assume the relation "A" contains a single column of data (int type) with
>> >>>> 1,000,000 records.
>> >>>>
>> >>>> Asamp = SAMPLE A 0.00001;
>> >>>> Asamp2 = SAMPLE A 0.0001;
>> >>>>
>> >>>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
>> >>>> records, respectively. However, what I find is that Asamp and Asamp2 are
>> >>>> closer to 1000 and 10000 records, which seems like a 100-fold error in
>> >>>> sample size. Interestingly, in the limiting case of:
>> >>>>
>> >>>> Asamp3 = SAMPLE A 0.99;
>> >>>>
>> >>>> the actual subsampled size is VERY close to the expected 99% of the
>> >>>> full sample size. Can anyone shed light on what I may be doing wrong
>> >>>> or share their experiences if they have also seen issues with using
>> >>>> SAMPLE in Pig? Thank you.
>> >>>>
>> >>>>             Brian
>> >>>>
>> >>>>
>> >>>
>> >>
>>
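
The thread describes SAMPLE as a filter with the condition "Math.random() < x". The following is a minimal Python simulation of that idea, not Pig's actual implementation: it uses Python's random.random() in place of Java's Math.random(), a fixed seed for repeatability, and a hypothetical helper named bernoulli_sample. It runs the three fractions from the original question over 1,000,000 rows.

```python
import random

def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently when a uniform draw in [0, 1) falls
    below `fraction`, mirroring the "Math.random() < x" filter condition."""
    return [row for row in rows if rng.random() < fraction]

rng = random.Random(42)      # fixed seed so repeated runs give the same counts
rows = range(1_000_000)      # stand-in for relation A with 1,000,000 records

counts = {}
for fraction in (0.00001, 0.0001, 0.99):
    counts[fraction] = len(bernoulli_sample(rows, fraction, rng))
    expected = len(rows) * fraction
    print(f"fraction={fraction}: kept {counts[fraction]}, expected about {expected:.0f}")
```

With a filter of this form the kept counts should land near 10, 100, and 990,000, with no systematic 100-fold inflation at small fractions. If a real job shows such inflation, the cause is more likely in the input relation or the surrounding script than in the comparison itself.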