Pig >> mail # user >> Issues with SAMPLE in PIG v0.8.1


Brian Choi 2012-09-11, 22:37
Cheolsoo Park 2012-09-13, 06:12
Thejas Nair 2012-09-14, 01:15
Brian Choi 2012-09-17, 03:15

Re: Issues with SAMPLE in PIG v0.8.1
I used the SAMPLE operator while implementing the CUBE operator, where I choose the sample percentage at runtime so that it always emits
around 100K tuples. I tested it on inputs from 1M to 100M tuples and it worked as expected with the trunk version. I haven't tested it with earlier versions.

Thanks
-- Prasanth
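
For readers following along, here is a minimal sketch of how a runtime sample fraction targeting roughly 100K tuples could be computed. It is not the actual CUBE implementation; the function name `sample_fraction` and its parameters are hypothetical, and it only assumes that the input size is known before SAMPLE is applied.

```python
def sample_fraction(total_tuples, target_tuples=100_000):
    """Return a fraction for SAMPLE so that about target_tuples are emitted.

    Hypothetical helper for illustration only.
    """
    if total_tuples <= 0:
        raise ValueError("total_tuples must be positive")
    # Cap at 1.0 so that inputs smaller than the target are kept whole.
    return min(1.0, target_tuples / total_tuples)


# Fractions for the input sizes mentioned above (1M to 100M tuples).
for n in (1_000_000, 10_000_000, 100_000_000):
    print(n, sample_fraction(n))
```

The resulting value would then be passed to the script as the SAMPLE fraction, for example through parameter substitution.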

On Sep 16, 2012, at 11:15 PM, Brian Choi <[EMAIL PROTECTED]> wrote:

> Yes - I saw this issue with SAMPLE in multiple runs. The strangest thing
> about this is that the result approaches the correct value as the sample
> fraction approaches 100% (e.g. 0.99), but gets worse as you move to
> lower sample fractions.
>
>       Brian
>
>
> On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>
>> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
>>
>>> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
>>> nothing but "Math.random() < x" where "x" is a double.
>>>
>> You are right. The SAMPLE operator translates into a filter operator with
>> the condition "Math.random() < x".
>>
>>> In my test, SAMPLE A 0.00001 returns about 10 records with a million
>>> records when running in local mode. I am curious if something can go wrong
>>> when running it in MR mode.
>>>
>>
>> I wouldn't expect different behavior in case of MR mode.
>>
>> Brian,
>> Do you see this behavior across multiple runs ?
>>
>> -Thejas
>>
>>
>>
>>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I am wondering if anyone has run into an issue that I am having
>>>> using SAMPLE in a Pig script to create a subsample of 0.001% from the
>>>> original relation.
>>>>
>>>> Assume the relation "A" contains a single column of data (int type) with
>>>> 1,000,000 records.
>>>>
>>>> Asamp = SAMPLE A 0.00001;
>>>> Asamp2 = SAMPLE A 0.0001;
>>>>
>>>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
>>>> records, respectively. However, what I find is that Asamp and Asamp2 are
>>>> closer to 1000 and 10000 records, which seems like a 100-fold error in
>>>> sample size. Interestingly, in the limiting case of:
>>>>
>>>> Asamp3 = SAMPLE A 0.99;
>>>>
>>>> The actual subsampled size is VERY close to the expected 99% of the
>>>> full relation. Can anyone shed light on what I may be doing wrong, or
>>>> share their experiences if they have also seen issues with using SAMPLE
>>>> in Pig? Thank you.
>>>>
>>>>            Brian
>>>>
>>>>
>>>
>>

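The filter described in the thread ("Math.random() < x") can be simulated outside Pig to see what sample sizes to expect. The sketch below is a rough stand-in and not Pig's own code: it uses Python's `random.random()` in place of Java's `Math.random()`, and the helper names are made up for illustration. It also prints the binomial mean and standard deviation, which shows that a count near 1000 for x = 0.00001 on a million records is far outside normal random variation.

```python
import math
import random


def bernoulli_sample_count(n, x, rng):
    """Count how many of n records pass the filter rng.random() < x."""
    return sum(1 for _ in range(n) if rng.random() < x)


def expected_and_stddev(n, x):
    """Mean and standard deviation of a binomial(n, x) sample size."""
    return n * x, math.sqrt(n * x * (1 - x))


rng = random.Random(42)  # fixed seed so repeated runs are comparable
n = 1_000_000            # relation size from the original question
for x in (0.00001, 0.0001, 0.99):
    kept = bernoulli_sample_count(n, x, rng)
    mean, sd = expected_and_stddev(n, x)
    print(f"x={x}: kept={kept}, expected={mean:.0f} +/- {sd:.1f}")
```

If a plain filter like this gives counts near 10 and 100 while the cluster run gives counts near 1000 and 10000, the discrepancy most likely lies in how the script is run or in that Pig version, not in random chance.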
Dmitriy Ryaboy 2012-09-17, 05:02
Brian Choi 2012-09-17, 05:13
Dmitriy Ryaboy 2012-09-17, 05:24