Pig >> mail # user >> Issues with SAMPLE in PIG v0.8.1


Brian Choi 2012-09-11, 22:37
Cheolsoo Park 2012-09-13, 06:12
Re: Issues with SAMPLE in PIG v0.8.1
On 9/12/12 11:12 PM, Cheolsoo Park wrote:
> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
> nothing but "Math.random() < x" where "x" is a double.
>
You are right. The SAMPLE operator translates into a filter operator with
the condition "Math.random() < x".
> In my test, SAMPLE A 0.00001 returns about 10 records with a million
> records when running in local mode. I am curious if something can go wrong
> when running it in MR mode.

I wouldn't expect different behavior in case of MR mode.
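
As a quick sanity check of the filter model described above, here is a minimal Java sketch that keeps a record whenever a uniform random draw falls below the sampling fraction. It is not Pig's actual implementation: the class name SampleSketch and the method countSampled are hypothetical, and a seeded java.util.Random stands in for Math.random() so that runs are repeatable.

```java
import java.util.Random;

// Hypothetical sketch of "SAMPLE as a random filter"; not Pig source code.
final class SampleSketch {

    // Counts how many of n records pass the filter "random draw < p".
    static long countSampled(long n, double p, Random rng) {
        long kept = 0;
        for (long i = 0; i < n; i++) {
            // nextDouble() is uniform on [0, 1), like Math.random().
            if (rng.nextDouble() < p) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long n = 1_000_000L;                       // records in relation A
        double[] fractions = {0.00001, 0.0001, 0.99};
        Random rng = new Random(42L);              // fixed seed: an assumption for repeatability
        for (double p : fractions) {
            long kept = countSampled(n, p, rng);
            System.out.printf("p=%g expected~%.0f kept=%d%n", p, n * p, kept);
        }
    }
}
```

Under this model the number of kept records is binomial with mean n * p, so for 1,000,000 records one would expect roughly 10, 100, and 990,000 records for the three fractions, with some run-to-run variation around the two small counts.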

Brian,
Do you see this behavior across multiple runs?

-Thejas
>
> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[EMAIL PROTECTED]> wrote:
>
>> Hello Everyone,
>>
>> I am wondering if anyone has run into an issue that I am having
>> using SAMPLE in a Pig script to create a subsample of 0.001% from the
>> original relation.
>>
>> Assume the relation "A" contains a single column of data (int type) with
>> 1,000,000 records.
>>
>> Asamp = SAMPLE A 0.00001;
>> Asamp2 = SAMPLE A 0.0001;
>>
>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
>> records, respectively. However, what I find is Asamp and Asamp2 are closer
>> to 1000 and 10000 records, which seems like a 100-fold error in sample
>> size. Interestingly, in the limiting case of:
>>
>> Asamp3 = SAMPLE A 0.99;
>>
>> The actual subsampled size is VERY close to the expected 99% size of the
>> full sample size. Can anyone shed light on what I may be doing wrong, or
>> share their experiences if they have also seen issues with using SAMPLE in
>> Pig? Thank you.
>>
>>             Brian
>>
>
Brian Choi 2012-09-17, 03:15
Prasanth J 2012-09-17, 03:20
Dmitriy Ryaboy 2012-09-17, 05:02
Brian Choi 2012-09-17, 05:13
Dmitriy Ryaboy 2012-09-17, 05:24