Re: Issues with SAMPLE in PIG v0.8.1
On 9/12/12 11:12 PM, Cheolsoo Park wrote:
> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
> nothing but "Math.random() < x" where "x" is a double.
>
You are right. The SAMPLE operator translates into a filter operator with
the condition "Math.random() < x".
> In my test, SAMPLE A 0.00001 returns about 10 records out of a million
> when running in local mode. I am curious whether something can go wrong
> when running it in MR mode.

I wouldn't expect different behavior in MR mode.
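
As a rough sanity check, the filter semantics can be simulated outside Pig.
The sketch below is Python, not Pig's actual code: random.random() stands in
for Java's Math.random(), and the names sample and expected_count are made up
for illustration. It also computes the binomial mean and standard deviation
of the kept count, so an observed count can be compared against what chance
alone would explain.

```python
import math
import random


def sample(records, fraction, rng):
    """Keep each record independently with probability `fraction`.

    Mirrors a filter of the form "random() < x"; this is an illustrative
    stand-in, not Pig's implementation.
    """
    return [r for r in records if rng.random() < fraction]


def expected_count(n, fraction):
    """Return (mean, standard deviation) of the number of kept records.

    Each record is an independent Bernoulli trial, so the count is binomial.
    """
    mean = n * fraction
    std = math.sqrt(n * fraction * (1.0 - fraction))
    return mean, std


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed, only so reruns are repeatable
    A = range(1_000_000)     # hypothetical relation with 1,000,000 records
    for fraction in (0.00001, 0.0001, 0.99):
        kept = len(sample(A, fraction, rng))
        mean, std = expected_count(len(A), fraction)
        print(f"fraction={fraction}: kept={kept}, "
              f"expected={mean:.0f} +/- {std:.1f}")
```

For 1,000,000 records and a fraction of 0.00001 the expected count is 10
with a standard deviation of about 3.2, so a count near 1000 would not be
explained by random variation alone.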

Brian,
Do you see this behavior across multiple runs?

-Thejas
>
> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[EMAIL PROTECTED]> wrote:
>
>> Hello Everyone,
>>
>> I am wondering if anyone has run into an issue that I am having
>> using SAMPLE in a Pig script to create a subsample of 0.001% of the
>> original relation.
>>
>> Assume the relation "A" contains a single column of data (int type) with
>> 1,000,000 records:
>>
>> Asamp = SAMPLE A 0.00001;
>> Asamp2 = SAMPLE A 0.0001;
>>
>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
>> records, respectively. However, what I find is that Asamp and Asamp2 are
>> closer to 1000 and 10000 records, which looks like a 100-fold error in
>> sample size. Interestingly, in the limiting case of:
>>
>> Asamp3 = SAMPLE A 0.99;
>>
>> The actual subsampled size is very close to the expected 99% of the
>> full relation. Can anyone shed light on what I may be doing wrong, or
>> share their experience if they have also seen issues with SAMPLE in
>> Pig? Thank you.
>>
>>             Brian
>>
>