Pig >> mail # user >> Issues with SAMPLE in PIG v0.8.1


Brian Choi 2012-09-11, 22:37
Cheolsoo Park 2012-09-13, 06:12
Thejas Nair 2012-09-14, 01:15
Brian Choi 2012-09-17, 03:15
Re: Issues with SAMPLE in PIG v0.8.1
I have used the SAMPLE operator while implementing the CUBE operator, where I choose the sample percentage at runtime so that it always emits
around 100K tuples. I tested it on inputs from 1M to 100M tuples and it worked as expected with the trunk version. I haven't tested it with earlier versions.

Thanks
-- Prasanth
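
The runtime choice of sample percentage described above comes down to dividing the target output size by the input size. A minimal Python sketch of that arithmetic follows; it is not the actual CUBE implementation, and the names `sample_fraction`, `total_tuples` and `target_tuples` are hypothetical, introduced only for illustration.

```python
def sample_fraction(total_tuples, target_tuples=100_000):
    """Return the fraction to pass to SAMPLE so that roughly
    `target_tuples` records are emitted (hypothetical helper)."""
    if total_tuples <= 0:
        raise ValueError("total_tuples must be positive")
    # Cap at 1.0: an input smaller than the target is kept whole.
    return min(1.0, target_tuples / total_tuples)


if __name__ == "__main__":
    for total in (1_000_000, 10_000_000, 100_000_000):
        print(total, sample_fraction(total))
```

The resulting fraction would then be substituted into the script, for example as a parameter, before the SAMPLE statement runs. The emitted count is only approximately the target, since each record is kept at random.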

On Sep 16, 2012, at 11:15 PM, Brian Choi <[EMAIL PROTECTED]> wrote:

> Yes - I saw this issue with SAMPLE in multiple runs. The strangest thing
> about this is that the sample size approaches the correct value as the
> sample fraction approaches 100% (e.g. 0.99), but the error gets worse at
> lower sample fractions.
>
>       Brian
>
>
> On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>
>> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
>>
>>> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
>>> nothing but "Math.random() < x" where "x" is a double.
>>>
>> You are right. The SAMPLE operator translates into a filter operator with
>> the condition "Math.random() < x".
>>
>>
>>> In my test, SAMPLE A 0.00001 returns about 10 records from a million
>>> records when running in local mode. I am curious if something can go wrong
>>> when running it in MR mode.
>>>
>>
>> I wouldn't expect different behavior in MR mode.
>>
>> Brian,
>> Do you see this behavior across multiple runs?
>>
>> -Thejas
>>
>>
>>
>>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I am wondering if anyone has run into an issue that I am having
>>>> using SAMPLE in a Pig script to create a subsample of 0.001% from the
>>>> original relation.
>>>>
>>>> Assume the relation "A" contains a single column of data (int type) with
>>>> 1,000,000 records
>>>>
>>>> Asamp = SAMPLE A 0.00001;
>>>> Asamp2 = SAMPLE A 0.0001;
>>>>
>>>> Asamp and Asamp2 should produce subsampled relations with about 10 and 100
>>>> records, respectively. However, what I find is that Asamp and Asamp2 are
>>>> closer to 1000 and 10000 records, which seems like a 100-fold error in
>>>> sample size. Interestingly, in the limiting case of:
>>>>
>>>> Asamp3 = SAMPLE A 0.99;
>>>>
>>>> The actual subsampled size is VERY close to the expected 99% of the
>>>> full relation size. Can anyone shed light on what I may be doing wrong,
>>>> or share their experiences if they have also seen issues with using
>>>> SAMPLE in Pig? Thank you.
>>>>
>>>>            Brian
>>>>
>>>>
>>>
>>

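The filter discussed in the thread, keeping a record when "Math.random() < x", can be simulated outside Pig to see what sample sizes plain chance would produce. Below is a minimal Python sketch under stated assumptions: it is not Pig's implementation, `sample` and `expected_count` are hypothetical names, and Python's `random.random()` stands in for Java's `Math.random()`. Because each record is kept independently, the output size follows a binomial distribution with mean n*x and standard deviation sqrt(n*x*(1-x)).

```python
import random


def sample(records, fraction, rng):
    """Keep each record independently with probability `fraction`,
    mirroring a filter on "random() < x" (hypothetical helper)."""
    return [r for r in records if rng.random() < fraction]


def expected_count(n, fraction):
    """Mean and standard deviation of the sample size (binomial)."""
    mean = n * fraction
    std = (n * fraction * (1.0 - fraction)) ** 0.5
    return mean, std


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so that runs are repeatable
    n = 1_000_000
    for fraction in (0.00001, 0.0001, 0.99):
        got = len(sample(range(n), fraction, rng))
        mean, std = expected_count(n, fraction)
        print(f"fraction={fraction}: got {got}, expected {mean:.0f} +/- {std:.1f}")
```

For 1,000,000 records and a fraction of 0.0001 the mean is 100 with a standard deviation of about 10, so a result near 10000 cannot be explained by random variation in a filter of this form.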
Dmitriy Ryaboy 2012-09-17, 05:02
Brian Choi 2012-09-17, 05:13
Dmitriy Ryaboy 2012-09-17, 05:24