Brian Choi 2012-09-11, 22:37
I am puzzled by this. If I am not mistaken, the SAMPLE operator is
nothing but "Math.random() < x", where "x" is a double.
In my test in local mode, SAMPLE A 0.00001 returns about 10 records out of
a million. I am curious whether something can go wrong when running it in
MapReduce (MR) mode.
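
To make the reasoning above concrete, here is a minimal Python sketch of a per-record filter of the form "random() < x". It is not Pig's actual implementation; the function name bernoulli_sample and the fixed seed are assumptions made for illustration only.

```python
import random


def bernoulli_sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`.

    Hypothetical stand-in for a "random() < x" style filter; this is an
    illustration, not Pig's SAMPLE code.
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


if __name__ == "__main__":
    records = range(1_000_000)
    for fraction in (0.00001, 0.0001, 0.99):
        kept = len(bernoulli_sample(records, fraction, seed=42))
        expected = len(records) * fraction
        print(f"fraction={fraction}: kept {kept}, expected about {expected:.0f}")
```

Under this model the kept counts stay near 10, 100, and 990,000, so a filter of this form would not by itself explain a 100-fold excess.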
On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[EMAIL PROTECTED]> wrote:
> Hello Everyone,
> I am wondering if anyone has run into an issue that I am having when
> using SAMPLE in a Pig script to create a subsample of 0.001% from the
> original relation.
> Assume the relation "A" contains a single column of data (int type) with
> 1,000,000 records:
> Asamp = SAMPLE A 0.00001;
> Asamp2 = SAMPLE A 0.0001;
> Asamp and Asamp2 should produce subsampled relations with 10 and 100
> records, respectively. However, what I find is Asamp and Asamp2 are closer
> to 1000 and 10000 records, which seems like a 100-fold error in sample
> size. Interestingly, in the limiting case of:
> Asamp3 = SAMPLE A 0.99;
> The actual subsampled size is VERY close to the expected 99% of the
> full sample size. Can anyone shed light on what I may be doing wrong, or
> share their experiences if they have also seen issues with using SAMPLE in
> Pig? Thank you.
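
As a rough check on the numbers in the question, assume each record is kept independently with probability p (an assumption about SAMPLE, not something confirmed in this thread). The sample size is then binomial with mean n*p and standard deviation sqrt(n*p*(1-p)). The helper name expected_sample_size below is hypothetical.

```python
import math


def expected_sample_size(n, p):
    """Return (mean, standard deviation) of a Binomial(n, p) sample size.

    Assumes independent per-record sampling with probability p.
    """
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    return mean, std


if __name__ == "__main__":
    for p in (0.00001, 0.0001, 0.99):
        mean, std = expected_sample_size(1_000_000, p)
        print(f"p={p}: expected size {mean:.0f}, std dev {std:.1f}")
```

For p = 0.00001 this gives a mean of 10 with a standard deviation of roughly 3, so an observed size near 1000 is hundreds of standard deviations away and cannot be attributed to random fluctuation under this model.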
Thejas Nair 2012-09-14, 01:15
Brian Choi 2012-09-17, 03:15
Prasanth J 2012-09-17, 03:20
Dmitriy Ryaboy 2012-09-17, 05:02
Brian Choi 2012-09-17, 05:13
Dmitriy Ryaboy 2012-09-17, 05:24