-
Issues with SAMPLE in PIG v0.8.1
Brian Choi 2012-09-11, 22:37
Hello Everyone,
I am wondering if anyone has run into an issue that I am having using SAMPLE in a Pig script to create a subsample of 0.001% from the original relation.
Assume the relation "A" contains a single column of data (int type) with 1,000,000 records
Asamp = SAMPLE A 0.00001;
Asamp2 = SAMPLE A 0.0001;
Asamp and Asamp2 should produce subsampled relations with 10 and 100 records, respectively. However, what I find is Asamp and Asamp2 are closer to 1000 and 10000 records, which seems like a 100-fold error in sample size. Interestingly, in the limiting case of:
Asamp3 = SAMPLE A 0.99;
The actual subsample size is VERY close to the expected 99% of the full relation. Can anyone shed light on what I may be doing wrong, or share their experiences if they have also seen issues with using SAMPLE in Pig? Thank you.
Brian
-
Re: Issues with SAMPLE in PIG v0.8.1
Cheolsoo Park 2012-09-13, 06:12
I am puzzled about this. If I am not mistaken, the SAMPLE operator is nothing but "Math.random() < x" where "x" is a double.
In my test, SAMPLE A 0.00001 returns about 10 records with a million records when running in local mode. I am curious if something can go wrong when running it in MR mode.
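The claim that SAMPLE amounts to a per-record test of the form "Math.random() < x" can be illustrated with a small simulation. This is only a sketch in Python, not Pig's actual implementation: `sample_count` is a hypothetical helper, and Python's `random.random()` stands in for Java's `Math.random()`. Under that assumption, the number of records kept should land near n * x, with some run-to-run variation.

```python
import random


def sample_count(n_records, fraction, seed=0):
    """Count records kept when each one is retained with probability `fraction`.

    Hypothetical helper that mimics a filter of the form `random() < fraction`.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same count
    return sum(1 for _ in range(n_records) if rng.random() < fraction)


if __name__ == "__main__":
    n = 1_000_000  # relation size used in the original question
    for fraction in (0.00001, 0.0001, 0.99):
        kept = sample_count(n, fraction)
        print(f"fraction={fraction}: kept {kept}, expected about {n * fraction:.0f}")
```

If the filter behaves this way, a 100-fold excess at small fractions would not come from the random test itself, which is consistent with the local-mode result reported above.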
-
Re: Issues with SAMPLE in PIG v0.8.1
Thejas Nair 2012-09-14, 01:15
On 9/12/12 11:12 PM, Cheolsoo Park wrote:
> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
> nothing but "Math.random() < x" where "x" is a double.
You are right. The SAMPLE operator translates into a filter operator with the condition "Math.random() < x".
> In my test, SAMPLE A 0.00001 returns about 10 records with a million
> records when running in local mode. I am curious if something can go wrong
> when running it in MR mode.
I wouldn't expect different behavior in case of MR mode.
Brian, do you see this behavior across multiple runs?
-Thejas
-
Re: Issues with SAMPLE in PIG v0.8.1
Brian Choi 2012-09-17, 03:15
Yes, I saw this issue with SAMPLE in multiple runs. The strangest thing about this is that the sample size approaches the correct value as the sample fraction approaches 100% (for example, 0.99), but gets worse at lower sample fractions.
Brian
-
Re: Issues with SAMPLE in PIG v0.8.1
Prasanth J 2012-09-17, 03:20
I have used the SAMPLE operator while implementing the CUBE operator, where I choose the sample percentage at runtime so that it always emits around 100K tuples. I tested it on inputs from 1M to 100M tuples with the trunk version, and it worked as expected. I haven't tested with earlier versions.
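How the sample percentage is chosen at runtime is not described in the message. One plausible reading, shown here purely as an assumption, is to divide the target tuple count by the input size and cap the result at 1. The function name `sample_fraction` and its parameters are hypothetical and do not come from the CUBE implementation.

```python
def sample_fraction(total_tuples, target_tuples=100_000):
    """Return a sampling fraction that yields roughly `target_tuples` on average.

    Hypothetical helper: assumes the fraction is simply target / total,
    capped at 1.0 for inputs smaller than the target.
    """
    if total_tuples <= 0:
        raise ValueError("total_tuples must be positive")
    return min(1.0, target_tuples / total_tuples)


if __name__ == "__main__":
    # Input sizes spanning the 1M to 100M range mentioned above.
    for total in (1_000_000, 10_000_000, 100_000_000):
        print(total, sample_fraction(total))
```

Under this reading, the fractions for 1M to 100M tuples range from 0.1 down to 0.001, which overlaps the small-fraction regime discussed in this thread.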
Thanks -- Prasanth
-
Re: Issues with SAMPLE in PIG v0.8.1
Dmitriy Ryaboy 2012-09-17, 05:02
Brian, could you provide a complete script that reproduces the issue? What version of pig are you on?
Thanks, -D
-
Re: Issues with SAMPLE in PIG v0.8.1
Brian Choi 2012-09-17, 05:13
The PIG script would be simply as follows:
UIDs = FOREACH xRelation GENERATE $0 as user_id;
UIDsample = SAMPLE UIDs 0.001;
UIDsampleCount = FOREACH (GROUP UIDsample ALL) GENERATE COUNT($1);
where the number of UIDs is ~2.5MM user IDs. In this case UIDsampleCount comes out at ~250,000 records, but it should be ~2,500.
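The arithmetic behind these numbers can be checked directly. The sketch below uses the approximate figures from this message; the function names are hypothetical. It computes the expected sample size for the requested fraction and the fraction that the observed count would actually correspond to.

```python
def expected_sample_size(total_records, fraction):
    """Expected number of records kept when each is retained with probability `fraction`."""
    return total_records * fraction


def implied_fraction(observed_records, total_records):
    """Fraction that an observed sample size actually corresponds to."""
    return observed_records / total_records


if __name__ == "__main__":
    total = 2_500_000    # ~2.5MM user IDs (approximate, from the message)
    requested = 0.001    # SAMPLE UIDs 0.001
    observed = 250_000   # approximate count reported in the message

    implied = implied_fraction(observed, total)
    print(f"expected size: {expected_sample_size(total, requested):.0f}")
    print(f"implied fraction: {implied:.3f}")
    print(f"implied / requested: {implied / requested:.0f}x")
```

An observed count of ~250,000 out of ~2.5MM is what a fraction of about 0.1 would produce, which matches the 100-fold discrepancy described in the first message.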
The version I am using is pig-0.8.1.
Please let me know if there is any other information that you would like me to provide.
Brian
-
Re: Issues with SAMPLE in PIG v0.8.1
Dmitriy Ryaboy 2012-09-17, 05:24
I just ran this very script three times using Pig 0.8 (svn revision 1148107) on a set of 2.5 million rows and got (2509), (2552), and (2473) as the output.
Don't know what to tell you.. can't reproduce. Are you sure you are running on the input you think you are running on?
Is this literally the script you ran?
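Whether the three counts above (2509, 2552, 2473) are consistent with a 0.001 sample of 2.5 million rows can be judged against the binomial spread. This is a rough sketch that assumes each row is kept independently with probability 0.001; `binomial_z_scores` is a hypothetical helper.

```python
import math


def binomial_z_scores(counts, n, p):
    """Return how many standard deviations each count lies from the binomial mean n * p."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return [(count - mean) / sd for count in counts]


if __name__ == "__main__":
    n, p = 2_500_000, 0.001
    # Counts reported in the message above, plus the ~250,000 from the earlier report.
    for count, z in zip((2509, 2552, 2473, 250_000),
                        binomial_z_scores((2509, 2552, 2473, 250_000), n, p)):
        print(f"count={count}: z={z:.2f}")
```

Under that assumption, the three reproduced counts all fall within two standard deviations of the mean of 2,500, while a count near 250,000 is thousands of standard deviations away, so it cannot be explained by sampling noise.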