-
STRSPLIT problems (or UDF shortcoming?)
Nerius Landys 2012-05-17, 17:57

I'm having problems using Pig's STRSPLIT (on Amazon's cloud computing environment). I also noticed that STRSPLIT isn't documented in the Pig Latin Reference Manual, so I found out about it through other sources of information.

My problem is that in certain cases STRSPLIT returns null. I have no idea why. Here is an actual session I ran to demonstrate the problem:

grunt> CAT s3://otg-nlandys/pig-tut/bin-proto-4;
Meta 1234567890 foo 34
Movement 1234567890 Rambetter 1/1 2/3
Movement 1234567890 Freddyman 10/1 10/2

grunt> A = LOAD 's3://otg-nlandys/pig-tut/bin-proto-4';
grunt> DUMP A;
(Meta,1234567890,foo,34)
(Movement,1234567890,Rambetter,1/1,2/3)
(Movement,1234567890,Freddyman,10/1,10/2)

grunt> MOVEMENT = FILTER A BY (chararray) $0 == 'Movement';
grunt> DUMP MOVEMENT;
(Movement,1234567890,Rambetter,1/1,2/3)
(Movement,1234567890,Freddyman,10/1,10/2)

grunt> TEST = FOREACH MOVEMENT GENERATE $3 AS startpos:chararray;
grunt> DUMP TEST;
(1/1)
(10/1)

grunt> POSA = FOREACH TEST GENERATE STRSPLIT(startpos,'/');
grunt> DUMP POSA;
()
()

_________________________________________________________________

grunt> CAT s3://otg-nlandys/pig-tut/bin-proto-5;
1/1
10/1

grunt> B = LOAD 's3://otg-nlandys/pig-tut/bin-proto-5' AS startpos:chararray;
grunt> DUMP B;
(1/1)
(10/1)

grunt> POSB = FOREACH B GENERATE STRSPLIT(startpos,'/');
grunt> DUMP POSB;
((1,1))
((10,1))

_________________________________________________________________

My question is why POSA contains empty rows and POSB doesn't, when it seems that they should be identical.

I'm kind of new to Pig and realize that the problem might be a shortcoming of UDFs and how Pig works with data of varying column count, but I would like an explanation. Thanks.

Also, one other minor bug with STRSPLIT that I noticed: if your first argument to STRSPLIT is a bytearray instead of a chararray, it will return null. So you have to explicitly cast the bytearray to chararray for it to work. Seems that this could be automated in the language, no?

- Nerius
-
Re: STRSPLIT problems (or UDF shortcoming?)
krishnan N 2012-05-17, 19:44

Hi,

I did the same but with one change: I changed the file's column delimiter to ',' and it worked.

((34))
((1,1))
((10,1))

Please try the same.

Thanks,
Krishnan
-
Re: STRSPLIT problems (or UDF shortcoming?)
Nerius Landys 2012-05-17, 21:36

> I did the same but with one changes , that is I changed the file column
> delimiter to ',' and it worked.

I've tried both '/' and ',' as delimiters for the STRSPLIT function and both fail in my example.
-
Re: STRSPLIT problems (or UDF shortcoming?)
Ranjith 2012-05-17, 22:40

This is pretty interesting. Shot in the dark, but can you try STRSPLIT with a limit of -1 on one of the input values? For example: STRSPLIT(abc,'/',-1).

Thanks,
Ranjith
-
Re: STRSPLIT problems (or UDF shortcoming?)
Dan Young 2012-05-17, 22:42

Did you try to escape the backslash?

Dano
-
Re: STRSPLIT problems (or UDF shortcoming?)
Nerius Landys 2012-05-17, 23:02

> Did you try to escape the backslash?

I just tried this:

POSA = FOREACH TEST GENERATE STRSPLIT(startpos,'\\u002F');

... and still the same result. By the way, I'm using a forward slash for the separator character.

I also tried this:

POSA = FOREACH TEST GENERATE STRSPLIT(startpos,'/',-1);

... and I'm still getting null rows.

If you look at my original post you'll see that the data contained in POSA and POSB should be identical. Something is getting screwy during the processing stage, where processing functions are "concatenated" together. If I save the output from each step to a file and load it back in, things work fine. I demonstrated this in my original post.

Very strange, but I really need to get this resolved.
-
Re: STRSPLIT problems (or UDF shortcoming?)
Dan Young 2012-05-17, 23:06

What version of Pig are you using on EMR?
-
Re: STRSPLIT problems (or UDF shortcoming?)
Nerius Landys 2012-05-17, 23:12

> What version of pig are you using on EMR?

hadoop@ip-10-190-83-146:~$ pig --version
Apache Pig version 0.9.2-amzn (rexported)
compiled Apr 06 2012, 23:48:53
-
Re: STRSPLIT problems (or UDF shortcoming?)
Dan Young 2012-05-17, 23:24

Have you tried 0.10?
-
Re: STRSPLIT problems (or UDF shortcoming?)
Nerius Landys 2012-05-17, 23:26

> Have you tried 0.10?

No, but I can and will try it. I've been using whatever is on Amazon because that is the system that we'll be using. I'll report back on my findings.
-
Re: STRSPLIT problems (or UDF shortcoming?)
Dan Young 2012-05-17, 23:30

We ended up using 0.10 on EMR and it's been working fine so far...

Dano
-
Re: STRSPLIT problems (or UDF shortcoming?)
Dan Young 2012-05-17, 23:37

A quick test would be to scp the 0.10 pig.jar over to your master node, and then run:

hadoop jar pig.jar

Then run your script in grunt...

Dano
-
Re: STRSPLIT problems (or UDF shortcoming?)
Nerius Landys 2012-05-17, 23:39

> We ended up using 0.10 on EMR and its been working fine so far...

OK, a bit of bad news: 0.10 did not fix my problem. I'll recap the entire situation. HADOOP_HOME is set to hadoop-0.20.205.0, and the Pig version is now pig-0.10.0.

File 'bin-proto-4' is:

Meta 1234567890 foo 34
Movement 1234567890 Rambetter 1/1 2/3
Movement 1234567890 Freddyman 10/1 10/2

(with tab delimiters)

grunt> A = LOAD 'bin-proto-4';
grunt> MOVEMENT = FILTER A BY (chararray) $0 == 'Movement';
grunt> TEST = FOREACH MOVEMENT GENERATE $3 AS startpos:chararray;
grunt> POSA = FOREACH TEST GENERATE STRSPLIT(startpos,'/');
grunt> DUMP POSA;
()
()

grunt> DUMP TEST;
(1/1)
(10/1)

I ran this on my local machine just now.
-
Re: STRSPLIT problems (or UDF shortcoming?)
Norbert Burger 2012-05-18, 16:25

From what I can tell, this does seem like a bug. Switching to positional specifiers seems to work around the issue:

TEST = FOREACH MOVEMENT GENERATE $3;
POSA = FOREACH TEST GENERATE STRSPLIT($0, '/');

Possibly some casting is being applied in one case (positional specifiers) but not the other?

Norbert
-
Re: STRSPLIT problems (or UDF shortcoming?)
Nerius Landys 2012-05-18, 17:13

> From what I can tell, this does seem like a bug. Switching to positional
> specifiers seems to work around the issue:
>
> TEST = FOREACH MOVEMENT GENERATE $3;
> POSA = FOREACH TEST GENERATE STRSPLIT($0, '/');
>
> Possibly some casting is being applied in one case (positional specifiers)
> but not the other?

Wow, I just made a very interesting finding after trying your advice. The two sessions below are identical except for lines 2 and 7. Line 7 has the "AS startpos:chararray", whereas line 2 has no "AS".

0. grunt> A = LOAD 'bin-proto-4';
1. grunt> MOVEMENT = FILTER A BY (chararray) $0 == 'Movement';
2. grunt> TEST = FOREACH MOVEMENT GENERATE $3;
3. grunt> POSA = FOREACH TEST GENERATE STRSPLIT($0, '/');
4. grunt> DUMP POSA;
((1,1))
((10,1))

5. grunt> A = LOAD 'bin-proto-4';
6. grunt> MOVEMENT = FILTER A BY (chararray) $0 == 'Movement';
7. grunt> TEST = FOREACH MOVEMENT GENERATE $3 AS startpos:chararray;
8. grunt> POSA = FOREACH TEST GENERATE STRSPLIT($0, '/');
9. grunt> DUMP POSA;
()
()
-
Re: STRSPLIT problems (or UDF shortcoming?)
Norbert Burger 2012-05-19, 00:02

Right - this was my point. Dropping the 'as' clause forces you to use positional specifiers, which don't seem to have the same issue. Seems like this would warrant a JIRA, if only to document the distinction a bit better.

Norbert
-
Re: STRSPLIT problems (or UDF shortcoming?)
Nerius Landys 2012-05-19, 02:59

> Right - this was my point. Dropping the 'as' clause forces you to use
> positional specifiers, which don't seem to have the same issue. Seems like
> this would warrant a JIRA, if only to document the distinction a bit better.

Yeah, but in my example I _am_ using positional specifiers in the STRSPLIT function, and it fails. The thing that apparently makes it fail is just having the named column or column type defined on the relation. See line 8 below - positional specifier.

>> 5. grunt> A = LOAD 'bin-proto-4';
>> 6. grunt> MOVEMENT = FILTER A BY (chararray) $0 == 'Movement';
>> 7. grunt> TEST = FOREACH MOVEMENT GENERATE $3 AS startpos:chararray;
>> 8. grunt> POSA = FOREACH TEST GENERATE STRSPLIT($0, '/');
>> 9. grunt> DUMP POSA;
>> ()
>> ()
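
Editorial sketch, not part of the original thread: the sessions above suggest that the failure follows the "AS startpos:chararray" declaration rather than the way the column is referenced. One guess, in line with Norbert's casting theory, is that the AS clause in a FOREACH only labels the column as chararray without converting the underlying bytearray value. Assuming that is what happens, an explicit cast in the GENERATE expression might be worth trying. The snippet reuses the 'bin-proto-4' file and relation names from the thread; nobody in the thread reported running it, so treat it as an experiment rather than a confirmed fix.

```pig
-- Assumes the same tab-delimited 'bin-proto-4' file used in this thread.
A = LOAD 'bin-proto-4';
MOVEMENT = FILTER A BY (chararray) $0 == 'Movement';

-- Explicit cast on the value itself, then a name without a type.
-- Assumption: this converts the bytearray to chararray at run time,
-- instead of only declaring chararray in the schema.
TEST = FOREACH MOVEMENT GENERATE (chararray) $3 AS startpos;

-- DESCRIBE prints the declared schema, which helps when comparing
-- this variant against the failing "AS startpos:chararray" one.
DESCRIBE TEST;

POSA = FOREACH TEST GENERATE STRSPLIT(startpos, '/');
DUMP POSA;
```

If POSA comes back populated with this variant but empty with the typed AS clause, that would be a useful detail to include in the JIRA Norbert suggested.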