How to access to the tuple items of REGEX_EXTRACT_ALL ?
brice lecomte 2013-02-27, 16:49
Hello,
I'd like to directly access the items of the tuple returned by:

grunt> c = foreach logs generate REGEX_EXTRACT_ALL(f1, '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
grunt> illustrate c;
-------------------------------------------------------------------------------------------------------------
| logs     | f1:chararray                                                                                   |
-------------------------------------------------------------------------------------------------------------
|          | Feb 24 20:09:01 hadoop-master CRON[3574]: pam_unix(cron:session): session closed for user root |
-------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------
| c     | org.apache.pig.builtin.regex_extract_all_f1_178:tuple()          |
----------------------------------------------------------------------------
|       | (Feb, ..., pam_unix(cron:session): session closed for user root) |
----------------------------------------------------------------------------
but the only way I have found is to store the result and load it again:

grunt> store c into 'pig/AUTH.result';
grunt> auth = LOAD 'pig/AUTH.result/part-m-00000' USING PigStorage(',') AS (m:chararray, d:int, time:chararray, hostname:chararray, service:chararray, info:chararray);
grunt> day_frequency = GROUP auth by (d,service);
...

Is there a way to name the tuple items, or to access them with something like c.$0 or FLATTEN(c).$0?

Thanks,
Brice
Re: How to access to the tuple items of REGEX_EXTRACT_ALL ?
Johnny Zhang 2013-02-27, 19:26
Hi Brice,
Instead of storing and reloading it, can you try 'dump c;' first and then use c.$0?

Johnny

On Wed, Feb 27, 2013 at 8:49 AM, brice lecomte <[EMAIL PROTECTED]> wrote:
> Hello,
> --Pig 0.10.0--
> [...]
Re: How to access to the tuple items of REGEX_EXTRACT_ALL ?
brice lecomte 2013-02-28, 10:27
Hi Johnny,
bad news:

grunt> REGISTER json-simple-1.1.1.jar
grunt> REGISTER lib/jackson-core-asl-1.8.8.jar
grunt> REGISTER lib/jackson-mapper-asl-1.8.8.jar
grunt> REGISTER /usr/local/pig-0.10.1-src/build/ivy/lib/Pig/avro-1.5.3.jar
grunt> REGISTER /usr/local/pig-0.10.1-src/contrib/piggybank/java/piggybank.jar
grunt> logs = LOAD 'auth.log' as (f1:chararray);
grunt> c = foreach logs generate REGEX_EXTRACT_ALL(f1, '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
grunt> df = GROUP c by ($1, $4);
2013-02-28 10:57:32,630 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 3, column 17> Out of bound access. Trying to access non-existent column: 1. Schema org.apache.pig.builtin.regex_extract_all_f1_4:tuple() *has 1 column(s)*.
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
grunt> dump c;
[...]

*((Feb,28,10:50:13,hadoop-master,sshd,debug1: session_input_channel_req: session 0 req window-change))*

=> this looks like a tuple of tuples?

grunt> df = GROUP c by ($1, $4);
2013-02-28 10:57:59,274 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 3, column 17> Out of bound access. Trying to access non-existent column: 1. Schema org.apache.pig.builtin.regex_extract_all_f1_10:tuple() has 1 column(s).
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
grunt> df = GROUP c by (c.$1, c.$4);
2013-02-28 10:58:06,873 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 3, column 17> Invalid scalar projection: c
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
grunt> df = GROUP c by (c.$0.$1, c.$0.$4);
grunt> dump df;
[...]

2013-02-28 10:58:46,781 [Thread-16] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : ((Feb,24,07:39:01,hadoop-master,CRON,pam_unix(cron:session): session opened for user root by (uid=0))), 2nd :((Feb,24,07:39:01,hadoop-master,CRON,pam_unix(cron:session): session closed for user root))
        at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
[...]

grunt> df = GROUP c by (FLATTEN(c.$1), FLATTEN(c.$4));
2013-02-28 10:59:31,187 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 4, column 25> Invalid scalar projection: c
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log

grunt> df = GROUP c by (FLATTEN(c).$1, FLATTEN(c).$4);
2013-02-28 10:59:51,062 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 4, column 25> Invalid scalar projection: c : A column needs to be projected from a relation for it to be used as a scalar
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log

grunt> df = GROUP c by (FLATTEN(c.$0).$1, FLATTEN(c.$0).$4);
2013-02-28 11:17:46,744 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve FLATTEN using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log

I even tried the Perl way:

grunt> (m:chararray, d:int, time:chararray, hostname:chararray, service:chararray, info:chararray) = foreach logs generate REGEX_EXTRACT_ALL(f1, '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
2013-02-28 11:23:47,995 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 1. Encountered: "(" (40), after : ""
:(
On 27/02/2013 20:26, Johnny Zhang wrote:
> Hi, Brice:
> Instead of save&reload it, can you try 'dump c;' first then use c.$0 ?
> [...]
Re: How to access to the tuple items of REGEX_EXTRACT_ALL ?
brice lecomte 2013-02-28, 14:48
Good news: the result of REGEX_EXTRACT_ALL needs to be cast to a typed tuple before it can be used by FLATTEN, and the flattened items can then be named, such as:

LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN((tuple(CHARARRAY,int,CHARARRAY,CHARARRAY,CHARARRAY,CHARARRAY))REGEX_EXTRACT_ALL(line, '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)'))
    as (m:chararray, d:int, time:chararray, hostname:chararray, service:chararray, info:chararray);

On 28/02/2013 11:27, brice lecomte wrote:
> Hi Johnny,
> bad things,
> [...]
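
For readers following along, here is a minimal sketch of how the cast-and-FLATTEN statement above might feed the grouping step from the first message, without the store-and-reload round trip. The LOAD line and the COUNT step are assumptions that do not appear in the thread (only 'auth.log', the field names, and the GROUP by (d, service) do), and it has not been checked against every Pig release.

```pig
-- Assumed load step: the thread shows 'auth.log' loaded as one chararray field,
-- but never shows how RAW_LOGS itself was defined.
RAW_LOGS = LOAD 'auth.log' AS (line:chararray);

-- Statement from the thread: cast the untyped tuple, flatten it, name the fields.
LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN((tuple(CHARARRAY,int,CHARARRAY,CHARARRAY,CHARARRAY,CHARARRAY))REGEX_EXTRACT_ALL(line, '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)'))
    AS (m:chararray, d:int, time:chararray, hostname:chararray, service:chararray, info:chararray);

-- The named fields can now be used directly as grouping keys.
day_frequency = GROUP LOGS_BASE BY (d, service);

-- Assumed follow-up (not in the thread): count log lines per day and service.
day_counts = FOREACH day_frequency GENERATE
    FLATTEN(group) AS (d, service),
    COUNT(LOGS_BASE) AS n;

DUMP day_counts;
```

The earlier attempts in the thread failed because c had a single column holding the whole tuple; once that tuple is flattened in the FOREACH, each captured group becomes its own column of the relation. How lines that do not match the pattern behave after the FLATTEN is not covered in the thread and is worth checking on real data.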