Pig, mail # user - Pig on EMR: how to include semicolon in regex argument of EXTRACT function

Re: Pig on EMR: how to include semicolon in regex argument of EXTRACT function

It seems my first message fell through the cracks, so I apologize if you
receive this twice. Yes, it is a known issue, and there isn't a stable
version with the fix yet. I see two ways to work around it:

1. write a UDF that encapsulates the regex

2. load the regex from a file
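
For option 1, a rough sketch of the regex logic such a UDF could wrap is below. It is plain Java using java.util.regex, so it runs without Pig on the classpath. The class name TestKeyExtractor and the method extract are made up for illustration. A real UDF would presumably extend org.apache.pig.EvalFunc<String> and call something like this from exec(Tuple); that wrapper is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): holds the regex in Java source,
// so the Pig script itself never has to contain a semicolon.
final class TestKeyExtractor {
    // Same pattern as in the question; "\\S" in Java source is the regex \S.
    private static final Pattern TEST_PATTERN = Pattern.compile("test=(\\S+);?");

    // Returns capture group 1 of the first match, or null if there is no match.
    static String extract(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = TEST_PATTERN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("test=12345"));          // 12345
        System.out.println(extract("foo=bar;test=12345"));  // 12345
        System.out.println(extract("test=12345;foo=bar"));  // 12345;foo=bar
    }
}
```

The last call shows that \S+ is greedy and also consumes semicolons, so when the test key is not the last one the captured group runs to the end of the string. If that is not what you want, a character class that excludes the separator, for example test=([^;\s]+), may be closer.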

I actually tested number 2. I ran it on Pig 0.10.0, but it should work on a
recent version of EMR too:

$ echo "test=(\\S+);?" > testregex.txt
$ hadoop fs -put testregex.txt /tmp

B = LOAD '/tmp/testregex.txt' as (regex :chararray);

blah = FOREACH data GENERATE
         FLATTEN (
           REGEX_EXTRACT (
             str_of_interest, B.regex, 1
           )
         )
         AS (
           test: chararray
         );


On 16-04-2013 02:03, Dylan Sather wrote:
> Hi y'all,
> First time on this list, and hoping you might be able to help me with a
> (possible) issue.
> I'm working with some data in Pig that includes strings of interest,
> optionally separated by semicolons and in random order, e.g.
>      test=12345;foo=bar
>      test=12345
>      foo=bar;test=12345
> The following code should extract the value of the string for the test
> 'key':
>      blah =
>        FOREACH
>          data
>        GENERATE
>          FLATTEN (
>            EXTRACT (
>              str_of_interest,
>              'test=(\\S+);?'
>            )
>          )
>          AS (
>            test: chararray
>          )
>        ;
> However, when running the code, I encounter the following error:
>      <line 46, column 0>  mismatched character '<EOF>' expecting '''
>      2013-04-16 04:46:05,245 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 46, column 0>  mismatched character '<EOF>' expecting '''
> I thought I had my regex escape syntax off at first, but that doesn't
> appear to be the problem. The only information I get from a Google search
> is a bug report (https://issues.apache.org/jira/browse/PIG-2507) that
> appears to have been recently fixed, but it's still an issue on the Amazon
> EMR cluster I'm running (spun up ad hoc, just now, for this analysis).
> As in the bug report and as suggested elsewhere, replacing the semicolon
> with its Unicode equivalent (\u003B) yields the same error.
> I could be crazy and this could be a syntax issue, so I'm hoping someone
> might be able to point me in the right direction or confirm that this is an
> existing problem. If the latter, are there any workarounds (either in Pig,
> or for matching the string I want)?
> Cheers.
> Dylan