Re: Pig on EMR: how to include semicolon in regex argument of EXTRACT function
MARCOS MEDRADO RUBINELLI 2013-04-16, 11:06
It seems my first message fell through the cracks, so I apologize if you
receive it twice, but: yes, it is a known issue, and there isn't a stable
version with the fix yet. I see two ways to work around it:
1. write a UDF that encapsulates the regex
2. load the regex from a file
I actually tested option 2. I ran it on Pig 0.10.0, but it should work on a
recent version of EMR too:
$ echo "test=(\\S+);?" > testregex.txt
$ hadoop fs -put testregex.txt /tmp
B = LOAD '/tmp/testregex.txt' as (regex :chararray);
str_of_interest, B.regex, 1
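
For anyone who wants to try workaround 2 locally before touching a cluster, here is a minimal sketch of the whole flow as one shell script. It is not the exact script tested above: the relation names A and C, the sample input line, the local file names, and the use of the built-in REGEX_EXTRACT are assumptions filled in for illustration. The idea is that the semicolon lives only in the data file, so the Pig parser never sees it inside a quoted string.

```shell
#!/bin/sh
# Sketch of workaround 2: keep the regex out of the Pig script entirely.
workdir="$(mktemp -d)"

# Single quotes avoid the doubled backslash that double quotes require.
printf '%s\n' 'test=(\S+);?' > "$workdir/testregex.txt"

# Hypothetical sample input; the real data set is not shown in the thread.
printf '%s\n' 'test=abc' > "$workdir/input.txt"

# The quoted heredoc delimiter ('EOF') stops the shell from expanding anything.
# Relation names A and C and the REGEX_EXTRACT call are assumptions.
cat > "$workdir/extract.pig" <<'EOF'
A = LOAD 'input.txt' AS (str_of_interest:chararray);
B = LOAD 'testregex.txt' AS (regex:chararray);
C = FOREACH A GENERATE REGEX_EXTRACT(str_of_interest, B.regex, 1) AS test;
DUMP C;
EOF

# To run it in local mode (requires a Pig installation):
#   cd "$workdir" && pig -x local extract.pig
echo "Files written to $workdir"
```

B.regex is used as a scalar here, which only works if B contains exactly one row, so the regex file should hold a single line. On a cluster, the two data files would go onto HDFS with hadoop fs -put as shown above, and the LOAD paths would change to match.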
On 16-04-2013 02:03, Dylan Sather wrote:
> Hi y'all,
> First time on this list, and hoping you might be able to help me with a
> (possible) issue.
> I'm working with some data in Pig that includes strings of interest,
> optionally separated by semicolons and in random order, e.g.
> The following code should extract the value of the string for the test
> blah > FOREACH
> FLATTEN (
> EXTRACT (
> AS (
> test: chararray
> However, when running the code, I encounter the following error:
> <line 46, column 0> mismatched character '<EOF>' expecting '''
> 2013-04-16 04:46:05,245 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 46, column 0> mismatched character '<EOF>' expecting '''
> I thought I had my regex escape syntax off at first, but that doesn't
> appear to be the problem. The only information I get from a Google search
> is a bug report (https://issues.apache.org/jira/browse/PIG-2507) that
> appears to have been recently fixed, but it's still an issue on the Amazon
> EMR cluster I'm running (spun up ad hoc, just now, for this analysis).
> As in the bug report and as suggested elsewhere, replacing the semicolon
> with its Unicode equivalent (\u003B) yields the same error.
> I could be crazy and this could be a syntax issue, so I'm hoping someone
> might be able to point me in the right direction or confirm that this is an
> existing problem. If the latter, are there any workarounds (either in Pig,
> or for matching the string I want)?