RE: Error while loading UTF-8 strings into bags
You are probably right: TextDataParser does not handle UTF-8. I have
filed a JIRA issue for this, PIG-681
(https://issues.apache.org/jira/browse/PIG-681)

Santhosh

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
[EMAIL PROTECTED]
Sent: Tuesday, February 17, 2009 2:46 AM
To: [EMAIL PROTECTED]
Subject: Re: Error while loading UTF-8 strings into bags

I get this error on a set of non-empty, non-ASCII strings. If I
change the strings to ASCII, the error disappears.

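So, as a temporary workaround, non-ASCII UTF-8 data can be encoded with
URLEncoder (and decoded with URLDecoder) so that it can be processed
within Pig bags. However, every non-ASCII byte becomes three characters
(%XX), so the data needs up to three times more space to store, and the
extra encode/decode step means more code to process it.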
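
A minimal sketch of that workaround in plain Java, to show the round trip and the size cost. The class name BagSafeText and its two methods are made up for this illustration and are not part of Pig; only java.net.URLEncoder and java.net.URLDecoder are real APIs. How to call it from a Pig script (for example from a user-defined function) is left out here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper (not part of Pig): percent-encodes text so that only
// ASCII characters are written into the bag, and restores it after loading.
class BagSafeText {
    private static final String UTF8 = "UTF-8";

    // Encode before storing: each non-ASCII UTF-8 byte becomes "%XX".
    static String encode(String raw) {
        try {
            return URLEncoder.encode(raw, UTF8);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every JVM.
            throw new IllegalStateException(e);
        }
    }

    // Decode after loading: reverses encode().
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, UTF8);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, written as escapes so the source stays ASCII.
        String phrase = "\u043f\u0440\u0438\u0432\u0435\u0442";
        String safe = encode(phrase);

        System.out.println(safe);                        // ASCII-only form
        System.out.println(safe.length());               // 36: 12 UTF-8 bytes * 3
        System.out.println(decode(safe).equals(phrase)); // true
    }
}
```

URLEncoder also escapes characters such as "(", ")" and ",", so the encoded value should not collide with the delimiters used in the stored bag text, although I have not checked this against TextDataParser itself.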

2009/2/17 Santhosh Srinivasan <[EMAIL PROTECTED]>:
> This could be a bug in TextDataParser due to the presence of empty
> strings in the data.
>
> Santhosh
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> [EMAIL PROTECTED]
> Sent: Monday, February 16, 2009 6:05 AM
> To: [EMAIL PROTECTED]
> Subject: Error while loading UTF-8 strings into bags
>
> Hi,
>
> I'm trying to use utf-8 strings as follows:
>
> phrases = load 'phrases' as (data: chararray, f: int);
> a = group phrases by f;
> b = foreach a generate group as f, phrases.data as data;
> store b into 'grouped';
>
> b = load 'grouped' as (f: int, data: bag{t: tuple(data: chararray)});
> c = foreach b generate f, data;       -- just store in this sample
> store c into 'final';
>
> [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2043: Unexpected error during execution.
>
> org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 3.
>        at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:619)
>        at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:568)
>        at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:623)
>        at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:153)
>        at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:85)
>        at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:345)
>        at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
>        at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:71)
>        at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:79)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:908)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:244)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:198)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:226)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:187)
>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:203)
>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:194)
>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>        at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> Is there any way to use utf-8 strings in pig bags?
> Thanks.
>