Pig >> mail # user >> Error while loading UTF-8 strings into bags


RE: Error while loading UTF-8 strings into bags
You are probably right: TextDataParser does not handle UTF-8. I have
filed a JIRA for this issue, PIG-681
(https://issues.apache.org/jira/browse/PIG-681).

Santhosh

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
[EMAIL PROTECTED]
Sent: Tuesday, February 17, 2009 2:46 AM
To: [EMAIL PROTECTED]
Subject: Re: Error while loading UTF-8 strings into bags

I get this error on a set of non-empty, non-ASCII strings.  If I
change the strings to ASCII, the error disappears.

So, as a temporary workaround, non-ASCII UTF-8 data can be encoded with
URLEncoder and decoded with URLDecoder so that it can be processed
within Pig bags. However, this takes up to three times more space to
store and requires more code to process.
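
For anyone who wants to try the same workaround, here is a minimal sketch
of the encode/decode step using the standard java.net.URLEncoder and
java.net.URLDecoder classes. The class name Utf8BagWorkaround and its two
helper methods are hypothetical names of mine, not part of Pig, and how
you call them (for example from your own UDF, or in a preprocessing step
before the load) is left open. The sample string is written with Unicode
escapes so the source file itself stays ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Pig: percent-encodes strings so that
// only ASCII characters reach the text parser, and decodes them afterwards.
final class Utf8BagWorkaround {
    private static final String UTF8 = StandardCharsets.UTF_8.name();

    // Percent-encode the UTF-8 bytes of the input; the result is pure ASCII.
    static String encode(String raw) {
        try {
            return URLEncoder.encode(raw, UTF8);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM must support.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encode(): turn the percent-encoded form back into the original string.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, UTF8);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A six-letter Cyrillic word; each letter takes 2 bytes in UTF-8.
        String raw = "\u043f\u0440\u0438\u0432\u0435\u0442";
        String encoded = encode(raw);

        int rawBytes = raw.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("encoded form: " + encoded);
        System.out.println("size: " + rawBytes + " bytes -> " + encoded.length() + " bytes");
        System.out.println("round trip ok: " + decode(encoded).equals(raw));
    }
}
```

Every byte of a non-ASCII character becomes a three-character %XX
sequence, which is where the threefold growth comes from; plain ASCII
letters and digits pass through unchanged.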

2009/2/17 Santhosh Srinivasan <[EMAIL PROTECTED]>:
> This could be a bug in TextDataParser due to the presence of empty
> strings in the data.
>
> Santhosh
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> [EMAIL PROTECTED]
> Sent: Monday, February 16, 2009 6:05 AM
> To: [EMAIL PROTECTED]
> Subject: Error while loading UTF-8 strings into bags
>
> Hi,
>
> I'm trying to use utf-8 strings as follows:
>
> phrases = load 'phrases' as (data: chararray, f: int);
> a = group phrases by f;
> b = foreach a generate group as f, phrases.data as data;
> store b into 'grouped';
>
> b = load 'grouped' as (f: int, data: bag{t: tuple(data: chararray)});
> c = foreach b generate f, data;       -- just store in this sample
> store c into 'final';
>
> [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2043: Unexpected error during execution.
>
> org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 3.
>        at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:619)
>        at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:568)
>        at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:623)
>        at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:153)
>        at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:85)
>        at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:345)
>        at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
>        at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:71)
>        at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:79)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:908)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:244)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:198)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:226)
>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:187)
>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:203)
>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:194)
>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>        at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> Is there any way to use utf-8 strings in pig bags?
> Thanks.
>