Pig, mail # user - Weird bug of REPLACE


Re: Re: Weird bug of REPLACE
Bill Graham 2012-09-04, 18:03
Opened a JIRA to better clarify the docs here:
https://issues.apache.org/jira/browse/PIG-2905

On Tue, Sep 4, 2012 at 12:37 AM, MiaoMiao <[EMAIL PROTECTED]> wrote:

> Pity the documentation for REPLACE doesn't mention regex at all. Thank
> you so much for your reply; knowing what's going on is such
> a relief. Now I can trust myself with Pig a little more.
>
>
> On Sat, 18 Aug 2012 at 00:05:29 AM, Cheolsoo Park <[EMAIL PROTECTED]>
> wrote:
> > Hi,
>
> > If you look at the source code of REPLACE, what it does is basically:
>
> > String source = "[02/Aug/2012:05:01:17";
> > String target = "[";
> > String replaceWith = "";
> > return source.replaceAll(target, replaceWith);
>
>
> > Note that Java's String.replaceAll() treats target (the 2nd argument to
> > REPLACE) as a regular expression, and "[" is a special character there. To
> > match it literally, you have to escape it, so in your Pig script you should do:
>
> > REPLACE(date,'\\[','')
>
> > Now regarding the result that you're seeing, it looks like whatever
> > exception is thrown inside REPLACE is swallowed rather than failing the
> > job, and null is returned:
>
> >         try {
> >             ...
> >         } catch (Exception e) {
> >             warn("Failed to process input; error - " + e.getMessage(), PigWarning.UDF_WARNING_1);
> >             return null;
> >         }
>
>
> > But I do see the following message at the end of the job status:
>
> > 2012-08-17 16:51:25,061 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning UDF_WARNING_1 1 time(s)
>
> > I must admit that this is not very visible though.
>
> > Thanks,
> > Cheolsoo
>
> > On Mon, Aug 13, 2012 at 10:03 PM, MiaoMiao <[EMAIL PROTECTED]> wrote:
>
> > > I'm using Pig for an ETL job and ran into a strange bug in the
> > > built-in REPLACE function.
> > >
> > > After I replaced '[' with '' in '[02/Aug/2012:05:01:17', the whole
> > > string came back blank.
> > >
> > > Here is some info that may help debug.
> > >
> > > My pig version is: Apache Pig version 0.11.0-SNAPSHOT (r1364475)
> > > compiled Jul 23 2012, 10:30:53
> > >
> > > The original text file:
> > > ip.ip.ip.ip - - [02/Aug/2012:05:01:17 -0600] "GET
> > > /player.php/sid/XNDM0Njk3MjEy/v.swf HTTP/1.1" 302 26
> > >
> > > The whole pig script is :
> > > read = load '/home/test/apacheLog'
> > > using PigStorage(' ')
> > > as (
> > >           ip:chararray
> > >         , identity:chararray
> > >         , name:chararray
> > >         , date:chararray
> > >         , timezone:chararray
> > >         , method:chararray
> > >         , path:chararray
> > >         , protocol:chararray
> > >         , status:chararray
> > >         , size:chararray
> > > );
> > > dump read;
> > >
> > >
> > > --(ip.ip.ip.ip,-,-,[02/Aug/2012:05:01:17,-0600],"GET,/player.php/sid/XNDM0Njk3MjEy/v.swf,HTTP/1.1",302,26)
> > > data = foreach read generate
> > >           ip
> > >         , REPLACE(date,'[','')
> > >         , REPLACE(timezone,']','')
> > >         , REPLACE(method,'"','')
> > >         , path
> > >         , REPLACE(protocol,'"','')
> > >         , status
> > >         , size;
> > > describe data;
> > > --data: {ip: chararray,date: chararray,timezone: chararray,method:
> > > chararray,path: chararray,protocol: chararray,status: chararray,size:
> > > chararray}
> > > dump data;
> > >
> > >
> > > --(ip.ip.ip.ip,,-0600,GET,/player.php/sid/XNDM0Njk3MjEy/v.swf,HTTP/1.1,302,26)
> > >
>
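
For anyone who wants to reproduce the regex behaviour described in the quoted thread outside of Pig, here is a minimal plain-Java sketch. The class and method names (ReplaceRegexDemo, replaceAsRegex, replaceLiteral) are made up for illustration and are not part of Pig; only String.replaceAll() and Pattern.quote() are real JDK calls.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ReplaceRegexDemo {
    // Hypothetical helper: passes target straight to String.replaceAll(),
    // so target is compiled as a regular expression.
    public static String replaceAsRegex(String source, String target, String replaceWith) {
        return source.replaceAll(target, replaceWith);
    }

    // Hypothetical helper: quotes target first so it is matched literally.
    public static String replaceLiteral(String source, String target, String replaceWith) {
        return source.replaceAll(Pattern.quote(target), replaceWith);
    }

    public static void main(String[] args) {
        String source = "[02/Aug/2012:05:01:17";
        try {
            // A lone "[" opens a character class that is never closed.
            replaceAsRegex(source, "[", "");
        } catch (PatternSyntaxException e) {
            System.out.println("Unescaped '[' is not a valid regex: " + e.getDescription());
        }
        // Escaped bracket: the Java literal "\\[" is the regex \[
        System.out.println(replaceAsRegex(source, "\\[", ""));
        // Same result without hand-escaping.
        System.out.println(replaceLiteral(source, "[", ""));
    }
}
```

Pattern.quote() is only an option when you control the Java side; from a Pig script the escaped form '\\[' shown in the thread is the way to go.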

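The "exception is swallowed and null is returned" behaviour from the quoted thread can also be sketched in plain Java. This is only an approximation under stated assumptions: SwallowingReplace and its warnings list are hypothetical stand-ins, not Pig's actual REPLACE class or its warn() mechanism.

```java
import java.util.ArrayList;
import java.util.List;

public class SwallowingReplace {
    // Hypothetical stand-in for Pig's warning aggregation; not a Pig API.
    public static final List<String> warnings = new ArrayList<>();

    // Follows the try/catch shape quoted in the thread: any exception is
    // recorded as a warning and the caller gets null instead of a failure.
    public static String replace(String source, String target, String replaceWith) {
        try {
            return source.replaceAll(target, replaceWith);
        } catch (Exception e) {
            warnings.add("Failed to process input; error - " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        // Invalid regex: no exception reaches the caller, the result is null.
        String bad = replace("[02/Aug/2012:05:01:17", "[", "");
        System.out.println("unescaped result is null: " + (bad == null));
        System.out.println("warnings recorded: " + warnings.size());
        // Valid regex: the bracket is removed as intended.
        System.out.println(replace("[02/Aug/2012:05:01:17", "\\[", ""));
    }
}
```

A null result like this would be consistent with the empty date field in the dump output quoted above, with the only hint being the warning count at the end of the job.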
--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*