-
Pig FILTER with INDEXOF not working
Steve Watt 2011-04-22, 21:25
Hi Folks
I've loaded a dataset and I am attempting to filter out unwanted records by checking that one of my tuple fields contains a particular string. I've distilled this issue down to the sample excite.log that ships with Pig for easy recreation. I've read through the INDEXOF code and I think this should work (lots of the queries contain the word yahoo), but the dump of my queries relation always contains zero records. Can anyone tell me what I am doing wrong?
raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time, query);
queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0);
dump queries;
Regards Steve Watt
-
Re: Pig FILTER with INDEXOF not working
Richard Ding 2011-04-22, 22:17
raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time, query:chararray);
queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0);
dump queries;
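The reply's script differs from the original in two places: query is declared as chararray, and the comparison is >= 0 rather than > 0. The second change can be illustrated in plain Java, assuming Pig's INDEXOF follows the same convention as java.lang.String.indexOf (index of the first match, -1 when there is none). The class name and the sample queries below are made up for illustration and are not taken from excite.log.

```java
class IndexOfFilterDemo {

    // Predicate from the original question: true only when the match
    // starts somewhere after position 0.
    static boolean strictlyPositive(String field, String needle) {
        return field.indexOf(needle) > 0;
    }

    // Predicate from the reply: true whenever the needle occurs at all,
    // including at the very start of the field.
    static boolean nonNegative(String field, String needle) {
        return field.indexOf(needle) >= 0;
    }

    public static void main(String[] args) {
        // Hypothetical sample queries, chosen to hit each case.
        String[] queries = { "yahoo mail", "free yahoo games", "weather" };
        for (String q : queries) {
            System.out.println(q + " -> index " + q.indexOf("yahoo")
                    + ", > 0: " + strictlyPositive(q, "yahoo")
                    + ", >= 0: " + nonNegative(q, "yahoo"));
        }
    }
}
```

Under that assumption, a query that begins with the search term has index 0, so the > 0 form drops it while the >= 0 form keeps it.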
-
Re: Pig FILTER with INDEXOF not working
Steve Watt 2011-04-23, 00:30
Richard, if you're coming to OSCON or Hadoop Summit, please let me know so I can buy you a beer. Thanks for the help. This now works with the excite log using PigStorage().
It is however still not working with my custom LoadFunc and data. For reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache Nutch Segments that reads in each page that is crawled and represents it as a Tuple of (Url, ContentType, PageContent) as shown in the script below:
webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data' using com.hp.demo.SegmentLoader() AS (url:chararray, type:chararray, content:chararray);
companies = FILTER webcrawl BY (INDEXOF(url,'comp') >= 0);
dump companies;
This keeps failing with ERROR 1071: Cannot convert a generic_writablecomparable to a String. However, if I change the script to the following (remove the schema types and dump straight after the load), it works:
webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data' using com.hp.demo.SegmentLoader() AS (url, type, content);
dump webcrawl;
Clearly, as soon as I inject types into the Load Schema it starts bombing. Can anyone tell me what I am doing wrong? I have attached my Nutch LoadFunc below for reference:
public class SegmentLoader extends FileInputLoadFunc {

    private SequenceFileRecordReader<WritableComparable, Content> reader;
    protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @SuppressWarnings("unchecked")
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new SequenceFileInputFormat<WritableComparable, Content>();
    }

    @SuppressWarnings("unchecked")
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = (SequenceFileRecordReader) reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;
            }
            Content value = ((Content) reader.getCurrentValue());
            String url = value.getUrl();
            String type = value.getContentType();
            String content = value.getContent().toString();
            Tuple tuple = TupleFactory.getInstance().newTuple(3);
            tuple.set(0, new DataByteArray(url));
            tuple.set(1, new DataByteArray(type));
            tuple.set(2, new DataByteArray(content));
            return tuple;
        } catch (InterruptedException e) {
            throw new ExecException(e);
        }
    }
}
-
Re: Pig FILTER with INDEXOF not working
Dmitriy Ryaboy 2011-04-23, 01:06
If the expected return type of your loader is (String, String, String), you should just put Strings into the tuple (no conversion to DataByteArrays) and report your schema to Pig via an implementation of LoadMetadata.getSchema().
D
-
Re: Pig FILTER with INDEXOF not working
Aniket Mokashi 2011-04-23, 01:07
I think the fix is to change tuple.set(0, new DataByteArray(url)); to tuple.set(0, url);
Thanks, Aniket
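The suggested change rests on chararray corresponding to java.lang.String in Pig, while DataByteArray is the bytearray type. The sketch below models that mismatch in plain Java without any Pig classes. ByteWrapper, indexOfField, castFails and the example URL are hypothetical stand-ins, and the cast inside indexOfField is only an assumption about how a consumer of a chararray field might behave, not a copy of Pig's code.

```java
import java.nio.charset.StandardCharsets;

class ChararrayFieldDemo {

    // Hypothetical stand-in for a byte-array wrapper such as Pig's DataByteArray.
    static final class ByteWrapper {
        final byte[] data;

        ByteWrapper(String s) {
            this.data = s.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Hypothetical consumer of a field declared as chararray: it assumes the
    // slot already holds a java.lang.String and casts accordingly.
    static int indexOfField(Object[] tuple, int pos, String needle) {
        String field = (String) tuple[pos]; // ClassCastException if not a String
        return field.indexOf(needle);
    }

    // Returns true when reading the field as a String fails with a ClassCastException.
    static boolean castFails(Object[] tuple, int pos) {
        try {
            indexOfField(tuple, pos, "");
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String url = "http://company.example/"; // made-up URL for illustration
        Object[] wrapped = { new ByteWrapper(url) }; // like tuple.set(0, new DataByteArray(url))
        Object[] plain = { url };                    // like tuple.set(0, url)

        System.out.println(castFails(wrapped, 0));          // true
        System.out.println(indexOfField(plain, 0, "comp")); // 7
    }
}
```

In the loader from the question, the same reasoning would apply to all three fields (url, type, content), not only the first.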