Re: Pig FILTER with INDEXOF not working
I think the fix is to change

tuple.set(0, new DataByteArray(url));

to

tuple.set(0, url);
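
For what it's worth, here is a minimal sketch of getNext() with that change applied. Extending the same change to the type and content fields, and converting the Nutch Content bytes with new String(...), are my assumptions rather than something verified against your data:

@Override
public Tuple getNext() throws IOException {
    try {
        if (!reader.nextKeyValue()) {
            return null;
        }
        Content value = (Content) reader.getCurrentValue();
        Tuple tuple = TupleFactory.getInstance().newTuple(3);
        // Plain Java Strings already match the declared chararray schema,
        // so Pig has nothing to cast at FILTER time.
        tuple.set(0, value.getUrl());
        tuple.set(1, value.getContentType());
        tuple.set(2, new String(value.getContent())); // assumes getContent() returns the raw page bytes
        return tuple;
    } catch (InterruptedException e) {
        throw new ExecException(e);
    }
}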

Thanks,
Aniket

On Fri, April 22, 2011 8:30 pm, Steve Watt wrote:
> Richard, if you're coming to OSCON or Hadoop Summit, please let me know
> so I can buy you a beer. Thanks for the help. This now works with the
> excite log using PigStorage().
>
> It is however still not working with my custom LoadFunc and data. For
> reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache
>  Nutch Segments that reads in each page that is crawled and represents it
> as a Tuple of (Url, ContentType, PageContent) as shown in the script
> below:
>
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
>     using com.hp.demo.SegmentLoader()
>     AS (url:chararray, type:chararray, content:chararray);
> companies = FILTER webcrawl BY (INDEXOF(url, 'comp') >= 0);
> dump companies;
>
> This keeps failing with ERROR 1071: Cannot convert a
> generic_writablecomparable to a String. However, if I change the script to
> the following (remove the schema types and do a straight dump after load), it works:
>
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
>     using com.hp.demo.SegmentLoader() AS (url, type, content);
> dump webcrawl;
>
>
> Clearly, as soon as I inject types into the Load Schema it starts
> bombing. Can anyone tell me what I am doing wrong? I have attached my
> Nutch LoadFunc
> below for reference:
>
> import java.io.IOException;
>
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> import org.apache.hadoop.io.WritableComparable;
> import org.apache.hadoop.mapreduce.InputFormat;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;
> import org.apache.nutch.protocol.Content;
> import org.apache.pig.FileInputLoadFunc;
> import org.apache.pig.backend.executionengine.ExecException;
> import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
> import org.apache.pig.data.DataByteArray;
> import org.apache.pig.data.Tuple;
> import org.apache.pig.data.TupleFactory;
>
> public class SegmentLoader extends FileInputLoadFunc {
>
>     private SequenceFileRecordReader<WritableComparable, Content> reader;
>     protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);
>
>     @Override
>     public void setLocation(String location, Job job) throws IOException {
>         FileInputFormat.setInputPaths(job, location);
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public InputFormat getInputFormat() throws IOException {
>         return new SequenceFileInputFormat<WritableComparable, Content>();
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
>         this.reader = (SequenceFileRecordReader) reader;
>     }
>
>     @Override
>     public Tuple getNext() throws IOException {
>         try {
>             if (!reader.nextKeyValue()) {
>                 return null;
>             }
>             Content value = (Content) reader.getCurrentValue();
>             String url = value.getUrl();
>             String type = value.getContentType();
>             String content = value.getContent().toString();
>             // Each crawled page becomes a 3-field tuple: (url, type, content)
>             Tuple tuple = TupleFactory.getInstance().newTuple(3);
>             tuple.set(0, new DataByteArray(url));
>             tuple.set(1, new DataByteArray(type));
>             tuple.set(2, new DataByteArray(content));
>             return tuple;
>         } catch (InterruptedException e) {
>             throw new ExecException(e);
>         }
>     }
> }
>
>
>
> On Fri, Apr 22, 2011 at 5:17 PM, Richard Ding <[EMAIL PROTECTED]>
> wrote:
>
>
>> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t')
>>     AS (user, time, query:chararray);
>>
>> queries = FILTER raw BY (INDEXOF(query, 'yahoo') >= 0);
>> dump queries;
>>
>>
>> On 4/22/11 2:25 PM, "Steve Watt" <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hi Folks
>>
>>
>> I've done a load of a dataset and I am attempting to filter out
>> unwanted records by checking that one of my tuple fields contains a
>> particular string. I've distilled this issue down to the sample
>> excite.log that ships with Pig for easy recreation. I've read through
>> the INDEXOF code and I think this should work (lots of queries that
>> contain the word yahoo) but my queries dump always contains zero
>> records. Can anyone tell me what I am doing wrong?
>>
>> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time, query);
>> queries = FILTER raw BY (INDEXOF(query, 'yahoo') > 0);
>> dump queries;
>>
>> Regards
>> Steve Watt
>>
>>
>>
>