Re: Pig FILTER with INDEXOF not working
I think the fix is to change

tuple.set(0, new DataByteArray(url));

to

tuple.set(0, url);

Thanks,
Aniket
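
For reference, a minimal sketch of the getNext() from the SegmentLoader quoted below with that change applied; it is extended to all three fields on the assumption that type and content need the same treatment, since the load statement declares them as chararray as well. Only the tuple.set(...) calls differ from Steve's original:

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;
            }
            Content value = (Content) reader.getCurrentValue();
            Tuple tuple = TupleFactory.getInstance().newTuple(3);
            // Set plain String values (Pig chararray) instead of wrapping them
            // in DataByteArray, so no bytearray-to-chararray cast is needed
            // when the load schema declares the fields as chararray.
            tuple.set(0, value.getUrl());
            tuple.set(1, value.getContentType());
            tuple.set(2, value.getContent().toString());
            return tuple;
        } catch (InterruptedException e) {
            throw new ExecException(e);
        }
    }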

On Fri, April 22, 2011 8:30 pm, Steve Watt wrote:
> Richard, if you're coming to OSCON or Hadoop Summit, please let me know
> so I can buy you a beer. Thanks for the help. This now works with the
> excite log using PigStorage().
>
> It is however still not working with my custom LoadFunc and data. For
> reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache
>  Nutch Segments that reads in each page that is crawled and represents it
> as a Tuple of (Url, ContentType, PageContent) as shown in the script
> below:
>
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
>     using com.hp.demo.SegmentLoader()
>     AS (url:chararray, type:chararray, content:chararray);
> companies = FILTER webcrawl BY (INDEXOF(url,'comp') >= 0);
> dump companies;
>
> This keeps failing with ERROR 1071: Cannot convert a
> generic_writablecomparable to a String. However, if I change the script
> to the following (remove the schema types and do a straight dump after
> load), it works:
>
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
>     using com.hp.demo.SegmentLoader() AS (url, type, content);
> dump webcrawl;
>
>
> Clearly, as soon as I inject types into the load schema it starts
> bombing. Can anyone tell me what I am doing wrong? I have attached my
> Nutch LoadFunc below for reference:
>
> public class SegmentLoader extends FileInputLoadFunc {
>
>     private SequenceFileRecordReader<WritableComparable, Content> reader;
>     protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);
>
>     @Override
>     public void setLocation(String location, Job job) throws IOException {
>         FileInputFormat.setInputPaths(job, location);
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public InputFormat getInputFormat() throws IOException {
>         return new SequenceFileInputFormat<WritableComparable, Content>();
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
>         this.reader = (SequenceFileRecordReader) reader;
>     }
>
>     @Override
>     public Tuple getNext() throws IOException {
>         try {
>             if (!reader.nextKeyValue()) {
>                 return null;
>             }
>             Content value = (Content) reader.getCurrentValue();
>             String url = value.getUrl();
>             String type = value.getContentType();
>             String content = value.getContent().toString();
>             Tuple tuple = TupleFactory.getInstance().newTuple(3);
>             tuple.set(0, new DataByteArray(url));
>             tuple.set(1, new DataByteArray(type));
>             tuple.set(2, new DataByteArray(content));
>             return tuple;
>         } catch (InterruptedException e) {
>             throw new ExecException(e);
>         }
>     }
> }
>
>
> On Fri, Apr 22, 2011 at 5:17 PM, Richard Ding <[EMAIL PROTECTED]>
> wrote:
>
>
>> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t')
>>     AS (user, time, query:chararray);
>> queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0);
>> dump queries;
>>
>>
>> On 4/22/11 2:25 PM, "Steve Watt" <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hi Folks
>>
>>
>> I've done a load of a dataset and I am attempting to filter out
>> unwanted records by checking that one of my tuple fields contains a
>> particular string. I've distilled this issue down to the sample
>> excite.log that ships with Pig for easy recreation. I've read through
>> the INDEXOF code and I think this should work (lots of queries that
>> contain the word yahoo) but my queries dump always contains zero
>> records. Can anyone tell me what I am doing wrong?
>>
>> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t')
>>     AS (user, time, query);
>> queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0);
>> dump queries;
>>
>> Regards
>> Steve Watt
>>
>>
>>
>