Pig >> mail # user >> Filter on contents of other dataset


Aniket Mokashi 2011-04-15, 03:21
Mridul Muralidharan 2011-04-15, 03:29
Re: Filter on contents of other dataset
Thanks Mridul,

(Although small might grow bigger.) For instance, let's say small is
small enough to fit in memory and is stored in a local file.

When does my UDF load the data from the file? Earlier, I wrote a bag
loader that returns a bag of the small data (e.g., load 'smalldata' using
BagLoader() as (smallbag)). But then I had to write CONTAINSBAG(hdata,
smallbag) to make this work.

I think your solution would solve my problem, but how do I make my UDF
read a file? Can you give me some pointers?

Thanks,
Aniket
On Thu, April 14, 2011 11:29 pm, Mridul Muralidharan wrote:
>

> The way you described it, it does look like an application of cross.
>
>
> How 'small' is small ?
> If it is pretty small, you can avoid the shuffle/reduce phase and
> directly stream huge through a udf which does a task local cross with
> 'small' (assuming it fits in memory).
>
>
>
> DEFINE my_udf MYUDF('smalldata');
>
>
> huge = load 'mydata' as (hkey:chararray, hdata:chararray);
> filtered = FILTER huge BY my_udf(hkey, hdata);
>
>
>
>
> Where my_udf returns true if there exists some skey in smalldata for
> which F(hdata, skey) is true - as you defined.
>
>
> Regards,
> Mridul
>
>
> On Friday 15 April 2011 08:51 AM, Aniket Mokashi wrote:
>
>> Hi,
>>
>>
>> What would be the best way to write this script?
>> I have two datasets - huge (hkey, hdata), small(skey). I want to filter
>> all the data from huge dataset for which F(hdata, skey) is true. Please
>> advise.
>>
>> For example,
>> huge = load 'mydata' as (key:chararray, value:chararray);
>> small = load 'smalldata' as (skey:chararray);
>> h_s_cross = cross huge, small;
>> filtered = foreach h_s_cross generate CONTAINS(value, skey);
>>
>>
>> Thanks,
>> Aniket
>>
>>
>
>
>
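The approach discussed in this thread (a filter UDF that takes the path of the small dataset as a constructor argument, loads it into memory once, and then checks each record of huge against it) might look roughly like the sketch below. This is only an illustration in plain Java, not code from the thread: the class name SmallDataFilter, the method name accept, the one-key-per-line file layout, and the use of substring containment for F(hdata, skey) are all assumptions. In an actual Pig UDF the same logic would live in the exec(Tuple) method of a class extending org.apache.pig.FilterFunc, and the file would have to be readable on every task node (for example via Hadoop's distributed cache), which is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Pig filter UDF; names are illustrative only.
class SmallDataFilter {
    private final String smallDataPath;
    // Loaded lazily on the first call, so the file is read once per task
    // rather than once per record.
    private List<String> smallKeys;

    // Mirrors DEFINE my_udf MYUDF('smalldata'): the path arrives as a
    // constructor argument.
    SmallDataFilter(String smallDataPath) {
        this.smallDataPath = smallDataPath;
    }

    // Assumes one skey per line; blank lines are skipped.
    private void loadSmallKeys() throws IOException {
        List<String> keys = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(smallDataPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    keys.add(line);
                }
            }
        }
        smallKeys = keys;
    }

    // Returns true if some skey satisfies F(hdata, skey).
    // F is assumed here to be substring containment.
    boolean accept(String hkey, String hdata) throws IOException {
        if (smallKeys == null) {
            loadSmallKeys();
        }
        if (hdata == null) {
            return false;
        }
        for (String skey : smallKeys) {
            if (hdata.contains(skey)) {
                return true;
            }
        }
        return false;
    }
}
```

The lazy load answers the "when does my UDF load the data" question in this sketch: on the first record the task sees. Because the check is a linear scan over the small keys for every record, it is only reasonable while small stays small enough to fit comfortably in memory.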
Mridul Muralidharan 2011-04-15, 03:44
Alan Gates 2011-04-15, 16:13