Pig, mail # user - Filter on contents of other dataset


Re: Filter on contents of other dataset
Aniket Mokashi 2011-04-15, 03:40
Thanks Mridul,

Small might grow bigger later, but for now let's say it is small enough to
fit in memory and is stored in a local file.

When does my UDF load the data from the file? Earlier, I wrote a bag
loader that returns the small data as a bag (e.g. load 'smalldata' using
BagLoader() as (smallbag)). But then I had to write CONTAINSBAG(hdata,
smallbag) to make this work.

I think your solution would solve my problem, but how do I make my UDF
read the file? Can you give me some pointers?

Thanks,
Aniket
On Thu, April 14, 2011 11:29 pm, Mridul Muralidharan wrote:
>

> The way you described it, it does look like an application of cross.
>
>
> How 'small' is small ?
> If it is pretty small, you can avoid the shuffle/reduce phase and
> directly stream huge through a udf which does a task local cross with
> 'small' (assuming it fits in memory).
>
>
>
> DEFINE my_udf MYUDF('smalldata');
>
>
> huge = load 'mydata' as (hkey:chararray, hdata:chararray);
> filtered = FILTER huge BY my_udf(hkey, hdata);
>
>
>
>
> Where my_udf returns true if there exists some skey in smalldata for
> which F(hdata, skey) is true - as you defined.
>
>
> Regards,
> Mridul
>
>
> On Friday 15 April 2011 08:51 AM, Aniket Mokashi wrote:
>
>> Hi,
>>
>>
>> What would be the best way to write this script?
>> I have two datasets - huge (hkey, hdata), small(skey). I want to filter
>> all the data from huge dataset for which F(hdata, skey) is true. Please
>> advise.
>>
>> For example,
>> huge = load 'mydata' as (key:chararray, value:chararray);
>> small = load 'smalldata' as skey:chararray;
>> h_s_cross = cross huge, small;
>> filtered = foreach h_s_cross generate CONTAINS(value, skey);
>>
>>
>> Thanks,
>> Aniket
>>
>>
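
One way to approach the "how do I make my UDF read the file" question in this thread is to load the small dataset lazily, on the first call, and keep it in memory for the rest of the task. The sketch below is plain Java with no Pig dependency, so it is only an illustration of that idea. SmallDataMatcher and matches are hypothetical names, F is assumed to be substring containment (as in the CONTAINS example), and the file is assumed to hold one skey per line and to be readable at the given path on the machine running the task. In an actual Pig UDF, this logic would presumably sit inside a class extending org.apache.pig.FilterFunc, with the path passed through the constructor (as in MYUDF('smalldata')) and matches called from exec(Tuple). Getting the file onto every task node is not shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: keeps the 'small' keys in memory.
class SmallDataMatcher {
    private final String path;   // location of the small dataset (assumed local)
    private List<String> keys;   // null until the first call to matches()

    SmallDataMatcher(String path) {
        this.path = path;
    }

    // Read the file once; later calls reuse the in-memory list.
    private void loadIfNeeded() throws IOException {
        if (keys != null) {
            return;
        }
        List<String> loaded = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String key = line.trim();
            if (!key.isEmpty()) {
                loaded.add(key);
            }
        }
        keys = loaded;
    }

    // True if some skey satisfies F(hdata, skey); F is assumed to be
    // substring containment here.
    boolean matches(String hdata) throws IOException {
        if (hdata == null) {
            return false;
        }
        loadIfNeeded();
        for (String skey : keys) {
            if (hdata.contains(skey)) {
                return true;
            }
        }
        return false;
    }
}
```

Loading on the first call rather than in the constructor means the file is read where the record processing happens, not where the script is parsed. Each record then costs one pass over the in-memory keys, which is the task-local cross described above and only makes sense while small really does fit in memory.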