Filter on contents of other dataset
Aniket Mokashi 2011-04-15, 03:21
Hi,
What would be the best way to write this script? I have two datasets - huge (hkey, hdata), small(skey). I want to filter all the data from huge dataset for which F(hdata, skey) is true. Please advise.
For example,
huge = load 'mydata' as (key:chararray, value:chararray);
small = load 'smalldata' as skey:chararray;
h_s_cross = cross huge, small;
filtered = foreach h_s_cross generate CONTAINS(value, skey);
Thanks, Aniket
Re: Filter on contents of other dataset
Mridul Muralidharan 2011-04-15, 03:29
The way you described it, it does look like an application of cross.
How 'small' is small? If it is pretty small, you can avoid the shuffle/reduce phase and directly stream huge through a udf which does a task-local cross with 'small' (assuming it fits in memory).
DEFINE my_udf MYUDF('smalldata');
huge = load 'mydata' as (hkey:chararray, hdata:chararray);
filtered = FILTER huge BY my_udf(hkey, hdata);
Where my_udf returns true if there exists some skey in smalldata for which F(hdata, skey) is true - as you defined. Regards, Mridul
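The matching logic described above can be sketched in plain Java, without any Pig dependency. This is only an illustration under stated assumptions: the class name SmallSetMatcher and its methods are hypothetical, and F is passed in as a predicate (substring containment in the demo, mirroring CONTAINS in the original question). In an actual Pig filter UDF, this logic would live inside the UDF's exec method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical helper: a task-local "cross" of one huge record against all small keys. */
final class SmallSetMatcher {
    private final List<String> smallKeys;          // the whole small dataset, held in memory
    private final BiPredicate<String, String> f;   // F(hdata, skey)

    SmallSetMatcher(List<String> smallKeys, BiPredicate<String, String> f) {
        this.smallKeys = new ArrayList<>(smallKeys); // defensive copy
        this.f = f;
    }

    /** Returns true if some skey satisfies F(hdata, skey); stops at the first match. */
    boolean matches(String hdata) {
        if (hdata == null) {
            return false; // treat a missing value as "no match"
        }
        for (String skey : smallKeys) {
            if (f.test(hdata, skey)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Demo with F = "hdata contains skey"
        SmallSetMatcher m = new SmallSetMatcher(Arrays.asList("foo", "bar"), String::contains);
        System.out.println(m.matches("xxfooxx"));
        System.out.println(m.matches("nothing here"));
    }
}
```

Each call costs one pass over the small dataset, so the total work is the same as the cross, but no shuffle is needed because every task holds its own copy of the small keys.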
Re: Filter on contents of other dataset
Aniket Mokashi 2011-04-15, 03:40
Thanks Mridul,
(Although small might grow bigger.) For instance, let's take small to be small enough to fit in memory and stored in a local file.
When does my udf load the data from the file? Earlier, I wrote a bag loader that returns a bag of the small data (e.g. load 'smalldata' using BagLoader() as (smallbag)). But then I had to write CONTAINSBAG(hdata, smallbag) to make this work.
I think your solution would solve my problem, but how do I make my udf read the file? Can you give me some pointers?
Thanks, Aniket
Re: Filter on contents of other dataset
Mridul Muralidharan 2011-04-15, 03:44
You could either distribute the small file using the distributed cache, in which case you can use the plain file API to load its content from the task's local disk, or use the HDFS APIs directly to load it from each task. Usually the distributed cache should work better, but ymmv! Regards, Mridul
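As for when the file gets read: one common pattern is to load it lazily on the first record and reuse the result for every later record in the same task. The sketch below assumes the small file is already available at a local path (for example, after being localized by the distributed cache) and has one key per line. The class LazyKeyFile and its methods are hypothetical names, and the code uses only the standard Java file API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reads the small key file once, on first use. */
final class LazyKeyFile {
    private final Path path;
    private List<String> keys; // stays null until the first call to keys()

    LazyKeyFile(String localPath) {
        this.path = Paths.get(localPath);
    }

    /** Loads the file on the first call and returns the cached list afterwards. */
    synchronized List<String> keys() {
        if (keys == null) {
            List<String> loaded = new ArrayList<>();
            try {
                for (String line : Files.readAllLines(path, StandardCharsets.UTF_8)) {
                    String key = line.trim();
                    if (!key.isEmpty()) {
                        loaded.add(key); // skip blank lines
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            keys = loaded;
        }
        return keys;
    }

    /** Example F: true if hdata contains any key from the file. */
    boolean containsAnyKey(String hdata) {
        for (String skey : keys()) {
            if (hdata.contains(skey)) {
                return true;
            }
        }
        return false;
    }
}
```

Loading in the constructor would also work in this standalone sketch; deferring the read to the first record simply avoids touching the file anywhere the object is constructed but never used.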
Re: Filter on contents of other dataset
Alan Gates 2011-04-15, 16:13
Is your comparison function equals, or is there some transformation that could be applied to hdata and skey so that it could be equals? If so, you could use a semi join instead, which should be much more efficient.
Alan.
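To illustrate why an equality comparison helps: once F reduces to "some key derived from hdata equals skey", each huge record needs a single hash lookup instead of a scan over every small key. The following is a plain-Java sketch of that idea, not Pig's join implementation; the class SemiJoinSketch, the method semiJoin, and the keyOf transformation are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: a semi join keeps huge rows whose derived key appears in small. */
final class SemiJoinSketch {
    static <H> List<H> semiJoin(List<H> huge,
                                Function<H, String> keyOf,
                                Set<String> smallKeys) {
        List<H> kept = new ArrayList<>();
        for (H row : huge) {
            // One hash lookup per row, independent of the size of small
            if (smallKeys.contains(keyOf.apply(row))) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> small = new HashSet<>(Arrays.asList("a", "b"));
        List<String> huge = Arrays.asList("A", "c", "b");
        // The transformation here is lower-casing, so "A" is compared as "a"
        System.out.println(semiJoin(huge, String::toLowerCase, small));
    }
}
```

Only rows from the huge side are returned, and each appears at most once, which is what distinguishes a semi join from an ordinary join.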