Pig >> mail # user >> Best Practice: lookup table

Markus Resch 2012-03-27, 14:30
Re: Best Practice: lookup table
Hi Markus,

I would start with a "replicated" join:

join InputTable by BrowserId, BrowserLookup by Id USING 'replicated';

The idea is to perform a map-side join by loading the smaller
relation, in this case BrowserLookup, into memory.
If all you're doing is a lookup, then the replicated join is likely to
be more efficient than dispatching a UDF call for each row.
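
To put that join in context, here is a minimal sketch of a complete script. The file paths, the Hits column, the tab delimiter, and the aliases Joined and Result are my assumptions for illustration; only InputTable, BrowserLookup, BrowserId and Id come from your mail.

```pig
-- Large relation: one row per event, keyed by browser id (path and schema are hypothetical).
InputTable = LOAD 'input/browser_hits' USING PigStorage('\t')
             AS (BrowserId:int, Hits:long);

-- Small lookup relation: id -> human-readable name (path is hypothetical).
BrowserLookup = LOAD 'input/browser_lookup' USING PigStorage('\t')
                AS (Id:int, Name:chararray);

-- Replicated join: the large relation is listed first, the small one last.
-- The small relation is the one loaded into memory on the map side.
Joined = JOIN InputTable BY BrowserId, BrowserLookup BY Id USING 'replicated';

-- Keep the id, the looked-up name, and the remaining columns.
Result = FOREACH Joined GENERATE
             InputTable::BrowserId AS BrowserId,
             BrowserLookup::Name   AS BrowserName,
             InputTable::Hits      AS Hits;

STORE Result INTO 'output/browser_hits_named';
```

Note that this is an inner join, so rows whose BrowserId has no entry in BrowserLookup are dropped. If you need to keep them, a replicated join also supports LEFT OUTER (JOIN InputTable BY BrowserId LEFT OUTER, BrowserLookup BY Id USING 'replicated'), in which case the name comes back as null. The lookup relation has to be small enough to fit in memory, otherwise the job will fail.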

On Tue, Mar 27, 2012 at 10:30 AM, Markus Resch <[EMAIL PROTECTED]> wrote:
> Hi Folks,
> I have another simple question where I need your experience:
> I have a tuple containing, let's say, browser ids.
> Another table maps those ids to the names of those browsers as
> human-readable strings
> (e.g. 1=Firefox, 2=Iceweasel, 3=Blue\ Monster, ...)
> I would like to know what would be the best practice to perform this
> lookup:
> My first idea was to write a UDF which would do something like
>        String getBrowser(int)
> and call it within a foreach.
> Another approach would be to do a
> join InputTable by BrowserId, BrowserLookup by Id;
> Do you have other ideas? I read about "distributed caches". Do they make
> sense in this context, and how do I use them?
> Thanks a lot
> Markus