Re: Pig Distributed Cache
There are several different small lookup files, so I need to process each
lookup file separately. Following your example, it could work like this:

a = LOAD 'small1'; -- for example taking source_id=1 --> then find source_name
d = LOAD 'small2'; -- for example taking campaign_id=2 --> then find campaign_name
e = LOAD 'small3'; -- for example taking offer_id=3 --> then find offer_name
b = LOAD 'big';
c = JOIN b BY 1, a BY 1 USING 'replicated';
f = JOIN c BY 1, d BY 1 USING 'replicated';
g = JOIN f BY 1, e BY 1 USING 'replicated';
dump g;

small1, small2 and small3 are different files, so they store different rows.
At the end of the process I need to attach them to all rows in my big file.
I know HDFS doesn't perform well with small files, but the data originally
lives in a different environment; I pull it from there and load it into HDFS.
Anyway, because of our architecture I can't change that right now.
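
If those lookup files carry real join keys, a keyed version of the same
chained replicated joins might look like the sketch below (the schemas used
here, source_id/campaign_id/offer_id plus the name columns, are just for
illustration, not the actual file layouts):

-- Illustrative schemas only; the big relation is listed first in each JOIN
-- so the small relation stays the replicated (in-memory) side.
src = LOAD 'small1' AS (source_id:int, source_name:chararray);
cmp = LOAD 'small2' AS (campaign_id:int, campaign_name:chararray);
off = LOAD 'small3' AS (offer_id:int, offer_name:chararray);
big = LOAD 'big' AS (source_id:int, campaign_id:int, offer_id:int, value:chararray);
j1 = JOIN big BY source_id, src BY source_id USING 'replicated';
j2 = JOIN j1 BY big::campaign_id, cmp BY campaign_id USING 'replicated';
j3 = JOIN j2 BY big::offer_id, off BY offer_id USING 'replicated';
DUMP j3;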
Thanks
Best regards...
On Tue, Nov 5, 2013 at 7:43 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> CROSS is grossly expensive to compute, so I'm not surprised that the
> performance isn't good enough. Are you repeating your LOAD and FILTER ops
> for every one of your small files? At the end of the day, what is it that
> you're trying to accomplish? Find the one row you're after and attach it to
> all rows in your big file?
>
> In terms of using DistributedCache, if you’re computing the cross product
> of two (and no more than two) relations, AND one of the relations is small
> enough to fit in memory, you can use a replicated JOIN instead which would
> be much more performant.
>
> A = LOAD 'small';
> B = LOAD 'big';
> C = JOIN B BY 1, A BY 1 USING 'replicated';
> dump C;
>
> Note that the smaller relation that will be loaded into memory needs to be
> specified second in the JOIN statement.
>
> Also keep in mind that HDFS doesn't perform well with lots of small files.
> If your design has (lots of) small files, you might benefit from loading
> that data into some database (e.g. HBase).
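>
> For instance, a minimal sketch of reading such lookup data back out of an
> HBase table with Pig's built-in HBaseStorage loader (the table name
> 'lookups' and the column family/qualifiers below are made up for the
> example):
>
> lkp = LOAD 'hbase://lookups'
>       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>           'info:source_name info:campaign_name', '-loadKey true')
>       AS (rowkey:chararray, source_name:chararray, campaign_name:chararray);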
>
>
> On Tue, Nov 5, 2013 at 7:29 AM, burakkk <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> > I'm using Pig 0.8.1-cdh3u5. Is there any way to use the distributed cache
> > inside Pig?
> >
> > My problem is this: I have lots of small files in HDFS, let's say 10 files.
> > Each file contains more than one row, but I need only one row from each,
> > and there isn't any relationship between the files. So I filter each one
> > for what I need and then join them without any join key (a cross join).
> > This is my workaround solution:
> >
> > a = LOAD 'smallFile1' AS (myrow:chararray);   -- ex: row count: 1000
> > b = FILTER a BY myrow == 'filter by exp1';
> > c = LOAD 'smallFile2' AS (myrow2:chararray);  -- ex: row count: 30000
> > d = FILTER c BY myrow2 == 'filter by exp2';
> > e = CROSS b, d;
> > ...
> > f = LOAD 'bigFile';                           -- ex: row count: 50 million
> > g = CROSS e, f;
> >
> > But its performance isn't good enough. So if I could use the distributed
> > cache inside a Pig script, I could keep the files that I first read and
> > filter as in-memory lookups. What is your suggestion? Is there any other,
> > more efficient way to do it?
> >
> > Thanks
> > Best regards...
> >
> >
> > --
> >
> > *BURAK ISIKLI* | *http://burakisikli.wordpress.com*
> >
>

--

*BURAK ISIKLI* | *http://burakisikli.wordpress.com*