
Pig user mailing list - Pig Distributed Cache


Re: Pig Distributed Cache
Pradeep Gollakota 2013-11-05, 19:50
I see... do you have to do a full cross product or are you able to do a
join?
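
For illustration, here is a minimal sketch of what keyed replicated joins against the lookup files could look like. The schemas below (source_id/source_name, campaign_id/campaign_name, offer_id/offer_name, payload) are assumptions for the sketch, not something confirmed in this thread:

-- Hypothetical schemas; the field names are assumptions, adjust to the real data.
sources   = LOAD 'small1' AS (source_id:int,   source_name:chararray);
campaigns = LOAD 'small2' AS (campaign_id:int, campaign_name:chararray);
offers    = LOAD 'small3' AS (offer_id:int,    offer_name:chararray);
big       = LOAD 'big'    AS (source_id:int, campaign_id:int, offer_id:int, payload:chararray);

-- The small relation is listed second so it is the one replicated into memory.
j1 = JOIN big BY source_id, sources BY source_id USING 'replicated';
t1 = FOREACH j1 GENERATE big::campaign_id AS campaign_id, big::offer_id AS offer_id,
                         big::payload AS payload, sources::source_name AS source_name;

j2 = JOIN t1 BY campaign_id, campaigns BY campaign_id USING 'replicated';
t2 = FOREACH j2 GENERATE t1::offer_id AS offer_id, t1::payload AS payload,
                         t1::source_name AS source_name, campaigns::campaign_name AS campaign_name;

j3 = JOIN t2 BY offer_id, offers BY offer_id USING 'replicated';
result = FOREACH j3 GENERATE t2::payload AS payload, t2::source_name AS source_name,
                             t2::campaign_name AS campaign_name, offers::offer_name AS offer_name;
DUMP result;

If the lookups really have no join key and each one filters down to a single row, the replicated JOIN BY 1 pattern further down the thread is the closer fit.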
On Tue, Nov 5, 2013 at 11:07 AM, burakkk <[EMAIL PROTECTED]> wrote:

> There are several small, different lookup files, so I need to process each
> lookup file individually. Following your example, it would go something like this:
>
> a = LOAD 'small1'; -- for example, take source_id=1 --> then find source_name
> d = LOAD 'small2'; -- for example, take campaign_id=2 --> then find campaign_name
> e = LOAD 'small3'; -- for example, take offer_id=3 --> then find offer_name
> b = LOAD 'big';
> c = JOIN b BY 1, a BY 1 USING 'replicated';
> f = JOIN c BY 1, d BY 1 USING 'replicated';
> g = JOIN f BY 1, e BY 1 USING 'replicated';
> dump g;
>
> small1, small2 and small3 are different files, so they store different rows.
> At the end of the process I need to attach them to all rows in my big file.
> I know HDFS doesn't perform well with small files, but the data originally
> lives in a different environment; I pull it from there and load it into
> HDFS. Anyway, because of our architecture I can't change that right now.
>
>
> Thanks
> Best regards...
>
>
> On Tue, Nov 5, 2013 at 7:43 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
>
> > CROSS is grossly expensive to compute, so I'm not surprised that the
> > performance isn't good enough. Are you repeating your LOAD and FILTER
> > ops for every one of your small files? At the end of the day, what is it
> > that you're trying to accomplish? Find the one row you're after and
> > attach it to all rows in your big file?
> >
> > In terms of using DistributedCache, if you're computing the cross product
> > of two (and no more than two) relations, AND one of the relations is
> > small enough to fit in memory, you can use a replicated JOIN instead,
> > which would be much more performant.
> >
> > A = LOAD 'small';
> > B = LOAD 'big';
> > C = JOIN B BY 1, A BY 1 USING 'replicated';
> > dump C;
> >
> > Note that the smaller relation that will be loaded into memory needs to
> > be specified second in the JOIN statement.
> >
> > Also keep in mind that HDFS doesn't perform well with lots of small
> > files. If your design has (lots of) small files, you might benefit from
> > loading that data into some database (e.g. HBase).
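> >
> > As a very rough sketch (the table name 'source_lookup', the column family
> > 'info', and the field name below are made up for illustration, not taken
> > from this thread), loading such a lookup table from HBase into Pig with
> > the built-in HBaseStorage loader could look like:
> >
> > src = LOAD 'hbase://source_lookup'
> >       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:source_name')
> >       AS (source_name:chararray);
> > -- src could then sit on the small side of a replicated JOIN as above.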
> >
> >
> > On Tue, Nov 5, 2013 at 7:29 AM, burakkk <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > > I'm using Pig 0.8.1-cdh3u5. Is there any method to use the distributed
> > > cache inside Pig?
> > >
> > > My problem is this: I have lots of small files in HDFS, let's say 10
> > > files. Each file contains more than one row, but I need only one row
> > > from each, and there isn't any relationship between them. So I filter
> > > each one down to what I need and then join them without any key (a
> > > cross join). This is my workaround solution:
> > >
> > > a = LOAD 'smallFile1'; -- ex: row count: 1000
> > > b = FILTER a BY myrow == 'filter by exp1';
> > > c = LOAD 'smallFile2'; -- ex: row count: 30000
> > > d = FILTER c BY myrow2 == 'filter by exp2';
> > > e = CROSS b, d;
> > > ...
> > > f = LOAD 'bigFile'; -- ex: row count: 50 million
> > > g = CROSS e, f;
> > >
> > > But its performance isn't good enough. So if I could use the distributed
> > > cache inside the Pig script, I could keep the files that I first read and
> > > filter in memory and look them up there. What is your suggestion? Is
> > > there any other performance-efficient way to do it?
> > >
> > > Thanks
> > > Best regards...
> > >
> > >
> > > --
> > >
> > > *BURAK ISIKLI* | http://burakisikli.wordpress.com
> > >
> >
>
>
>
> --
>
> *BURAK ISIKLI* | http://burakisikli.wordpress.com
>