Re: Pig Distributed Cache
No, as I said before, doing the cross product is my workaround solution. I'll
try to do it with a replicated join and I'll share the results soon.
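Roughly, what I'm going to try looks like this (file names, schemas, key
columns and filter values below are just placeholders):

a  = LOAD 'small1' AS (source_id:int, source_name:chararray);
a1 = FILTER a BY source_id == 1;   -- keep only the single row I need
d  = LOAD 'small2' AS (campaign_id:int, campaign_name:chararray);
d1 = FILTER d BY campaign_id == 2;
B  = LOAD 'big';
-- constant join keys give a cross-style join; the small, filtered relations
-- come after the big one, so they are the ones held in memory
C  = JOIN B BY 1, a1 BY 1 USING 'replicated';
E  = JOIN C BY 1, d1 BY 1 USING 'replicated';
dump E;

If constant join keys cause problems, I can generate a literal key column on
each relation first and join on that instead.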

Thanks
Best regards...
On Tue, Nov 5, 2013 at 9:50 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> I see... do you have to do a full cross product or are you able to do a
> join?
>
>
> On Tue, Nov 5, 2013 at 11:07 AM, burakkk <[EMAIL PROTECTED]> wrote:
>
> > There are several different small lookup files, so I need to process each
> > lookup file individually. Based on your example it could look like this:
> >
> > a = LOAD 'small1'; --for example taking source_id=1 --> then find source_name
> > d = LOAD 'small2'; --for example taking campaign_id=2 --> then find campaign_name
> > e = LOAD 'small3'; --for example taking offer_id=3 --> then find offer_name
> > B = LOAD 'big';
> > C = JOIN B BY 1, a BY 1 USING 'replicated';
> > f = JOIN C BY 1, d BY 1 USING 'replicated';
> > g = JOIN f BY 1, e BY 1 USING 'replicated';
> > dump g;
> >
> > small1, small2 and small3 are different files, so they store different rows.
> > At the end of the process I need to attach them to all rows in my big file.
> > I know HDFS doesn't perform well with small files, but the data is
> > originally stored in a different environment; I pull it from there and load
> > it into HDFS. Anyway, because of our architecture I can't change that right
> > now.
> >
> >
> > Thanks
> > Best regards...
> >
> >
> > On Tue, Nov 5, 2013 at 7:43 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
> >
> > > CROSS is grossly expensive to compute, so I’m not surprised that the
> > > performance isn’t good enough. Are you repeating your LOAD and FILTER ops
> > > for every one of your small files? At the end of the day, what is it that
> > > you’re trying to accomplish? Find the 1 row you’re after and attach it to
> > > all rows in your big file?
> > >
> > > In terms of using DistributedCache, if you’re computing the cross product
> > > of two (and no more than two) relations, AND one of the relations is small
> > > enough to fit in memory, you can use a replicated JOIN instead, which
> > > would be much more performant.
> > >
> > > A = LOAD 'small';
> > > B = LOAD 'big';
> > > C = JOIN B BY 1, A BY 1 USING 'replicated';
> > > dump C;
> > >
> > > Note that the smaller relation that will be loaded into memory needs to
> > > be specified second in the JOIN statement.
> > >
> > > Also keep in mind that HDFS doesn't perform well with lots of small
> > > files. If your design has (lots of) small files, you might benefit from
> > > loading that data into some database (e.g. HBase).
> > >
> > >
> > > On Tue, Nov 5, 2013 at 7:29 AM, burakkk <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > > I'm using Pig 0.8.1-cdh3u5. Is there any method to use the distributed
> > > > cache inside Pig?
> > > >
> > > > My problem is this: I have lots of small files in HDFS, let's say 10
> > > > files. Each file contains more than one row, but I need only one row
> > > > from each, and there isn't any relationship between them. So I filter
> > > > each one down to what I need and then join them without any key (a
> > > > cross join). This is my workaround solution:
> > > >
> > > > a = LOAD 'smallFile1'; -- ex: row count: 1000
> > > > b = FILTER a BY myrow == 'filter by exp1';
> > > > c = LOAD 'smallFile2'; -- ex: row count: 30000
> > > > d = FILTER c BY myrow2 == 'filter by exp2';
> > > > e = CROSS b, d;
> > > > ...
> > > > f = LOAD 'bigFile'; -- ex: row count: 50 million
> > > > g = CROSS e, f;
> > > >
> > > > But its performance isn't good enough. If I could use the distributed
> > > > cache inside a Pig script, I could look up the files that I read and
> > > > filter first directly in memory. What is your suggestion? Is there any
> > > > other, more efficient way to do it?
> > > >
> > > > Thanks
> > > > Best regards...
> > > >
> > > >
> > > > --
> > > >
> > > > *BURAK ISIKLI* | http://burakisikli.wordpress.com
*BURAK ISIKLI* | http://burakisikli.wordpress.com