CROSS is grossly expensive to compute, so I'm not surprised that the
performance isn't good enough. Are you repeating your LOAD and FILTER ops for
every one of your small files? At the end of the day, what is it that
you're trying to accomplish? Finding the one row you're after and attaching it
to every row in your big file?
In terms of using DistributedCache: if you're computing the cross product
of two (and no more than two) relations, AND one of the relations is small
enough to fit in memory, you can use a replicated JOIN instead, which would
be much more performant.
A = LOAD 'small';
B = LOAD 'big';
C = JOIN B BY 1, A BY 1 USING 'replicated';
Note that the smaller relation that will be loaded into memory needs to be
specified second in the JOIN statement.
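Applied to your pipeline, it could look something like the sketch below (relation and field names are taken from your script; joining both sides BY the constant 1 is just a trick to attach the single combined row to every row of the big file, the same effect as your CROSS):

A = LOAD 'smallFile1';
B = FILTER A BY myrow == 'filter by exp1';
C = LOAD 'smallFile2';
D = FILTER C BY myrow2 == 'filter by exp2';
E = CROSS B, D;  -- tiny x tiny, cheap

F = LOAD 'bigFile';
-- E is the small relation, so it goes second and gets replicated into memory
G = JOIN F BY 1, E BY 1 USING 'replicated';

This avoids the CROSS against the 50-million-row relation entirely; only the small filtered rows are shipped to every map task.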
Also keep in mind that HDFS doesn't perform well with lots of small files.
If your design has (lots of) small files, you might benefit from loading
that data into some database (e.g. HBase).
On Tue, Nov 5, 2013 at 7:29 AM, burakkk <[EMAIL PROTECTED]> wrote:
> I'm using Pig 0.8.1-cdh3u5. Is there any method to use distributed cache
> inside Pig?
> My problem is this: I have lots of small files in HDFS. Let's say 10 files.
> Each file contains more than one row, but I need only one row. There
> isn't any relationship between them, so I filter out what I need and
> then join them without any relationship (cross join). This is my workaround:
> a = LOAD 'smallFile1'; -- ex: row count: 1000
> b = FILTER a BY myrow == 'filter by exp1';
> c = LOAD 'smallFile2'; -- ex: row count: 30000
> d = FILTER c BY myrow2 == 'filter by exp2';
> e = CROSS b, d;
> f = LOAD 'bigFile'; -- ex: row count: 50 million
> g = CROSS e, f;
> But its performance isn't good enough. If I could use the distributed cache
> inside a Pig script, I could look up the files that I first read and filter
> in memory. What is your suggestion? Is there any other performance-efficient
> way to do it?
> Best regards...
> *BURAK ISIKLI* | *http://burakisikli.wordpress.com*