Pig >> mail # user >> Pig Distributed Cache


Pig Distributed Cache
Hi,
I'm using Pig 0.8.1-cdh3u5. Is there a way to use the distributed cache
inside Pig?

My problem is this: I have lots of small files in HDFS, say 10 files.
Each file contains more than one row, but I need only one row from each.
There is no key relating them to each other, so I filter each file down to
the row I need and then join them without a key (a cross join). This is my
workaround:

a = LOAD 'smallFile1';   -- e.g. 1,000 rows
b = FILTER a BY myrow == 'filter by exp1';
c = LOAD 'smallFile2';   -- e.g. 30,000 rows
d = FILTER c BY myrow2 == 'filter by exp2';
e = CROSS b, d;
...
f = LOAD 'bigFile';      -- e.g. 50 million rows
g = CROSS e, f;

But its performance isn't good enough. If I could use the distributed cache
inside a Pig script, I could keep the small files I read and filter first in
memory and look them up there. What do you suggest? Is there a more
efficient way to do this?
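One common alternative, if each filtered small relation fits in memory, is Pig's fragment-replicate join, which ships the right-hand relation to every mapper through the distributed cache so the big relation is never shuffled. CROSS itself has no replicated variant, but a keyless cross can be simulated by tagging both sides with the same constant key. This is only a sketch (not tested against 0.8.1-cdh3u5), and the field names below are illustrative:

```
-- Sketch: simulate the CROSS with a map-side (fragment-replicate) join.
-- Assumes b, the filtered small relation, fits in memory on each mapper.
a  = LOAD 'smallFile1' AS (myrow:chararray);
b  = FILTER a BY myrow == 'filter by exp1';
f  = LOAD 'bigFile' AS (col1:chararray, col2:chararray);

-- Tag every row on both sides with the same constant key...
b2 = FOREACH b GENERATE 1 AS k, *;
f2 = FOREACH f GENERATE 1 AS k, *;

-- ...then a replicated join on that key behaves like CROSS, but runs
-- map-side; Pig distributes b2 via the distributed cache.
g  = JOIN f2 BY k, b2 BY k USING 'replicated';
```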

Thanks
Best regards...
--

*BURAK ISIKLI* | http://burakisikli.wordpress.com