I am running some data that isn't huge per se, but I'm performing processing
on it to get it into my final table (RCFile).
One of the challenges is that it arrives in large blocks. For example, I may
have a 70 MB chunk of binary data that I want to load. The process that
generates this data hex-encodes it, so that 70 MB becomes a 140 MB string.
When I insert into the binary field, I use unhex to decode it. My nodes are
not huge: I have 8 nodes with 6 GB of RAM each. A typical load reads the
hex-encoded data from an external load table and then inserts it (no joins,
etc.). Most data loads fine, but chunks above 32 MB in raw size cause
failures. I am working on getting some adjustments made to my source data to
minimize those large chunks.
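For context, the load is essentially the following (table and column names are made up for illustration, not my actual schema):

```sql
-- External staging table holding the hex-encoded payload as a STRING
CREATE EXTERNAL TABLE staging_blobs (
  id STRING,
  payload_hex STRING   -- e.g. a 140 MB string for a 70 MB binary chunk
)
STORED AS TEXTFILE
LOCATION '/data/staging/blobs';

-- Final RCFile table with a BINARY column
CREATE TABLE final_blobs (
  id STRING,
  payload BINARY
)
STORED AS RCFILE;

-- The load itself: decode the hex back to binary on insert, no joins
INSERT INTO TABLE final_blobs
SELECT id, unhex(payload_hex)
FROM staging_blobs;
```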
That being said, what are some things I can do at the Hive/insert level to
reduce the heap space issues? I've tried playing with split size, JVM reuse,
and heap size, but it's all been trial and error. I'd like to see more
real-world examples of conditions where one setting makes sense and another
does not. I'm not looking for Google results here, just examples (even links
to examples) showing that with this type of data or setup you can get lower
memory usage, faster performance, etc. by tweaking these settings. My problem
is that there are so many settings that say "do this" or "do that" without
real-world examples, and that makes it tough to know where to start.
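For reference, the knobs I've been experimenting with look like this in a Hive session (these are MapReduce-era Hadoop properties; the values are just examples of what I've tried, not known-good recommendations):

```sql
-- Smaller max split size => more, lighter map tasks, each holding less data
-- in memory at once; larger => fewer, heavier tasks
SET mapred.max.split.size=67108864;        -- 64 MB

-- Reuse the child JVM across tasks to avoid startup cost (-1 = unlimited)
SET mapred.job.reuse.jvm.num.tasks=-1;

-- Heap given to each child task JVM; has to fit within the node's 6 GB
-- alongside the other task slots
SET mapred.child.java.opts=-Xmx2048m;
```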