The transform scripts (or executables) are run as separate processes, so it
sounds like Hive itself is blowing up. That would be consistent with your
script working fine outside Hive. The Hive or Hadoop logs might have clues.
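One way to test the "separate process" behavior outside Hive is to pipe input into the script the same way Hive's TRANSFORM does: tab-separated rows on stdin, tab-separated rows expected on stdout. This is a minimal sketch, assuming that protocol; the script name in the comment is hypothetical:

```python
# Sketch: reproduce Hive's TRANSFORM invocation outside Hive. Hive pipes
# tab-separated rows to the script's stdin and reads tab-separated rows
# back from its stdout, so a plain pipe is a close stand-in.
import subprocess

def run_transform(script_args, input_lines):
    """Pipe input_lines through the command, Hive-style; return its output lines."""
    proc = subprocess.run(
        script_args,
        input="\n".join(input_lines) + "\n",
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()

# Example with a trivial identity command; substitute something like
# ["python", "my_transform.py"] (hypothetical name) for the real script.
rows = ["a\t1", "b\t2"]
out = run_transform(["cat"], rows)  # -> ["a\t1", "b\t2"]
```

If the script behaves under this harness but not under Hive, the problem is likely on the Hive/Hadoop side of the pipe.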
So, it happens consistently with this one file? I would check that there
isn't a subtle error in the file or in your script's output, say an extra
tab, other stray whitespace, or a malformed data value. If you can find the
line where it blows up, that would help. You could have your script dump
debug data, like an index for each input line and the corresponding
key-value pair, or fold that information into the script's output so it
comes back through the query results. It seems more likely that the problem
is downstream, after the data passes through the query, so you could also
try simplifying the Hive query to just dump the script results and do
nothing else afterwards.
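The debug idea above could be sketched like this: a transform script that tags each output row with the index of the input line it came from, so the query results show exactly how far things got before blowing up. The column layout here is a hypothetical example; adapt it to the real script:

```python
#!/usr/bin/env python
# Debug sketch: echo each input row back with its line index prepended.
# A missing tab, stray whitespace, or malformed value in the input will
# show up as an odd-looking row, and the last index seen before the
# failure points at the offending line.
import sys

def transform(lines):
    for i, line in enumerate(lines):
        # Split on the first tab; a line with no tab yields an empty value.
        key, _, value = line.rstrip("\n").partition("\t")
        yield "%d\t%s\t%s" % (i, key, value)

if __name__ == "__main__":
    for row in transform(sys.stdin):
        print(row)
```

On the Hive side you would point the TRANSFORM clause's USING at this script in place of the real one, with an extra index column in the AS list.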
However, I wouldn't expect problems like those to cause heap exhaustion,
unless one of them somehow triggers an infinite loop.
Can you share your python script, Hive query, table schema(s), and a sample
of the file?
On Wed, Jan 16, 2013 at 9:32 PM, John Omernik <[EMAIL PROTECTED]> wrote:
> I am perplexed: if I run a transform script on a file by itself, it runs
> fine, outputs to standard out, and life is good. If I run the transform
> script on that same file through Hive (with the path and filename being
> passed into the script via transform, so that the python script is doing
> the exact same thing), I get a java heap space error. This process works
> on 99% of files, and I just can't figure out why this file is different.
> How does, say, a python transform script run "in" the java process (if
> that is even what it is doing) so that it causes a heap error in a
> transform script, but not when run without java around?
> I am curious what steps I can take to troubleshoot or eliminate this
*Dean Wampler, Ph.D.*