Hive queries are compiled into different types of tasks (MapReduce, MoveTask, etc.), so a successful MR task in the JobTracker doesn't mean the whole query succeeded. You need to examine the status of the Hive query itself to see whether it succeeded. You can also check Hive's log file under /tmp/<user>/hive.log to debug a failed query.
Also, broken pipe errors are usually caused by the transform script crashing during the MapReduce job. In that case the MR job should fail, and so should the whole Hive query.
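As a rough sketch of what I mean by checking the query status rather than the JobTracker: wrap the Hive invocation and fail loudly on a non-zero exit code. The `hive -e` call and the /tmp/$USER/hive.log path are assumptions based on a default install; adjust them for your setup.

```shell
# Hedged sketch: treat the Hive CLI exit code, not the JT, as the
# source of truth for whether the query succeeded.
run_hive() {
  hive -e "$1"
  status=$?
  if [ $status -ne 0 ]; then
    # Surface the tail of Hive's own log for debugging (assumed default path).
    echo "Hive query failed (exit $status); see /tmp/$USER/hive.log" >&2
    tail -n 50 "/tmp/$USER/hive.log" >&2
    return $status
  fi
}

# Example use in an import script (query is a placeholder):
# run_hive "LOAD DATA INPATH '/logs/day' INTO TABLE requests PARTITION (dt='...')" || exit 1
```

With `|| exit 1` on each call, a failed query stops the import script instead of silently moving on to the next day's logs.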
On May 11, 2011, at 2:16 PM, Tim Spence wrote:
> I've been using Hive in production for two months now. We're mainly using it for processing server logs, about 1-2GB per day (2-2.5 million requests). Typically we import a day's worth of logs at once. That said, sometimes we decide to tweak a calculated column. When that happens, we modify our transformation script and re-import the entire set of logs (~200 days) into ~600 partitions.
> A few days ago I noticed that simple queries, such as a count of page views over a given week, were giving results up to 10% higher than they yielded just a week before. I suspected that we may have "found" unprocessed log files, so I set up a script to re-import the entire inventory of logs and re-run the queries. I got identical results for some weeks, but different results for others. I repeated this experiment and again got different results.
> In the course of this I found that sometimes Hive will create all of the partitions but write no data to them while not reporting any errors in the job tracker. Other times it will fail and leave a stack trace blaming a broken pipe.
> Does anyone have any ideas what I may be doing wrong? I can change our practices whichever way; all I want is confidence that all of my data has been properly imported.