

Suresh Srinivas 2013-03-10, 15:16
Daning Wang 2013-03-11, 20:32
Suresh Srinivas 2013-03-11, 20:52
Re: Hadoop cluster hangs on big hive job
You mean HDFS-4479?

The log seems to indicate the infamous jetty hang issue (MAPREDUCE-2386)
though.
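
For anyone hitting the same symptom, here is a rough Python sketch (not a diagnosis of the jetty issue itself) that scans a TaskTracker log for reduce attempts whose shuffle copy count never advances while reporting 0.00 MB/s. The line format is taken from the log excerpt quoted further down in this thread. The function name `stalled_reducers` and the `min_reports` threshold are made up for illustration, and the sketch assumes each log entry sits on a single line (not wrapped as in the email).

```python
import re
from collections import defaultdict

# One unwrapped TaskTracker progress line, in the format quoted in this thread.
PROGRESS = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO "
    r"org\.apache\.hadoop\.mapred\.TaskTracker: "
    r"(?P<attempt>attempt_\S+) (?P<pct>[\d.]+)% reduce > copy "
    r"\((?P<done>\d+) of (?P<total>\d+) at (?P<rate>[\d.]+) MB/s\)"
)


def stalled_reducers(lines, min_reports=2):
    """Return {attempt_id: (copied, total)} for attempts that look stuck.

    An attempt counts as stuck (hypothetical rule) when it has at least
    `min_reports` progress lines, every one of them reports 0.00 MB/s,
    and the number of copied map outputs is the same in all of them.
    """
    reports = defaultdict(list)
    for line in lines:
        m = PROGRESS.match(line)
        if m:
            reports[m["attempt"]].append(
                (int(m["done"]), int(m["total"]), float(m["rate"]))
            )

    stalled = {}
    for attempt, seen in reports.items():
        if len(seen) < min_reports:
            continue
        copied_counts = {done for done, _, _ in seen}
        if len(copied_counts) == 1 and all(rate == 0.0 for _, _, rate in seen):
            done, total, _ = seen[-1]
            stalled[attempt] = (done, total)
    return stalled
```

Calling `stalled_reducers(open(path))` on a TaskTracker log (the path is whatever your installation uses) gives a dict of suspect attempts. If every reducer on a node shows up there with the same copy count, that matches the pattern in the attached log.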
On Mon, Mar 11, 2013 at 1:52 PM, Suresh Srinivas <[EMAIL PROTECTED]> wrote:

> I have seen one such problem related to big hive jobs that open a lot of
> files. See HDFS-4496 for more details. Snippet from the description:
> The following issue was observed in a cluster that was running a Hive job
> and was writing to 100,000 temporary files (each task is writing to 1000s
> of files). When this job is killed, a large number of files are left open
> for write. Eventually when the lease for open files expires, lease recovery
> is started for all these files in a very short duration of time. This
> causes a large number of commitBlockSynchronization where logSync is
> performed with the FSNamesystem lock held. This overloads the namenode
> resulting in slowdown.
>
> Could this be the cause? Can you check the namenode log to see whether
> there is lease recovery activity? If not, can you send some information
> about what is happening in the namenode logs at the time of this slowdown?
>
>
>
> On Mon, Mar 11, 2013 at 1:32 PM, Daning Wang <[EMAIL PROTECTED]> wrote:
>
>> [hive@mr3-033 ~]$ hadoop version
>> Hadoop 1.0.4
>> Subversion
>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
>> 1393290
>> Compiled by hortonfo on Wed Oct  3 05:13:58 UTC 2012
>>
>>
>> On Sun, Mar 10, 2013 at 8:16 AM, Suresh Srinivas <[EMAIL PROTECTED]>wrote:
>>
>>> What is the version of hadoop?
>>>
>>> Sent from phone
>>>
>>> On Mar 7, 2013, at 11:53 AM, Daning Wang <[EMAIL PROTECTED]> wrote:
>>>
>>> We have a Hive query processing zipped CSV files. The query was scanning
>>> 10 days of data (partitioned by date), around 130G per day. The problem
>>> is not consistent: if you run the job again, it might go through. But the
>>> problem has never happened on smaller jobs (like processing only one
>>> day's data).
>>>
>>> We don't have a space issue.
>>>
>>> I have attached the log file from when the problem happened. It is stuck
>>> as follows (just search for "19706 of 49964"):
>>>
>>> 2013-03-05 15:13:51,587 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:51,811 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:52,551 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:52,760 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:52,946 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:54,742 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>>
>>> Thanks,
>>>
>>> Daning
>>>
>>>
>>> On Thu, Mar 7, 2013 at 12:21 AM, Håvard Wahl Kongsgård <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> hadoop logs?
>>>> On 6 March 2013 21:04, "Daning Wang" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while
>>>>> running big jobs. Basically all the nodes are dead; from the
>>>>> tasktracker's log it looks like it went into some kind of loop forever.
>>>>>
>>>>> All the log entries look like this when the problem happens.
>>>>>
>>>>> Any idea how to debug the issue?
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>>
>>>>> 2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of
samir das mohapatra 2013-03-10, 10:08
pabbathi venki 2013-03-10, 10:30