HBase, mail # user - Sync latency

Re: Sync latency
Todd Lipcon 2012-04-09, 17:15
Hi Placido,

Check dmesg for SCSI controller issues on all the nodes? Sometimes
dead/dying disks or bad firmware can cause 30+ second pauses.
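
For example, a quick sweep of dmesg across the datanodes could look
something like this (just a rough sketch: the hostnames and the keyword
list are placeholders, and running the grep by hand on each node works
just as well):

#!/usr/bin/env python
# Rough sketch: scan dmesg on each node for disk/controller complaints.
# NODES and KEYWORDS are placeholders -- adjust them for your cluster.
import subprocess

NODES = ["dn01", "dn02", "dn03"]   # hypothetical hostnames
KEYWORDS = ("scsi", "ata", "i/o error", "reset", "timeout", "firmware")

for node in NODES:
    # assumes passwordless ssh from the box running the script
    out = subprocess.check_output(["ssh", node, "dmesg"])
    for line in out.decode("utf-8", "replace").splitlines():
        low = line.lower()
        if any(k in low for k in KEYWORDS):
            print("%s: %s" % (node, line))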

-Todd

On Mon, Apr 9, 2012 at 1:47 AM, Placido Revilla
<[EMAIL PROTECTED]> wrote:
> Sorry, that's not the problem. In my logs, block reporting never takes more
> than 50 ms to process, even when I'm experiencing sync pauses of 30 seconds.
>
> The dataset is currently small (1.2 TB), as the cluster has only been running
> live for a couple of months, and I have only slightly over 11K blocks in
> total, which is why block reporting takes so little time.
>
> On Thu, Apr 5, 2012 at 8:16 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
>
>> Hi Placido,
>>
>> Sounds like it might be related to HDFS-2379. Try updating to Hadoop
>> 1.0.1 or CDH3u3 and you'll get a fix for that.
>>
>> You can verify by grepping for "BlockReport" in your DN logs - if the
>> pauses on the HBase side correlate with long block reports on the DNs,
>> the upgrade should fix it.
>>
>> -Todd
>>
>> On Wed, Apr 4, 2012 at 2:30 AM, Placido Revilla
>> <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > I'm having a problem with sync latency on our HBase cluster. Our cluster
>> > is composed of 2 NN (and HBase master) machines and 12 DN (and HBase
>> > regionserver and thrift server) machines. We are having several incidents
>> > a day where the cluster seems to halt all processing for several seconds,
>> > and these times are aligned with WARN logs like these:
>> >
>> > 13:23:55,285 WARN  [IPC Server handler 62 on 60020] wal.HLog IPC Server
>> > handler 62 on 60020 took 10713 ms appending an edit to hlog;
>> > editcount=150694, len~=58.0
>> > 13:23:55,286 WARN  [IPC Server handler 64 on 60020] wal.HLog IPC Server
>> > handler 64 on 60020 took 10726 ms appending an edit to hlog;
>> > editcount=319217, len~=47.0
>> > 13:23:55,286 WARN  [IPC Server handler 118 on 60020] wal.HLog IPC Server
>> > handler 118 on 60020 took 10741 ms appending an edit to hlog;
>> > editcount=373337, len~=49.0
>> > 13:23:55,286 WARN  [IPC Server handler 113 on 60020] wal.HLog IPC Server
>> > handler 113 on 60020 took 10746 ms appending an edit to hlog;
>> > editcount=57912, len~=45.0
>> > 15:39:38,193 WARN  [IPC Server handler 94 on 60020] wal.HLog IPC Server
>> > handler 94 on 60020 took 21787 ms appending an edit to hlog;
>> > editcount=2901, len~=45.0
>> > 15:39:38,194 WARN  [IPC Server handler 82 on 60020] wal.HLog IPC Server
>> > handler 82 on 60020 took 21784 ms appending an edit to hlog;
>> > editcount=29944, len~=46.0
>> > 16:09:38,201 WARN  [IPC Server handler 78 on 60020] wal.HLog IPC Server
>> > handler 78 on 60020 took 10321 ms appending an edit to hlog;
>> > editcount=163998, len~=104.0
>> > 16:09:38,203 WARN  [IPC Server handler 97 on 60020] wal.HLog IPC Server
>> > handler 97 on 60020 took 10205 ms appending an edit to hlog;
>> > editcount=149497, len~=60.0
>> > 16:09:38,203 WARN  [IPC Server handler 68 on 60020] wal.HLog IPC Server
>> > handler 68 on 60020 took 10199 ms appending an edit to hlog;
>> > editcount=318268, len~=63.0
>> > 16:09:38,203 WARN  [IPC Server handler 120 on 60020] wal.HLog IPC Server
>> > handler 120 on 60020 took 10211 ms appending an edit to hlog;
>> > editcount=88001, len~=45.0
>> > 16:09:38,204 WARN  [IPC Server handler 88 on 60020] wal.HLog IPC Server
>> > handler 88 on 60020 took 10235 ms appending an edit to hlog;
>> > editcount=141516, len~=100.0
>> >
>> > The machines in the cluster are pretty powerful (8 HT cores, 48 GB RAM, 6
>> > SATA 7200 RPM disks), so we are not hitting any hardware limit: for
>> > example, CPU is never over 20% used (avg 5%), network bandwidth never
>> > exceeds 100 Mbps (on 1 Gbps links) and 10k packets/s on each RS, RAM is
>> > 50% free (used for disk cache), and random IOPS stays well under 120 per
>> > second (we should be able to sustain over 600). We have also monitored the
>> > GC for pauses (we have 16 GB of heap for the region servers) and we don't
>> > see pauses of more than a couple of tens of milliseconds (or concurrent
>> > sweeps longer than 5 or 6 seconds).
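
For the BlockReport correlation check suggested earlier in the thread,
something along these lines may help (a rough sketch only: the log paths
are placeholders, the WARN lines quoted above show only the time so the
timestamp pattern may need adjusting to your log4j layout, and the exact
BlockReport message wording differs between Hadoop versions):

#!/usr/bin/env python
# Rough sketch: pull the slow-append WARNs out of a regionserver log and the
# BlockReport lines out of a datanode log, then print any slow WAL append
# that happened within a short window of a long block report.
# Paths and regexes are assumptions -- adjust to your log layout and version.
import re
from datetime import datetime, timedelta

RS_LOG = "hbase-regionserver.log"   # placeholder path
DN_LOG = "hadoop-datanode.log"      # placeholder path

# e.g. "2012-04-04 13:23:55,285 WARN ... took 10713 ms appending an edit to hlog"
APPEND_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ WARN .*took (\d+) ms appending")
# The BlockReport message format varies by Hadoop version; this is a guess.
REPORT_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*BlockReport.* (\d+) msecs?")

def parse(path, regex):
    # return a list of (timestamp, milliseconds) pairs for matching log lines
    events = []
    with open(path) as f:
        for line in f:
            m = regex.search(line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                events.append((ts, int(m.group(2))))
    return events

appends = parse(RS_LOG, APPEND_RE)
reports = parse(DN_LOG, REPORT_RE)

WINDOW = timedelta(seconds=30)
for a_ts, a_ms in appends:
    for r_ts, r_ms in reports:
        if abs(a_ts - r_ts) <= WINDOW and r_ms > 1000:
            print("slow append at %s (%d ms) near BlockReport at %s (%d ms)"
                  % (a_ts, a_ms, r_ts, r_ms))

If the slow appends never line up with a long block report, HDFS-2379 is
probably not the culprit, and the disk/controller angle above is the next
thing to check.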

Todd Lipcon
Software Engineer, Cloudera