HBase user mailing list - Extremely long flush times


Re: Extremely long flush times
Jean-Daniel Cryans 2012-08-10, 22:46
Can you jstack while this happens? Anything weird in the datanode log?
Is the load high during that time?
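
For example, something like this against the region server process (using
jps to find the pid is just one option):

  $ jps | grep HRegionServer
  $ jstack <pid_of_region_server> > rs.jstack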

Thx,

J-D

On Fri, Aug 10, 2012 at 3:12 PM, Espinoza,Carlos <[EMAIL PROTECTED]> wrote:
> We were able to replicate this behavior in a pseudo-distributed hbase
> (hbase-0.94.1) environment. We wrote a test program that creates a test
> table "MyTestTable" and populates it with random rows, then it creates a
> row with 60,000 columns and repeatedly updates it. Each column has a 18
> byte qualifier and a 50 byte value. In our tests, when we ran the
> program, we usually never got beyond 15 updates before it would flush
> for a really long time. The rows that are being updated are about 4MB
> each (minues any hbase metadata).
>
> It doesn't seem like it's caused by GC. I turned on gc logging, and
> didn't see any long pauses. This is the gc log during the flush.
> http://pastebin.com/vJKKXDx5
>
> This is the regionserver log with debug on during the same flush
> http://pastebin.com/Fh5213mg
>
> This is the test program we wrote.
> http://pastebin.com/aZ0k5tx2
>
> You should be able to just compile it, and run it against a running
> HBase cluster.
> $ java TestTable
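>
> The program boils down to something like the sketch below (the pastebin
> above is the real code; the column family "f", the row counts, and the
> timing printout here are just illustrative):
>
> import java.util.Random;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.HColumnDescriptor;
> import org.apache.hadoop.hbase.HTableDescriptor;
> import org.apache.hadoop.hbase.client.HBaseAdmin;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class TestTable {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     byte[] family = Bytes.toBytes("f");
>
>     // Create the test table if it isn't there yet.
>     HBaseAdmin admin = new HBaseAdmin(conf);
>     if (!admin.tableExists("MyTestTable")) {
>       HTableDescriptor desc = new HTableDescriptor("MyTestTable");
>       desc.addFamily(new HColumnDescriptor(family));
>       admin.createTable(desc);
>     }
>     admin.close();
>
>     HTable table = new HTable(conf, "MyTestTable");
>     Random rnd = new Random();
>
>     // Populate the table with some random narrow rows first.
>     for (int i = 0; i < 1000; i++) {
>       Put p = new Put(Bytes.toBytes("row-" + rnd.nextInt()));
>       p.add(family, Bytes.toBytes("q"), Bytes.toBytes("v"));
>       table.put(p);
>     }
>
>     // One wide row: 60,000 columns, 18-byte qualifiers, 50-byte values,
>     // updated over and over again.
>     byte[] wideRow = Bytes.toBytes("wide-row");
>     for (int update = 0; update < 100; update++) {
>       Put put = new Put(wideRow);
>       for (int col = 0; col < 60000; col++) {
>         byte[] qualifier = Bytes.toBytes(String.format("%018d", col));
>         byte[] value = new byte[50];
>         rnd.nextBytes(value);
>         put.add(family, qualifier, value);
>       }
>       long start = System.currentTimeMillis();
>       table.put(put);
>       System.out.println("update " + update + " took "
>           + (System.currentTimeMillis() - start) + " ms");
>     }
>     table.close();
>   }
> }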
>
> Carlos
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> Jean-Daniel Cryans
> Sent: Monday, August 06, 2012 8:04 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Extremely long flush times
>
> Hi Ron,
>
> I think this is caused by GC; for example, there are about 7 minutes
> between the last blocking message and the end of the flush:
>
> 2012-08-06 11:44:36,348 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC
> Server handler 42 on 9009' on region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6.:
> memstore size 271.9m is >= than blocking 256.0m size
> 2012-08-06 11:51:28,139 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
> ~155.4m/162905016, currentsize=116.5m/122172880 for region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> 501337ms, sequenceid=1792264092, compaction requested=true
>
> During that time I would expect that clients got socket timeouts, tried
> to write again, and data would just pile up in the RPC queue.
>
> Do you have the gc log handy? Also did you set the zookeeper session
> timeout higher than the default of 3 minutes?
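>
> If you do end up raising it, it's the zookeeper.session.timeout property
> in hbase-site.xml, in milliseconds (the value below is just an example;
> the ZooKeeper server's own maxSessionTimeout caps it as well):
>
>   <property>
>     <name>zookeeper.session.timeout</name>
>     <value>600000</value>
>   </property>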
>
> Thx,
>
> J-D
>
> On Mon, Aug 6, 2012 at 2:37 PM, Buckley,Ron <[EMAIL PROTECTED]> wrote:
>> We ran into this situation today on our production instance, and I'm
>> wondering if anyone has seen it or knows a way around it.
>>
>>
>>
>> We were repeatedly updating the same 60,000-column-wide row. With
>> each update, we were updating nearly every column of the row. Each
>> update generated nearly 9 MB in the WAL.
>>
>> This worked OK most of the time. We saw frequent messages like this in
>> the region server log, indicating a 2.6 second flush time:
>>
>> 2012-08-06 11:42:52,316 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>> of ~155.4m/162931400, currentsize=2.7k/2760 for region
>> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
>> 2610ms, sequenceid=1792263953, compaction requested=false
>>
>>
>>
>> However, sometimes we would see flush times of around 8 to 10 minutes.
>>
>> 2012-08-06 11:51:28,139 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>> of ~155.4m/162905016, currentsize=116.5m/122172880 for region
>> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
>> 501337ms, sequenceid=1792264092, compaction requested=true
>>
>>
>>
>> During these long flushes, we would end up using all of the IPC
>> handler slots we had configured for that region server (50). To our
>> application, it looked like the region server was unavailable.
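>>
>> For reference, the 50 handlers come from hbase.regionserver.handler.count
>> in our hbase-site.xml:
>>
>>   <property>
>>     <name>hbase.regionserver.handler.count</name>
>>     <value>50</value>
>>   </property>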
>>
>>
>>
>> After the long delay, the region server would continue running fine,
>> but then do it again a few minutes later.
>>
>>
>>
>> I thought it was most likely something with Garbage Collection, so I