HBase, mail # user - data loss due to regionserver going down


Re: data loss due to regionserver going down
Suraj Varma 2011-07-27, 17:29
When you shut down the region server, check the master logs to see whether
the master has detected it.
I've seen weird things happen if DNS is not set up correctly, so
check that the master (logs and UI) correctly detects that the region
server is down after step 2.

--Suraj
2011/7/27 吴限 <[EMAIL PROTECTED]>:
> Just by repeatedly checking http://master:60010.
> Before Step 2:
>
> Address                Start Code     Load
> server4.yun.com:60030  1311785159202  requests=0, regions=10, usedHeap=32, maxHeap=995
> server5.yun.com:60030  1311768553647  requests=18, regions=7, usedHeap=117, maxHeap=995
> Total: servers: 2                     requests=18, regions=17
>
> Then at Step 2, I shut down server4 and waited until the page showed this:
>
> Address                Start Code     Load
> server5.yun.com:60030  1311768553647  requests=18, regions=17, usedHeap=117, maxHeap=995
> Total: servers: 2                     requests=18, regions=17
>
> Then I continued with the following steps.
>
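The two status snapshots quoted above can be compared mechanically. The following is a minimal Python sketch, assuming the "Load" strings are copied by hand from the master page; `parse_load` and `total_regions` are hypothetical helper names, not part of HBase. It only checks that the region count is conserved after the shutdown, which says nothing about whether the rows in those regions are actually readable.

```python
import re

def parse_load(load):
    """Turn 'requests=18, regions=7, ...' into a dict of integers."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", load)}

def total_regions(servers):
    """Sum the 'regions' field over all listed region servers."""
    return sum(parse_load(load)["regions"] for load in servers.values())

# Values copied from the status page output quoted in this thread.
before = {
    "server4.yun.com:60030": "requests=0, regions=10, usedHeap=32, maxHeap=995",
    "server5.yun.com:60030": "requests=18, regions=7, usedHeap=117, maxHeap=995",
}
after = {
    "server5.yun.com:60030": "requests=18, regions=17, usedHeap=117, maxHeap=995",
}

regions_before = total_regions(before)
regions_after = total_regions(after)
all_reassigned = regions_before == regions_after
print(regions_before, regions_after, all_reassigned)  # 17 17 True
```

Under these assumptions, all 17 regions show up on server5 after server4 goes down, which is consistent with the master having noticed the shutdown and reassigned the regions.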
> On July 28, 2011, at 12:40 AM, Chris Tarnas <[EMAIL PROTECTED]> wrote:
>
>> That is strange behavior. How long did you wait between Step 2 and 3, and
>> what are the results of running
>>
>> hbase hbck
>>
>> at step 3?
>>
>> -chris
>>
>> On Jul 27, 2011, at 9:23 AM, 吴限 wrote:
>>
>> > Thanks for your reply. But later I actually ran another experiment, similar
>> > to the one I described earlier.
>> > Step 1: I inserted some data into HBase.
>> > Step 2: I shut down one of the region servers.
>> > Step 3: I checked the table and found some data had been lost.
>> > Step 4: I disabled the table and then enabled it again.
>> > Step 5: I checked again and found nothing lost.
>> >
>> > If some of the data didn't exist on the other region server, how can you
>> > explain this?
>> >
>> > Hope to get your reply. Thanks.
>> >
>> > 2011/7/28 Chris Tarnas <[EMAIL PROTECTED]>
>> >
>> >> Replication of 1x means no replication. 2x would mean the data exists in
>> >> two locations (what it looks like you want). Running with a replication
>> >> of 1x is a very bad idea and is pretty much a guaranteed way to get data
>> >> loss.
>> >>
>> >> -chris
>> >>
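The point about 1x versus 2x replication can be illustrated with a toy model. This is a sketch only: it places blocks round-robin across nodes, which is an assumption made for simplicity and not how HDFS actually chooses block locations, and the names `place_blocks` and `unavailable_blocks` are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (toy model)."""
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the number of nodes")
    return {
        block: {nodes[(block + i) % len(nodes)] for i in range(replication)}
        for block in range(num_blocks)
    }

def unavailable_blocks(placement, dead_node):
    """Blocks whose only copy lived on the dead node."""
    return [block for block, holders in placement.items() if holders == {dead_node}]

nodes = ["server4", "server5"]

# Replication 1x: every block has a single copy, so losing a node loses its blocks.
single = place_blocks(10, nodes, replication=1)
lost_with_1x = unavailable_blocks(single, "server4")

# Replication 2x: every block also lives on the surviving node.
double = place_blocks(10, nodes, replication=2)
lost_with_2x = unavailable_blocks(double, "server4")

print(len(lost_with_1x), len(lost_with_2x))  # 5 0
```

In this model, taking server4 down with a replication of 1 makes half of the blocks unreachable until the node returns, while a replication of 2 leaves every block with a live copy on server5.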
>> >> On Jul 27, 2011, at 8:58 AM, 吴限 wrote:
>> >>
>> >>> Hi everyone. I'd like to run the following data loss scenario by you to
>> >>> see if we are doing something obviously wrong with our setup here.
>> >>>
>> >>> Setup:
>> >>>  - cdh3u0
>> >>>  - Hadoop 0.20.2
>> >>>  - HBase 0.90.1
>> >>>  - 1 Master Node running as NameNode & JobTracker
>> >>>  - ZooKeeper quorum
>> >>>  - 2 child nodes running as Datanode, TaskTracker and RegionServer each
>> >>>  - dfs.replication is set to 1
>> >>>
>> >>> First, I inserted some data into HBase a few hours ago.
>> >>> Then, after a while, I rebooted one of the region servers and waited until
>> >>> the master responded to that. However, when I checked the table using the
>> >>> hbase shell (I used the "count" command), I noticed that a huge amount of
>> >>> data was missing.
>> >>> After I restarted the regionserver that I had rebooted and checked again,
>> >>> I found that some of the missing data had come back, but some data was
>> >>> still missing.
>> >>> Finally, after I disabled and then re-enabled the table, I found that
>> >>> all the data was stored in the cluster and nothing had been lost.
>> >>>
>> >>> This is problematic since we are supposed to
>> >>> replicate at x1, so at least one other node should theoretically be able to
>> >>> serve the data that the downed regionserver can't.
>> >>>
>> >>> Questions:
>> >>>
>> >>>  - How can you explain this weird situation?
>> >>>  - Is there a way to recover such lost data?
>> >>>
>> >>> Any tips here are definitely appreciated. I'll be happy to provide more
>> >>> information as well.
>> >>
>> >>
>>
>>
>