HBase >> mail # user >> data loss due to regionserver going down


Re: data loss due to regionserver going down
When you shut down the region server, check the master logs to see whether
the master has detected this condition.
I've seen weird things happen if DNS is not set up correctly, so
check whether the master (logs and UI) correctly detects that the region
server is down after step 2.

--Suraj
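
The master UI check described above can be sketched as a small script. This is only a minimal illustration, not part of the original messages: it takes the server rows quoted further down in this thread and checks that the stopped server has disappeared and that the total region count is unchanged. The helper names (`parse_load`, `regions_by_server`, `check_failover`) are hypothetical, and the rows are typed in by hand rather than fetched from http://master:60010.

```python
import re

# Rows as pasted from the master UI in this thread: (address, start code, load).
BEFORE = [
    ("server4.yun.com:60030", 1311785159202,
     "requests=0, regions=10, usedHeap=32, maxHeap=995"),
    ("server5.yun.com:60030", 1311768553647,
     "requests=18, regions=7, usedHeap=117, maxHeap=995"),
]
AFTER = [
    ("server5.yun.com:60030", 1311768553647,
     "requests=18, regions=17, usedHeap=117, maxHeap=995"),
]


def parse_load(load):
    """Turn 'requests=0, regions=10, ...' into {'requests': 0, 'regions': 10, ...}."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", load)}


def regions_by_server(rows):
    """Map each region server address to the number of regions it reports."""
    return {address: parse_load(load)["regions"] for address, _start_code, load in rows}


def check_failover(before, after, stopped):
    """Return (server_gone, regions_conserved) for a stopped region server."""
    b, a = regions_by_server(before), regions_by_server(after)
    server_gone = stopped in b and stopped not in a
    regions_conserved = sum(b.values()) == sum(a.values())
    return server_gone, regions_conserved


if __name__ == "__main__":
    print(check_failover(BEFORE, AFTER, "server4.yun.com:60030"))
```

With the figures from this thread it prints `(True, True)`: server4 is gone and all 17 regions are assigned to server5. That only confirms region assignment as the master sees it; it says nothing about whether the underlying HDFS blocks are readable.
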
2011/7/27 吴限 <[EMAIL PROTECTED]>:
> Just by repeatedly checking http://master:60010.
> Before Step 2:
>
> Address                Start Code     Load
> server4.yun.com:60030  1311785159202  requests=0, regions=10, usedHeap=32, maxHeap=995
> server5.yun.com:60030  1311768553647  requests=18, regions=7, usedHeap=117, maxHeap=995
> Total: servers: 2                     requests=18, regions=17
>
> Then at Step 2, I shut down server4 and waited until the page showed this:
>
> Address                Start Code     Load
> server5.yun.com:60030  1311768553647  requests=18, regions=17, usedHeap=117, maxHeap=995
> Total: servers: 2                     requests=18, regions=17
>
> Then I continued with the following steps.
>
> On July 28, 2011, at 12:40 AM, Chris Tarnas <[EMAIL PROTECTED]> wrote:
>
>> That is strange behavior. How long did you wait between Step 2 and 3, and
>> what are the results of running
>>
>> hbase hbck
>>
>> at step 3?
>>
>> -chris
>>
>> On Jul 27, 2011, at 9:23 AM, 吴限 wrote:
>>
>> > Thanks for your reply. But later I actually ran another experiment similar to
>> > the one I explained earlier.
>> > Step 1: I inserted some data into HBase.
>> > Step 2: I shut down one of the region servers.
>> > Step 3: I checked the table and found some data had been lost.
>> > Step 4: I disabled the table and then enabled it again.
>> > Step 5: I checked again and found nothing lost.
>> >
>> > If some data didn't exist on the other region server, then how can you explain
>> > this?
>> >
>> > Hope to get your reply. Thanks.
>> >
>> > 2011/7/28 Chris Tarnas <[EMAIL PROTECTED]>
>> >
>> >> Replication of 1x means no replication. 2x would mean the data exists in
>> >> two locations (what it looks like you want). Running with a replication of
>> >> 1x is a very bad idea and is pretty much a guaranteed way to get data loss.
>> >>
>> >> -chris
>> >>
>> >> On Jul 27, 2011, at 8:58 AM, 吴限 wrote:
>> >>
>> >>> Hi everyone. I'd like to run the following data loss scenario by you to
>> >>> see if we are doing something obviously wrong with our setup here.
>> >>>
>> >>> Setup:
>> >>>  - cdh3u0
>> >>>  - Hadoop 0.20.2
>> >>>  - HBase 0.90.1
>> >>>  - 1 Master Node running as NameNode & JobTracker
>> >>>  - ZooKeeper quorum
>> >>>  - 2 child nodes running as DataNode, TaskTracker and RegionServer each
>> >>>  - dfs.replication is set to 1
>> >>>
>> >>> First, I inserted some data into HBase a few hours ago.
>> >>> Then, after a while, I rebooted one of the region servers and waited until
>> >>> the master responded to that. However, when I checked the table using the
>> >>> hbase shell (I used the "count" command), I noticed that a huge amount of
>> >>> data was missing.
>> >>> After I restarted the regionserver that I had rebooted and checked again,
>> >>> I found that some of the missing data had come back, but some data was
>> >>> still missing.
>> >>> Finally, after I disabled the table and then enabled it again, I found that
>> >>> all the data was stored in the cluster and nothing had been lost.
>> >>>
>> >>> This is problematic since we are supposed to replicate at x1, so at least
>> >>> one other node should theoretically be able to serve the data that the
>> >>> downed regionserver can't.
>> >>>
>> >>> Questions:
>> >>>
>> >>>  - How can you guys explain this weird situation?
>> >>>  - Is there a way to recover such lost data?
>> >>>
>> >>> Any tips here are definitely appreciated. I'll be happy to provide more
>> >>> information as well.
>> >>
>> >>
>>
>>
>
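
Chris's point about replication (1x means a single copy, 2x means the data exists in two locations) can be illustrated with a toy model. This is a sketch under stated assumptions, not how HDFS actually places blocks: the block ids and placements are hypothetical, `readable_blocks` is an illustrative helper, and it assumes the DataNode on a machine goes down together with its region server, as when the whole node is rebooted.

```python
def readable_blocks(placement, down):
    """Return the ids of blocks that still have a replica on a live datanode."""
    down = set(down)
    return {block for block, nodes in placement.items() if set(nodes) - down}


# Hypothetical placement of two blocks on the two datanodes from the thread.
REPLICATION_1 = {"blk_a": ["server4"], "blk_b": ["server5"]}
REPLICATION_2 = {"blk_a": ["server4", "server5"], "blk_b": ["server5", "server4"]}

if __name__ == "__main__":
    # With one copy per block, anything stored only on server4 is unreachable.
    print(sorted(readable_blocks(REPLICATION_1, ["server4"])))
    # With two copies per block, every block survives the loss of one node.
    print(sorted(readable_blocks(REPLICATION_2, ["server4"])))
```

In this model, a block held only on the stopped node becomes readable again as soon as that node comes back, so missing data that reappears later looks like temporary unavailability rather than permanent loss.
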