
HBase >> mail # user >> Timestamp as a key good practice?

Re: Timestamp as a key good practice?
If you have a really small cluster...
You can put your HMaster, JobTracker, NameNode, and ZooKeeper all on a single node (the Secondary NameNode too).
Then your data nodes each run a DataNode, TaskTracker, and RegionServer.

That would avoid any ZooKeeper/RegionServer co-location problems.

On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:

> Hi Mike, Hi Rob,
> Thanks for your replies and advice. It seems I'm now due for some
> implementation. I'm reading Lars' book first, and when I'm done I
> will start with the coding.
> I already have my Zookeeper/Hadoop/HBase running and based on the
> first pages I read, I already know it's not well done since I have put
> a DataNode and a Zookeeper server on ALL the servers ;) So. More
> reading for me for the next few days, and then I will start.
> Thanks again!
> JM
> 2012/6/16, Rob Verkuylen <[EMAIL PROTECTED]>:
>> Just to add from my experiences:
>> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine
>> can handle 3,000-4,000 puts a second with ease, and a simple timerange scan
>> can give you the records you need. I have my doubts you will be hitting
>> these volumes anytime soon. A simple setup will get your PoC going; then
>> scale when you need to scale.
>> Rob
>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel
>> <[EMAIL PROTECTED]> wrote:
>>> Jean-Marc,
>>> You indicated that you didn't want to do full table scans when you want
>>> to find out which files hadn't been touched since X time has passed.
>>> (X could be months, weeks, days, hours, etc.)
>>> So here's the thing.
>>> First,  I am not convinced that you will have hot spotting.
>>> Second, you end up now having to do 26 scans instead of one. Then you
>>> need to join the result sets.
>>> Not really a good solution if you think about it.
>>> Oh and I don't believe that you will be hitting a single region, although
>>> you may hit  a region hard.
>>> (Your second table's key is on the timestamp of the last update to the
>>> file.  If the file hadn't been touched in a week, there's the probability
>>> that at scale, it won't be in the same region as a file that had recently
>>> been touched. )
>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only
>>> be applied to a subset of problems.
>>> (Think round-robin partitioning in a RDBMS. DB2 was big on this.)
>>> HTH
>>> -Mike
>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>> Let's imagine the timestamp is "123456789".
>>>> If I salt it with a letter from 'a' to 'z' then it will always be split
>>>> between a few RegionServers. I will have something like "t123456789".
>>>> The issue is that I will have to do 26 queries to find all the entries.
>>>> I will need to query from A000000000 to Axxxxxxxxx, then same for B,
>>>> and so on.
>>>> So what's worse? Am I better off dealing with the hotspotting? Salting
>>>> the key myself? Or what if I use something like HBaseWD?
>>>> JM
>>>> 2012/6/16, Michel Segel <[EMAIL PROTECTED]>:
>>>>> You can't salt the key in the second table.
>>>>> By salting the key, you lose the ability to do range scans, which is
>>> what
>>>>> you want to do.
>>>>> Sent from a remote device. Please excuse any typos...
>>>>> Mike Segel
>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <
>>>>> wrote:
>>>>>> Thanks all for your comments and suggestions. Regarding the
>>>>>> hotspotting I will try to salt the key in the 2nd table and see the
>>>>>> results.
>>>>>> Yesterday I finished installing my 4-server cluster with old
>>>>>> machines.
>>>>>> It's slow, but it's working. So I will do some testing.
>>>>>> You are recommending modifying the timestamp to be to the second or
>>>>>> the minute and having more entries per row. Is that because it's better
>>>>>> to have more columns than rows? Or is it more because that will allow
>>>>>> a more "squared" pattern (lots of rows, lots of columns) which if
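The salting scheme discussed in this thread (prefixing the timestamp key with a letter 'a'-'z', which spreads sequential writes across regions at the cost of fanning one time-range query out into 26 scans whose results must be merged) can be sketched in plain Java. This is a minimal illustration of the key layout only; the class and method names are hypothetical, and real code would issue the 26 scans through the HBase client API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a salted row-key scheme, assuming a fixed-width 10-digit
// timestamp and a one-letter salt 'a'..'z' derived from the file path.
public class SaltedKeys {

    // Pick the salt deterministically so the same file always maps to
    // the same prefix (hash of the path, mod 26).
    static char saltFor(String filePath) {
        return (char) ('a' + Math.floorMod(filePath.hashCode(), 26));
    }

    // Row key = salt letter + zero-padded timestamp, e.g. "t0123456789".
    static String saltedKey(String filePath, long timestamp) {
        return saltFor(filePath) + String.format("%010d", timestamp);
    }

    // A query for timestamps in [start, stop) must be fanned out into
    // 26 sub-scans, one per salt prefix; the caller merges the results.
    static List<String[]> scanRanges(long start, long stop) {
        List<String[]> ranges = new ArrayList<>();
        for (char s = 'a'; s <= 'z'; s++) {
            ranges.add(new String[] {
                s + String.format("%010d", start),
                s + String.format("%010d", stop)
            });
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("/data/file1", 123456789L));
        List<String[]> ranges = scanRanges(123456789L, 123459999L);
        System.out.println(ranges.size());        // 26 sub-scans
        System.out.println(ranges.get(0)[0]);     // "a0123456789"
    }
}
```

This makes Mike's objection concrete: the write path is simple, but every timerange read multiplies into 26 scans plus a merge, which an unsalted sequential key answers with a single scan.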