HBase >> mail # user >> Timestamp as a key good practice?


Re: Timestamp as a key good practice?
If you have a really small cluster...
You can put your HMaster, JobTracker, NameNode, and ZooKeeper all on a single node. (Secondary NameNode too.)
Then your worker nodes each run a DataNode (DN), TaskTracker (TT), and RegionServer (RS).

That would solve any ZooKeeper/RegionServer contention problems.

On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:

> Hi Mike, Hi Rob,
>
> Thanks for your replies and advice. Seems that now I'm due for some
> implementation. I'm reading Lars' book first, and when I'm done I
> will start with the coding.
>
> I already have my ZooKeeper/Hadoop/HBase running, and based on the
> first pages I read, I already know it's not set up well, since I have put
> a DataNode and a ZooKeeper server on ALL the servers ;) So, more
> reading for me for the next few days, and then I will start.
>
> Thanks again!
>
> JM
>
> 2012/6/16, Rob Verkuylen <[EMAIL PROTECTED]>:
>> Just to add from my experiences:
>>
>> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine
>> can handle 3,000-4,000 puts a second with ease, and a simple timerange scan
>> can give you the records you need. I have my doubts you will be hitting those
>> volumes anytime soon. A simple setup will get you your PoC; then scale when
>> you need to scale.
>>
>> Rob
>>
>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Jean-Marc,
>>>
>>> You indicated that you didn't want to do full table scans when you want
>>> to find out which files hadn't been touched since X time has passed.
>>> (X could be months, weeks, days, hours, etc...)
>>>
>>> So here's the thing.
>>> First,  I am not convinced that you will have hot spotting.
>>> Second, you end up having to do 26 scans instead of one, and then you
>>> need to join the result sets.
>>>
>>> Not really a good solution if you think about it.
>>>
>>> Oh, and I don't believe that you will be hitting a single region, although
>>> you may hit a region hard.
>>> (Your second table's key is the timestamp of the last update to the
>>> file. If a file hasn't been touched in a week, the probability is
>>> that at scale, it won't be in the same region as a file that was
>>> recently touched.)
>>>
>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only
>>> be applied to a subset of problems.
>>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>>
>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>
>>>> Let's imagine the timestamp is "123456789".
>>>>
>>>> If I salt it with a letter from 'a' to 'z' then it will always be split
>>>> across a few RegionServers. I will have keys like "t123456789". The issue
>>>> is that I will have to do 26 queries to be able to find all the entries.
>>>> I will need to query from a000000000 to axxxxxxxxx, then the same for b,
>>>> and so on.
>>>>
>>>> So what's worse? Am I better off dealing with the hotspotting? Salting
>>>> the key myself? Or what if I use something like HBaseWD?
>>>>
>>>> JM
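The salting scheme JM describes above can be sketched in plain Java. This is only an illustration: the class and method names are mine, not from the thread, and deriving the salt by hashing a file id is one common choice, not something JM specified.

```java
// Sketch of a one-letter salt 'a'..'z' prefixed to an epoch timestamp,
// as described in the thread. Names here are illustrative.
public class SaltedKeys {

    // Prefix the timestamp with a salt derived from the file id so that
    // consecutive timestamps spread across up to 26 key ranges.
    static String saltedKey(String fileId, long timestamp) {
        char salt = (char) ('a' + Math.floorMod(fileId.hashCode(), 26));
        return salt + String.valueOf(timestamp);
    }

    // A time-range query must now fan out into 26 scans, one per salt;
    // the caller merges the 26 result sets afterwards.
    static String[][] scanRanges(long startTs, long stopTs) {
        String[][] ranges = new String[26][2];
        for (int i = 0; i < 26; i++) {
            char salt = (char) ('a' + i);
            ranges[i][0] = salt + String.valueOf(startTs); // start row (inclusive)
            ranges[i][1] = salt + String.valueOf(stopTs);  // stop row (exclusive)
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("file-42", 123456789L));
        String[][] r = scanRanges(123456789L, 123456999L);
        System.out.println(r[0][0] + " .. " + r[0][1]);
        System.out.println(r.length + " scans needed");
    }
}
```

This makes the trade-off concrete: writes spread out, but every time-window read multiplies into 26 scans whose results must be merged.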
>>>>
>>>> 2012/6/16, Michel Segel <[EMAIL PROTECTED]>:
>>>>> You can't salt the key in the second table.
>>>>> By salting the key, you lose the ability to do range scans, which is
>>>>> what you want to do.
>>>>>
>>>>>
>>>>>
>>>>> Sent from a remote device. Please excuse any typos...
>>>>>
>>>>> Mike Segel
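Mike's point about losing range scans can be illustrated with a sorted set standing in for HBase's lexicographic row ordering. This is a toy model, not the HBase client API; the zero-padded key format is an assumption for the sake of the example.

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class RangeScanDemo {

    // A TreeSet stands in for a region: HBase keeps rows sorted
    // lexicographically by key, just like this set does.
    static SortedSet<String> timeWindow(long startTs, long stopTs) {
        TreeSet<String> table = new TreeSet<>();
        // Unsalted, zero-padded timestamp keys sort chronologically.
        for (long ts = 100; ts < 110; ts++) {
            table.add(String.format("%010d", ts));
        }
        // One [start, stop) range scan returns exactly the time window --
        // this contiguity is the property a salt prefix destroys, because
        // the same window would then be scattered across 26 disjoint ranges.
        return table.subSet(String.format("%010d", startTs),
                            String.format("%010d", stopTs));
    }

    public static void main(String[] args) {
        System.out.println(timeWindow(103L, 107L)); // 4 contiguous rows
    }
}
```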
>>>>>
>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari
>>>>> <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> Thanks all for your comments and suggestions. Regarding the
>>>>>> hotspotting, I will try salting the key in the 2nd table and see the
>>>>>> results.
>>>>>>
>>>>>> Yesterday I finished installing my 4-server cluster with old
>>>>>> machines.
>>>>>> It's slow, but it's working. So I will do some testing.
>>>>>>
>>>>>> You are recommending rounding the timestamp to the second or
>>>>>> minute and having more entries per row. Is that because it's better to
>>>>>> have more columns than rows? Or is it more because that will allow
>>>>>> a more "squared" pattern (lots of rows, lots of columns) which if