HBase user mailing list: Timestamp as a key good practice?


Jean-Marc Spaggiari 2012-06-13, 16:16
Otis Gospodnetic 2012-06-14, 06:06
Jean-Marc Spaggiari 2012-06-14, 10:39
Michael Segel 2012-06-14, 11:55
Jean-Marc Spaggiari 2012-06-14, 12:22
Michael Segel 2012-06-14, 18:14
Jean-Marc Spaggiari 2012-06-14, 18:47
Michael Segel 2012-06-14, 19:46
Michael Segel 2012-06-15, 14:21
Jean-Marc Spaggiari 2012-06-16, 11:22
Michel Segel 2012-06-16, 14:35
Jean-Marc Spaggiari 2012-06-16, 14:42
Michael Segel 2012-06-16, 16:33
Rob Verkuylen 2012-06-16, 19:10
Jean-Marc Spaggiari 2012-06-21, 11:43
Re: Timestamp as a key good practice?
If you have a really small cluster...
You can put your HMaster, JobTracker, NameNode, and ZooKeeper all on a single node (the Secondary NameNode too).
Then you have data nodes that run the DataNode (DN), TaskTracker (TT), and RegionServer (RS).

That would solve any ZK RS problems.

On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:

> Hi Mike, Hi Rob,
>
> Thanks for your replies and advice. It seems that I'm now due for some
> implementation. I'm reading Lars' book first, and when I am done I
> will start with the coding.
>
> I already have my Zookeeper/Hadoop/HBase running and based on the
> first pages I read, I already know it's not well done since I have put
> a DataNode and a Zookeeper server on ALL the servers ;) So. More
> reading for me for the next few days, and then I will start.
>
> Thanks again!
>
> JM
>
> 2012/6/16, Rob Verkuylen <[EMAIL PROTECTED]>:
>> Just to add from my experiences:
>>
>> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine
>> can handle 3,000-4,000 puts a second with ease, and a simple timerange scan
>> can give you the records you need. I have my doubts you will be hitting
>> these amounts anytime soon. A simple setup will get you your PoC; then
>> scale when you need to scale.
>>
>> Rob
>>
>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>
>>> Jean-Marc,
>>>
>>> You indicated that you didn't want to do full table scans when you want
>>> to find out which files hadn't been touched since X time has passed.
>>> (X could be months, weeks, days, hours, etc.)
>>>
>>> So here's the thing.
>>> First, I am not convinced that you will have hot spotting.
>>> Second, you end up having to now do 26 scans instead of one. Then you
>>> need to join the result set.
>>>
>>> Not really a good solution if you think about it.
>>>
>>> Oh, and I don't believe that you will be hitting a single region, although
>>> you may hit a region hard.
>>> (Your second table's key is on the timestamp of the last update to the
>>> file. If the file hadn't been touched in a week, there's the probability
>>> that at scale, it won't be in the same region as a file that had recently
>>> been touched.)
>>>
>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only
>>> be applied to a subset of problems.
>>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>>
>>> HTH
>>>
>>> -Mike
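
The single range scan Mike describes can be sketched as follows. This is a minimal illustration, not HBase client code: `SortedTable` is a hypothetical in-memory stand-in for a table whose rows are sorted by key, and `row_key`, the file names, and the cutoff value are made up for the example. It assumes the second table's key starts with the zero-padded last-update timestamp.

```python
import bisect


class SortedTable:
    """Hypothetical stand-in for a table that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


def row_key(last_update, file_id):
    # Zero-pad the timestamp so string order matches numeric order.
    return f"{last_update:010d}:{file_id}"


index = SortedTable()
index.put(row_key(100, "a.txt"), "a.txt")
index.put(row_key(250, "b.txt"), "b.txt")
index.put(row_key(900, "c.txt"), "c.txt")

# One range scan returns every file last touched before the cutoff.
cutoff = 300
stale = [name for _, name in index.scan(row_key(0, ""), row_key(cutoff, ""))]
print(stale)
```

Because the key is not salted, the rows for old timestamps sit together at the start of the key space, so the scan never has to look at recently touched files.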
>>>
>>>
>>>
>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>
>>>> Let's imagine the timestamp is "123456789".
>>>>
>>>> If I salt it with a letter from 'a' to 'z', then it will always be split
>>>> across a few RegionServers. I will have something like "t123456789".
>>>> The issue is that I will have to do 26 queries to be able to find all
>>>> the entries. I will need to query from A000000000 to Axxxxxxxxx, then
>>>> the same for B, and so on.
>>>>
>>>> So what's worse? Am I better off dealing with the hotspotting? Salting
>>>> the key myself? Or what if I use something like HBaseWD?
>>>>
>>>> JM
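
To make the cost of the salted layout concrete, here is a minimal sketch of the 26 scans and the merge step. It is an illustration only and does not use the HBase or HBaseWD APIs: the sorted list of keys stands in for the table, and `salted_key`, `range_scan`, `scan_all_salts`, and the way the salt letter is derived from the file name are all assumptions made for the example.

```python
import bisect
import heapq
import string

SALTS = string.ascii_lowercase  # 26 one-letter prefixes, 'a' to 'z'


def salted_key(timestamp, file_id):
    # Hypothetical salt choice: derive the letter from the file id so it is stable.
    salt = SALTS[sum(file_id.encode()) % len(SALTS)]
    return f"{salt}{timestamp:010d}:{file_id}"


def range_scan(sorted_keys, start, stop):
    """Return keys with start <= key < stop from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def scan_all_salts(sorted_keys, start_ts, stop_ts):
    """Run one range scan per salt letter, then merge the results by timestamp."""
    per_salt = []
    for salt in SALTS:
        rows = range_scan(
            sorted_keys, f"{salt}{start_ts:010d}", f"{salt}{stop_ts:010d}"
        )
        # Drop the salt letter so the merged output is ordered by timestamp.
        per_salt.append([row[1:] for row in rows])
    return list(heapq.merge(*per_salt))


files = [(100, "a.txt"), (250, "b.txt"), (900, "c.txt"), (120, "d.txt")]
keys = sorted(salted_key(ts, name) for ts, name in files)

stale = scan_all_salts(keys, 0, 300)
print(stale)
```

Each salt prefix needs its own start and stop key, which is why the number of scans grows with the number of salt values, and the per-prefix results have to be merged before they are in timestamp order again.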
>>>>
>>>> 2012/6/16, Michel Segel <[EMAIL PROTECTED]>:
>>>>> You can't salt the key in the second table.
>>>>> By salting the key, you lose the ability to do range scans, which is
>>>>> what you want to do.
>>>>>
>>>>>
>>>>>
>>>>> Sent from a remote device. Please excuse any typos...
>>>>>
>>>>> Mike Segel
>>>>>
>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> Thanks all for your comments and suggestions. Regarding the
>>>>>> hotspotting I will try to salt the key in the 2nd table and see the
>>>>>> results.
>>>>>>
>>>>>> Yesterday I finished installing my 4-server cluster on old
>>>>>> machines.
>>>>>> It's slow, but it's working. So I will do some testing.
>>>>>>
>>>>>> You are recommending modifying the timestamp to be to the second or
>>>>>> minute and having more entries per row. Is that because it's better to
>>>>>> have more columns than rows? Or is it more because that will allow
>>>>>> a more "squared" pattern (lots of rows, lots of columns) which if

Jean-Marc Spaggiari 2012-06-22, 19:43
Jean-Marc Spaggiari 2012-06-23, 02:20
Jean-Daniel Cryans 2012-06-26, 17:50
Jean-Marc Spaggiari 2012-06-26, 17:56
Jean-Daniel Cryans 2012-06-26, 18:12
Michael Segel 2012-06-26, 19:01
Jean-Marc Spaggiari 2012-06-26, 19:04
Doug Meil 2012-06-14, 21:18