Mark 2011-11-18, 21:29
Michel Segel 2011-11-20, 08:57
Mark 2011-11-20, 17:54
lars hofhansl 2011-11-20, 19:30
Mark 2011-11-21, 01:18
Amandeep Khurana 2011-11-21, 01:36
I think you've gotten a bit more of an explanation...
The reason I say 'It depends...' is that there are arguments for either design.
If your log events are going to be accessed independently by type... Meaning that you're going to process only a single type of event at a time, then it makes sense to separate the data. Note I'm talking about your primary access path.
At the same time, it was pointed out that if you're not going to be accessing the log events one at a time, you may actually want a hybrid approach where you keep your index in HBase but store your event logs in a sequence file.
And again, it all depends on what you want to do with the data. That's why you can't always say ... 'if y then do x...'
There are other issues too. How will the data end up sitting in the table? Sure, this is more of an issue of schema/key design, but it will also have an impact on your system's performance.
In terms of get() performance HBase scales linearly. In terms of scans, it doesn't.
So there's a lot to think about...
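To make the hybrid approach mentioned above a little more concrete, here is a minimal sketch. It is not HBase or Hadoop code: the `index` dict is a hypothetical stand-in for an HBase index table, the in-memory buffer is a stand-in for a sequence file, and the key format and helper names are made up for illustration. The idea is only that the index maps a row key to the offset of the full event record in a flat file.

```python
import io
import struct


def append_event(log, payload: bytes) -> int:
    """Append a length-prefixed record and return the offset where it starts."""
    offset = log.seek(0, io.SEEK_END)
    log.write(struct.pack(">I", len(payload)))  # 4-byte big-endian length
    log.write(payload)
    return offset


def read_event(log, offset: int) -> bytes:
    """Read back the record that starts at the given offset."""
    log.seek(offset)
    (length,) = struct.unpack(">I", log.read(4))
    return log.read(length)


# Stand-ins (assumptions): a BytesIO for the event file, a dict for the index.
log = io.BytesIO()
index = {}

events = [
    ("search|0001", b"q=hbase"),
    ("login|0002", b"user=mark"),
]
for key, payload in events:
    index[key] = append_event(log, payload)

# A random lookup goes through the index, then seeks into the flat file.
record = read_event(log, index["login|0002"])
```

Bulk processing would read the flat file sequentially and never touch the index; only point lookups pay for the extra hop.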
Sent from a remote device. Please excuse any typos...
On Nov 20, 2011, at 7:36 PM, Amandeep Khurana <[EMAIL PROTECTED]> wrote:
> This is an interesting discussion and like Michel said - the answer to your
> question depends on what you are trying to achieve. However, here are the
> points that I would think about:
> What are the access patterns of the various buckets of data that you want to
> put in HBase? For instance, would the SearchLog and PageViewLog tables be
> accessed together all the time? Would they be primarily scanned or just
> random lookups? What are the cache requirements? Are both going to be
> equally read and written? Ideally, you want to store data with separate
> access patterns in separate tables.
> Then, what kind of schema are you looking at? When I say schema, I mean
> keys and column families. Now, if you concatenate the three tables you
> mentioned and let's say your keys are prefixed with the type of data:
> you will be using some servers more than others for different parts of the
> data. In theory, that should not happen but in most practical scenarios
> when splitting happens, regions tend to stick together. There are ways to
> work around that as well.
> Like Lars said, it's okay to have multiple tables. But you don't want to
> end up with 100s of tables. You ideally want to optimize for the number of
> tables depending on the access patterns.
> Again, this discussion will be kind of abstract without a specific example.
> On Fri, Nov 18, 2011 at 1:29 PM, Mark <[EMAIL PROTECTED]> wrote:
>> Is it better to have many smaller tables or one larger table? For example,
>> if we wanted to store user action logs we could do either of the following:
>> Multiple tables:
>> - SearchLog
>> - PageViewLog
>> - LoginLog
>> One table:
>> - ActionLog where the key could be a concatenation of the action type, i.e.
>> (search, pageview, login)
>> Any ideas? Are there any performance considerations on having multiple
>> smaller tables?
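
For readers following the single-table option, here is a rough sketch of what a type-prefixed key implies. HBase keeps rows sorted by row key, so all rows that share a prefix sit next to each other, and reading one action type becomes a contiguous range scan. The key layout (`action|timestamp|user`), the `row_key` and `scan_prefix` helpers, and the sorted Python list standing in for the table are all assumptions for illustration, not something proposed in the thread.

```python
from bisect import bisect_left


def row_key(action: str, ts: int, user: str) -> str:
    """Hypothetical key layout: action type first, then a zero-padded timestamp."""
    return f"{action}|{ts:010d}|{user}"


def scan_prefix(sorted_keys, prefix):
    """Return every key that starts with prefix, using the sort order only."""
    start = bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after any character used in these keys (assumption).
    stop = bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:stop]


# A sorted list stands in for the ActionLog table.
table = sorted([
    row_key("search", 1321651740, "mark"),
    row_key("login", 1321651700, "mark"),
    row_key("pageview", 1321651800, "ian"),
    row_key("login", 1321651900, "ian"),
    row_key("search", 1321652000, "lars"),
])

logins = scan_prefix(table, "login|")
```

The same contiguity is what the reply above warns about: because each action type occupies one unbroken key range, reads and writes for a busy type concentrate on whichever regions hold that range.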
Mark 2011-11-21, 15:43
Michael Segel 2011-11-21, 16:13
Ian Varley 2011-11-21, 16:21
Michael Segel 2011-11-21, 17:04
Ian Varley 2011-11-21, 17:11