The key question to ask about this use case is the access pattern. Do you need real-time access to new information as it is created? (I.e. if someone reads an article, do your queries need to immediately reflect that?) If not, and a batch approach is fine (say, nightly processing) then Hadoop is a good first step; it can map/reduce over the logs and aggregate / index the data, either creating static reports like you say below for all the users & pages, or inserting the data into a database.
If, on the other hand, you need real-time ingest & reporting, that's where HBase would be a better fit. You could write the code so that every "log" you write is actually passed to an API that inserts the data into one or more HBase tables in real-time.
The trick is that since HBase doesn't have built-in secondary indexing, you'd have to write the data in ways that could provide low latency responses to the queries below. That likely means denormalizing the data; in your case, probably one table that's keyed by page & time ("all users that have been to the webpage X in the last N days"), another that's keyed by user & time ("all the pages seen by a given user"), etc. So in other words, every action would result in multiple writes into HBase. (This is the same thing that happens in a relational database--an "index" means you're writing the data in two places--but there, it's hidden from you and maintained transparently). But, once you get that working, you can have very fast access to real time data for any of those queries.
Note: that's a lot of work. If you don't really need real-time ingest and query, Hadoop queries over the logs are much simpler (especially with tools like Hive, where you can write real SQL statements and it automatically translates them into Hadoop map/reduce jobs).
Also, 10TB isn't outside of the range that a traditional database could handle (given the right HW, schema design & indexing). You may find it is simpler to model your problem that way, either using Hadoop as a bridge between the raw log data and the database (if offline is OK) or inserting directly. The key benefit of going the Hadoop/HBase route is horizontal scalability, meaning that even if you don't know your eventual size target, you can be confident that you can scale linearly by adding hardware. That's critical if you're Google or Facebook, but not as frequently required for smaller businesses. Don't over-engineer ... :)
On Nov 8, 2012, at 3:00 AM, Nick maillard wrote:
I'm currently testing Hbase/Hadoop in terms of performance but also in terms off
applicability. After some tries, and reads I'm wondering If Hbase is well fitted
for the current need I'm testing.
Say I had logs on websites listing users going to webpage, reading an article,
liking a piece of data, commenting or even bookmarking.
I would store these logs on a long period and for a lot of different websites
and I would like to use the data with these questions:
- All users that have been to the webpage X in the last Ndays
- All users that have liked and then bookmarked a page in a range of Y days.
- All the pages that are commented X times in the last N days.
- All users that have commented a page W and liked a page P.
- All pages seen,liked or commented by a given user.
As you see this might a very SQL way of thinking. The way I understand the
questions being different in nature I would have different tables to answer them.
Am I correct? How could this be represented and would sql be a better fit?
The data would be large around a 10 Tbytes.