Re: Schema Design Question
lars hofhansl 2013-05-01, 05:35
HBase is generally good at homing in on a small (maybe 10-100m rows) contiguous subset of an essentially unlimited dataset.
If all you ever do is scan _everything_ and then throw it away, a straight scan (using Impala for example) or direct M/R on file(s) in HDFS is far better.
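
To illustrate what reading a small contiguous subset means in practice, here is a minimal sketch in plain Python. It does not use the HBase client. It assumes row keys of the hypothetical form `<batchId>|<rowId>` (the separator and key layout are not from this thread) and models the table as a sorted list of keys, which is roughly how a range scan with a start and stop key behaves.

```python
from bisect import bisect_left


def prefix_stop_row(prefix: bytes) -> bytes:
    """Return the smallest key that sorts after every key starting with `prefix`.

    An empty result means "scan to the end of the table" (the prefix was
    empty or consisted only of 0xFF bytes).
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    # Increment the last byte that is not 0xFF and drop everything after it.
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def range_scan(sorted_keys, start: bytes, stop: bytes):
    """Return the keys in [start, stop) from a lexicographically sorted list."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop) if stop else len(sorted_keys)
    return sorted_keys[lo:hi]


# Hypothetical keys; only the rows of one batch are touched by the scan.
keys = sorted([b"batch1|a", b"batch1|b", b"batch2|a", b"batch3|a"])
start = b"batch2|"
hits = range_scan(keys, start, prefix_stop_row(start))
```

The point of the sketch is that the cost depends on the size of the selected range, not on the size of the whole key space. If every job reads all of its data once and then discards it, that advantage does not come into play.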
From: Michel Segel <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Monday, April 29, 2013 6:52 AM
Subject: Re: Schema Design Question
I would have to agree.
The use case doesn't make much sense for HBase and sounds a bit more like a problem for Hive.
The OP indicated that the data was disposable after a round of processing.
IMHO Hive is a better fit.
On Apr 29, 2013, at 12:46 AM, Asaf Mesika <[EMAIL PROTECTED]> wrote:
> I actually don't see the benefit of saving the data into HBase if all you
> do is read per job id and purge it. Why not accumulate into HDFS per job
> id and then dump the file? The way I see it, HBase is good for querying
> parts of your data, even if it is only 10 rows. In your case your average
> is 1 billion, so streaming it from HDFS seems faster.
> On Saturday, April 27, 2013, Enis Söztutar wrote:
>> Interesting use case. I think it depends on how many jobIds you expect to
>> have. If it is on the order of thousands, I would caution against going the
>> one table per jobid approach, since for every table, there is some master
>> overhead, as well as file structures in hdfs. If the jobIds are manageable,
>> going with separate tables makes sense if you want to efficiently delete
>> all the data related to a job.
>> Also pre-splitting will depend on expected number of jobIds / batchIds and
>> their ranges vs desired number of regions. You would want to keep number of
>> regions hosted by a single region server in the low tens, thus, your splits
>> can be across jobs or within jobs depending on cardinality. Can you share
>> some more?
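
A minimal sketch of the pre-splitting arithmetic described above, in plain Python and without the HBase client. It assumes numeric batch ids that are evenly distributed and encoded as zero-padded decimal row-key prefixes. The function names, the key width, and the example numbers are all hypothetical.

```python
def split_points(min_id: int, max_id: int, num_regions: int, width: int = 10):
    """Return up to num_regions - 1 split keys dividing [min_id, max_id] evenly.

    Keys are zero-padded so that byte-wise ordering matches numeric ordering.
    """
    if num_regions < 1 or max_id < min_id:
        raise ValueError("need num_regions >= 1 and max_id >= min_id")
    span = max_id - min_id + 1
    points = set()
    for i in range(1, num_regions):
        boundary = min_id + (span * i) // num_regions
        # Skip boundaries that would leave the first region empty; the set
        # also removes duplicates when the span is smaller than num_regions.
        if boundary > min_id:
            points.add(str(boundary).zfill(width).encode("ascii"))
    return sorted(points)


def regions_per_server(total_regions: int, num_servers: int) -> int:
    """Ceiling of total_regions / num_servers, assuming an even balance."""
    return -(-total_regions // num_servers)


# Hypothetical sizing: ids 0..999 split into 4 regions, and 40 regions
# spread over 4 region servers.
splits = split_points(0, 999, 4)
load = regions_per_server(40, 4)
```

With skewed job sizes (10 rows versus 1 billion rows) evenly spaced boundaries would not give evenly sized regions, so a real split plan would need per-job row counts rather than id ranges alone.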
>>> My understanding of your use case is that data for different jobIds would
>>> be continuously loaded into the underlying table(s).
>>> Looks like you can have one table per job. This way you drop the table
>>> after map reduce is complete. In the single table approach, you would
>>> delete many rows in the table which is not as fast as dropping the
>>>> I am new to HBase. I have been trying to POC an application and have a
>>>> design question.
>>>> Currently we have a single table with the following key design
>>>> This is an offline processing system so data would be bulk loaded into
>>>> HBase via map/reduce jobs. We only need to support report generation
>>>> queries using map/reduce over a batch (And possibly a single column
>>>> with the batchId as the start/end scan key. Once we have finished
>>>> processing a job we are free to remove the data from HBase.
>>>> We have varied workloads so a job could be made up of 10 rows, 100,000
>>>> or 1 billion rows with the average falling somewhere around 10 million
>>>> My question is related to pre-splitting. If we have a billion rows all with
>>>> the same batchId (Our map/reduce scan key) my understanding is we
>>>> perform pre-splitting to create buckets hosted by different regions. If a
>>>> job's workload can be so varied would it make sense to have a single table
>>>> containing all jobs? Or should we create 1 table per job and pre-split
>>>> table for the given workload? If we had separate table we could drop