HBase, mail # user - Schema Design Question


Re: Schema Design Question
Ted Yu 2013-04-26, 21:34
My understanding of your use case is that data for different jobIds would
be continuously loaded into the underlying table(s).

It looks like you can use one table per job. That way you can drop the table
once the map/reduce job is complete. With the single-table approach, you would
have to delete many rows from the table, which is not as fast as dropping a
separate table.
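
If you go with one table per job, you could size the pre-split from the job's
expected row count. Below is a rough Python sketch of that idea, not a tested
recipe. The function names, the 10-million-rows-per-region target, the region
cap, and the assumption that bundleId is a zero-padded decimal number spread
evenly over a known range are all hypothetical; adjust them to your actual key
format and cluster.

```python
def region_count(expected_rows, rows_per_region=10_000_000, max_regions=200):
    """Number of regions to pre-create for a job of the given size.

    rows_per_region and max_regions are placeholder values, not HBase defaults.
    """
    needed = -(-expected_rows // rows_per_region)  # ceiling division
    return max(1, min(needed, max_regions))


def split_keys(job_id, batch_id, num_bundles, regions, width=6):
    """Row keys at which to split so bundles spread evenly across regions.

    Assumes keys look like jobId_batchId_bundleId_uniquefileId and that
    bundleId is a decimal number in [0, num_bundles) zero-padded to `width`.
    """
    prefix = f"{job_id}_{batch_id}_"
    keys = []
    for i in range(1, regions):
        bundle = i * num_bundles // regions
        key = f"{prefix}{bundle:0{width}d}"
        # Skip empty or duplicate boundaries when bundles are fewer than regions.
        if bundle > 0 and (not keys or key != keys[-1]):
            keys.append(key)
    return keys


if __name__ == "__main__":
    n = region_count(1_000_000_000)
    print(n, split_keys("job7", "b1", num_bundles=1000, regions=4))
```

With these placeholder numbers a 10-row job and a 10-million-row job both get a
single region (no split keys), so only the large jobs pay for pre-splitting.
The resulting keys would be passed as the split keys when creating the table.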

Cheers

On Sat, Apr 27, 2013 at 3:49 AM, Cameron Gandevia <[EMAIL PROTECTED]> wrote:

> Hi
>
> I am new to HBase. I have been trying to build a POC of an application and
> have a design question.
>
> Currently we have a single table with the following key design:
>
> jobId_batchId_bundleId_uniquefileId
>
> This is an offline processing system, so data would be bulk loaded into
> HBase via map/reduce jobs. We only need to support report generation
> queries using map/reduce over a batch (and possibly a single column filter),
> with the batchId as the start/end scan key. Once we have finished
> processing a job we are free to remove the data from HBase.
>
> We have varied workloads, so a job could be made up of 10 rows, 100,000 rows
> or 1 billion rows, with the average falling somewhere around 10 million
> rows.
>
> My question is related to pre-splitting. If we have a billion rows all with
> the same batchId (our map/reduce scan key), my understanding is we should
> perform pre-splitting to create buckets hosted by different regions. If a
> job's workload can be so varied, would it make sense to have a single table
> containing all jobs? Or should we create one table per job and pre-split the
> table for the given workload? If we had separate tables we could drop them
> when no longer needed.
>
> If we didn't have a separate table per job, how should we perform splitting?
> Should we choose our largest possible workload and split for that, even
> though 90% of our jobs would fall in the lower bound in terms of row count?
> Would we experience any issues purging jobs of varying sizes if everything
> was in a single table?
>
> Any advice would be greatly appreciated.
>
> Thanks
>
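
For the batch scan described in the quoted message, the start and stop rows can
be derived from the jobId_batchId prefix. The Python sketch below only
illustrates the byte arithmetic; the function name is hypothetical, and it
assumes the key fields are joined with "_" exactly as in the key design above.
HBase scans treat the start row as inclusive and the stop row as exclusive, so
incrementing the last byte of the prefix gives a stop row that covers every key
in the batch and nothing after it.

```python
def batch_scan_range(job_id, batch_id):
    """Return (start_row, stop_row) covering all keys of one batch.

    The trailing "_" is kept in the prefix so that batch "b1" does not
    also match batch "b10".
    """
    prefix = f"{job_id}_{batch_id}_".encode("utf-8")
    # The last byte is always "_" (0x5f), so adding 1 cannot overflow.
    stop = prefix[:-1] + bytes([prefix[-1] + 1])
    return prefix, stop


if __name__ == "__main__":
    start, stop = batch_scan_range("job7", "b1")
    print(start, stop)
```

The two byte strings would be used as the scan's start and stop rows in the
map/reduce job, whether the data lives in one shared table or in a per-job
table.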