Re: Schema Design Question
I would have to agree.
The use case doesn't make much sense for HBase and sounds a bit more like a problem for Hive.

The OP indicated that the data was disposable after a round of processing.
IMHO Hive is a better fit.
Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 29, 2013, at 12:46 AM, Asaf Mesika <[EMAIL PROTECTED]> wrote:

> I actually don't see the benefit of saving the data into HBase if all you
> do is read it per job id and then purge it. Why not accumulate into HDFS per
> job id and then dump the file? The way I see it, HBase is good for querying
> parts of your data, even if it is only 10 rows. In your case your average
> is 1 billion, so streaming it from HDFS seems faster.
> On Saturday, April 27, 2013, Enis Söztutar wrote:
>> Hi,
>> Interesting use case. I think it depends on how many jobIds you expect to
>> have. If it is on the order of thousands, I would caution against going the
>> one table per jobId approach, since for every table, there is some master
>> overhead, as well as file structures in HDFS. If the number of jobIds is
>> manageable, going with separate tables makes sense if you want to
>> efficiently delete all the data related to a job.
>> Also, pre-splitting will depend on the expected number of jobIds / batchIds
>> and their ranges versus the desired number of regions. You would want to
>> keep the number of regions hosted by a single region server in the low
>> tens, so your splits can be across jobs or within jobs depending on
>> cardinality. Can you share some more details?
>> Enis
>> On Fri, Apr 26, 2013 at 2:34 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>> My understanding of your use case is that data for different jobIds would
>>> be continuously loaded into the underlying table(s).
>>> Looks like you can have one table per job. This way you drop the table
>>> after map reduce is complete. In the single table approach, you would
>>> delete many rows in the table, which is not as fast as dropping the
>>> separate table.
>>> Cheers
>>> On Sat, Apr 27, 2013 at 3:49 AM, Cameron Gandevia <[EMAIL PROTECTED]> wrote:
>>>> Hi
>>>> I am new to HBase. I have been trying to POC an application and have a
>>>> design question.
>>>> Currently we have a single table with the following key design
>>>> jobId_batchId_bundleId_uniquefileId
>>>> This is an offline processing system, so data would be bulk loaded into
>>>> HBase via map/reduce jobs. We only need to support report generation
>>>> queries using map/reduce over a batch (and possibly a single column
>>>> filter) with the batchId as the start/end scan key. Once we have finished
>>>> processing a job we are free to remove the data from HBase.
>>>> We have varied workloads, so a job could be made up of 10 rows, 100,000
>>>> rows or 1 billion rows, with the average falling somewhere around
>>>> 10 million rows.
>>>> My question is related to pre-splitting. If we have a billion rows all
>>>> with the same batchId (our map/reduce scan key), my understanding is we
>>>> should perform pre-splitting to create buckets hosted by different
>>>> regions. If a job's workload can be so varied, would it make sense to
>>>> have a single table containing all jobs? Or should we create one table
>>>> per job and pre-split the table for the given workload? If we had
>>>> separate tables we could drop them when no longer needed.
>>>> If we didn't have a separate table per job, how should we perform
>>>> splitting? Should we choose our largest possible workload and split for
>>>> that, even though 90% of our jobs would fall in the lower bound in terms
>>>> of row count? Would we experience any issues purging jobs of varying
>>>> sizes if everything was in a single table?
>>>> Any advice would be greatly appreciated.
>>>> Thanks
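
For readers following the pre-splitting question, here is a rough sketch in plain Python (no HBase client involved) of how the jobId_batchId_bundleId_uniquefileId key from the original post maps to a batch scan range and to candidate split points. All function names are hypothetical. The split helper assumes bundle ids are integers written zero-padded to a fixed width, which the thread does not state, and the rows-per-region target is an illustrative number rather than a recommendation. It also assumes ids never contain the "_" delimiter.

```python
import math

DELIM = "_"  # delimiter used in the key design quoted in the thread


def row_key(job_id, batch_id, bundle_id, file_id):
    """Build a key in the jobId_batchId_bundleId_uniquefileId layout."""
    return DELIM.join([job_id, batch_id, bundle_id, file_id]).encode("utf-8")


def batch_scan_range(job_id, batch_id):
    """Return (start_row, stop_row) covering every key of one batch.

    The stop row is the prefix with its last byte incremented, a common
    way to turn a key prefix into an exclusive upper bound.
    """
    prefix = (job_id + DELIM + batch_id + DELIM).encode("utf-8")
    stop = prefix[:-1] + bytes([prefix[-1] + 1])
    return prefix, stop


def split_points(job_id, batch_id, max_bundle_id, num_regions, width=8):
    """Evenly spaced split keys inside one batch.

    ASSUMPTION: bundle ids are integers in [0, max_bundle_id), zero-padded
    to `width` digits so that byte order matches numeric order.
    """
    prefix = job_id + DELIM + batch_id + DELIM
    step = max_bundle_id / num_regions
    return [
        (prefix + str(int(i * step)).zfill(width)).encode("utf-8")
        for i in range(1, num_regions)
    ]


def regions_for(row_count, rows_per_region):
    """Hypothetical sizing rule: ceil(rows / target), never less than one."""
    return max(1, math.ceil(row_count / rows_per_region))
```

With a target of 10 million rows per region (an assumed figure), `regions_for` gives 1 region for a 10-row job and 100 for a billion-row job, which illustrates why a single split plan sized for the largest workload fits the small jobs poorly.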