Re: Benchmarking and improvement of HBase's performance for a common bulk data workload
Thanks for thinking about ways to optimize such a workload.
You can start with the following when setting up your cluster:
For transactions, HBase works differently from PostgreSQL. See:
On Sat, Apr 27, 2013 at 1:20 PM, Atri Sharma <[EMAIL PROTECTED]> wrote:
> Hi all,
> I have been discussing with Priyank sir on the following style of
> workload and whether we can improve HBase's performance in this area.
> The use case is as follows:
> 1) Bulk load data.
> 2) Query the data multiple times (read access mostly, and no real time
> This is a common workload, and I am pretty interested in benchmarking
> HBase's performance in this area, as well as improving it further.
> Please advise me on how I can proceed with benchmarking. Specifically,
> how should I set up an HBase cluster, and are there any specific
> requirements of the cluster for this type of testing?
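Before touching a real cluster, it may help to pin down the shape of the measurement. The following is only a rough sketch of the load-once, read-many workload described above: `InMemoryStore` and `run_workload` are hypothetical names, and the store is a plain dictionary standing in for whatever client you end up using. It is not an HBase API.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for a real table client; NOT an HBase API."""

    def __init__(self):
        self._rows = {}

    def bulk_load(self, rows):
        # Load every row in one call, mirroring step 1 of the workload.
        self._rows.update(rows)

    def get(self, key):
        return self._rows.get(key)


def run_workload(store, rows, read_passes):
    """Time one bulk load, then time each of `read_passes` full read passes."""
    start = time.perf_counter()
    store.bulk_load(rows)
    load_seconds = time.perf_counter() - start

    read_seconds = []
    hits = 0
    for _ in range(read_passes):
        start = time.perf_counter()
        for key in rows:
            if store.get(key) is not None:
                hits += 1
        read_seconds.append(time.perf_counter() - start)

    return {"load_seconds": load_seconds, "read_seconds": read_seconds, "hits": hits}


# Illustrative data set; the size and key format are arbitrary assumptions.
rows = {f"row-{i:06d}": f"value-{i}" for i in range(10_000)}
result = run_workload(InMemoryStore(), rows, read_passes=3)
print(result["load_seconds"], result["read_seconds"], result["hits"])
```

Keeping the load time and the per-pass read times separate makes it easier to see whether a change helps the write path, the read path, or both.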
> I worked on a patch to improve performance for a similar use case in
> PostgreSQL. The case is pretty similar: bulk load of data, a large
> number of mostly read-only queries, and then deletion of the data.
> The optimization I targeted was the cost of writes to disk.
> Specifically, setting flags (hint bits) to track the commit
> status of the inserting/deleting transaction was causing a write overhead.
> I tried to mitigate this by adding a cache that holds the transaction
> ID for the workload described above, thereby avoiding the cost
> of those writes.
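To make sure I follow the idea, here is a toy model of it. This is not PostgreSQL's actual code or data layout: `CommitLog`, `Row`, and both scan functions are hypothetical names, and a "write" is simply counted whenever a per-row flag is set. It only illustrates why remembering a committed transaction ID in a cache can replace per-row flag updates when one transaction loaded all the rows.

```python
class CommitLog:
    """Stand-in for a transaction status lookup; counts how often it is consulted."""

    def __init__(self, committed):
        self.committed = set(committed)
        self.lookups = 0

    def is_committed(self, xid):
        self.lookups += 1
        return xid in self.committed


class Row:
    def __init__(self, xid):
        self.xid = xid                # ID of the inserting transaction
        self.hint_committed = False   # per-row flag (the "hint bit" in this model)


def scan_with_hint_bits(rows, log):
    """Set the per-row flag on first sight; each flag set is counted as a write."""
    visible = 0
    writes = 0
    for row in rows:
        if not row.hint_committed and log.is_committed(row.xid):
            row.hint_committed = True
            writes += 1
        if row.hint_committed:
            visible += 1
    return visible, writes


def scan_with_xid_cache(rows, log, cache):
    """Remember committed transaction IDs in a cache and leave the rows untouched."""
    visible = 0
    for row in rows:
        if row.xid not in cache and log.is_committed(row.xid):
            cache.add(row.xid)
        if row.xid in cache:
            visible += 1
    return visible, 0  # this path never modifies a row


# Bulk load modelled as 1000 rows inserted by a single transaction (ID 42).
log_a = CommitLog(committed=[42])
hint_result = scan_with_hint_bits([Row(42) for _ in range(1000)], log_a)

log_b = CommitLog(committed=[42])
cache_result = scan_with_xid_cache([Row(42) for _ in range(1000)], log_b, set())

print(hint_result, log_a.lookups)
print(cache_result, log_b.lookups)
```

In this model the first scan with per-row flags performs one write per row, while the cached variant performs none and consults the status lookup once per distinct transaction ID.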
> I will start benchmarking once I have the system set up and then start
> thinking of tests. Once I have an outline in my mind, I shall post it
> on the list.
> I will need a lot of guidance from the community on this.
> Thoughts/Comments/Advice please?