HBase, mail # user - Insert into tall table 50% faster than wide table


Re: Insert into tall table 50% faster than wide table
Bryan Keller 2010-12-23, 22:28
I revised the test so that it creates a single Put for each customer. Previously I was creating a separate Put for each order, even if the order was for the same customer. I submit batches of Puts using HTable.put(List<Put>).
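Roughly, the revised loop looks like this (the Customer/Order types and the family/qualifier names are just placeholders for illustration):

  List<Put> batch = new ArrayList<Put>();
  for (Customer c : customers) {
      Put put = new Put(Bytes.toBytes(c.getUuid()));   // one row, and one Put, per customer
      for (Order o : c.getOrders()) {
          // each order becomes its own column in the customer's row
          put.add(Bytes.toBytes("o"), Bytes.toBytes(o.getId()), o.toBytes());
      }
      batch.add(put);
  }
  table.put(batch);        // HTable.put(List<Put>) submits the whole batch
  table.flushCommits();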

Performance with both approaches was about the same. It doesn't appear as if row locks are an issue in my case, perhaps because the Puts for a customer's orders are mostly in the same List<Put>?

As to cluster setup, I am testing tall vs wide on the exact same cluster. Keys are all random UUIDs so I'm assuming I should get a good spread. Are there configuration options I should be looking at that could help wide table performance for inserts?

I was thinking about serializing the order data, but then I will run into versioning issues and such, and I am back to a tightly structured schema. That is why I liked storing the order fields in separate columns. Read performance seems to be very good; it is the writes that are slower.
On Dec 23, 2010, at 11:54 AM, Ryan Rawson wrote:

> Hi all,
>
> What does the region count look like between your tall and wide
> tables?  If you don't get a good spread of regions across your cluster
> you don't get full parallelism on all your hardware.
>
> The row lock thing is another thing to watch out for, concurrent puts
> will serialize on the row lock.
>
> -ryan
>
> On Thu, Dec 23, 2010 at 5:20 AM, Michael Segel
> <[EMAIL PROTECTED]> wrote:
>>
>> Uhm... just a couple of thoughts...
>>
>> For clarification... let's call Bryan's "order's columns" the detail of the order. "Columns of columns" is a bit confusing...
>>
>> It's becoming more apparent that schema design plays a large role in performance, and because performance depends on HBase's internals, it's very possible that it is tied to specific versions.
>> This means that as HBase evolves, those seeking optimum performance may have to periodically revisit their schema decisions.
>>
>> The first thing I'd recommend on the 'wide table' schema is to not store the individual order's fields as separate columns, but as part of the order itself. The main reason is that you will never fetch a single field of an order's detail by itself. A quick and cheap way of serializing the order detail is something Dick Pick did around 40 years ago: in the Pick databases (i.e. Revelation), a non-printable ASCII character was used as a column delimiter. You could use the '|' (pipe) character, but someone could point out that it's possible it could occur in the data; a non-printable character (char 254??) is less likely to be part of the data. This works well because when you want the order, you fetch it from HBase and then parse it with a string tokenizer. (Very fast and efficient.)
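>>
>> For illustration, a rough sketch of that idea (the delimiter choice and field names are just placeholders):
>>
>>   private static final char DELIM = (char) 254;   // non-printable delimiter, Pick-style
>>
>>   // Writing: concatenate the detail fields into one value
>>   String detail = orderDate + DELIM + sku + DELIM + quantity + DELIM + price;
>>   put.add(Bytes.toBytes("o"), Bytes.toBytes(orderId), Bytes.toBytes(detail));
>>
>>   // Reading: fetch the cell and split it back apart
>>   String stored = Bytes.toString(result.getValue(Bytes.toBytes("o"), Bytes.toBytes(orderId)));
>>   StringTokenizer fields = new StringTokenizer(stored, String.valueOf(DELIM));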
>>
>> This will make life easier in the long run...
>>
>> It will also have a positive impact on your code.
>> On each Mapper.map() iteration, or rather code iteration [see assumption below], you have your row_id and then one put for the column write (which contains the 10 detail items). Note: which has the higher cost, using a StringBuilder to concatenate your 10 detail items and then converting the result to bytes, or doing 10 separate put()s?
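>>
>> Concretely, the two write paths being compared look something like this (FAMILY and items[] are hypothetical):
>>
>>   // (a) one column holding the delimited detail
>>   StringBuilder sb = new StringBuilder();
>>   for (int i = 0; i < items.length; i++) {
>>       if (i > 0) sb.append(DELIM);
>>       sb.append(items[i]);
>>   }
>>   put.add(FAMILY, Bytes.toBytes("detail"), Bytes.toBytes(sb.toString()));
>>
>>   // (b) ten separate columns, one add() per detail item
>>   for (int i = 0; i < items.length; i++) {
>>       put.add(FAMILY, Bytes.toBytes("item" + i), Bytes.toBytes(items[i]));
>>   }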
>>
>> Note the following: the discussion above is aimed at maximum performance gains. There will be code improvements as well, but they will be relatively modest compared to the other potential gains.
>>
>> Assumption(s):
>> Bryan is attempting to create a simulation with 10K customers and 600 orders each (10 items per order). This is a performance test.
>> This probably isn't a map/reduce program but a single client doing the inserts. Since it's a relative performance comparison, it is easier to do as a single program rather than a distributed one. It could be map/reduce if Bryan pre-builds the list of customer orders before starting the job... Or it could be a multi-threaded client where each thread reads from the pre-built list and performs an insert.
>>
>> If the assumption is true, then Bryan is going to randomly pick a customer id, create an order, and insert the order into HBase. (Randomly pick a number between 1 and N, where N is the number of customers who haven't yet placed 600 orders; track each customer's order count and remove a customer from the list once they reach 600.)
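>>
>> In code, that selection loop might look something like this (insertOrder() is a stand-in for building and putting one order):
>>
>>   Random rand = new Random();
>>   List<String> eligible = new ArrayList<String>(allCustomerIds);  // customers with < 600 orders
>>   Map<String, Integer> counts = new HashMap<String, Integer>();
>>   while (!eligible.isEmpty()) {
>>       int idx = rand.nextInt(eligible.size());    // pick between 1 and N (zero-based here)
>>       String custId = eligible.get(idx);
>>       insertOrder(custId);                        // create one order and Put it into HBase
>>       Integer c = counts.get(custId);
>>       int n = (c == null) ? 1 : c + 1;
>>       counts.put(custId, n);
>>       if (n == 600) eligible.remove(idx);         // drop customers that reach 600 orders
>>   }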