Accumulo >> mail # user >> Efficient Tablet Merging [SEC=UNOFFICIAL]

Dickson, Matt MR 2013-10-02, 03:58
Eric Newton 2013-10-02, 13:05
Dickson, Matt MR 2013-10-02, 22:35
Eric Newton 2013-10-03, 03:28
Adam Fuchs 2013-10-03, 12:07
RE: Efficient Tablet Merging [SEC=UNOFFICIAL]

Hi Eric,

We have gone with the second, more conservative option. We changed our split threshold to 10GB and then ran a merge over a week's worth of tablets, which resulted in one tablet with a massive number of files. We then ran a query over that range, and it returns a message saying:

Tablet has too many files (3n;20130914;20130907...) retrying...

We assumed that once the merge finished, a major compaction would start, notice that the tablet is too large, and split it into 10GB tablets. We also assumed that we would not have to start a compaction manually; instead, one would be scheduled at some point after the merge finished.

We have completed three separate merges of week-long ranges and have now identified three tablet extents with too many files.

Can you please explain what is supposed to happen? After the merge, does the compact command need to be run for those ranges, or will compaction happen automatically? We have not seen any compactions start.


From: Eric Newton [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 3 October 2013 13:28
Subject: Re: Efficient Tablet Merging [SEC=UNOFFICIAL]

I'll use ASCII graphics to demonstrate the size of a tablet.

Small: []
Medium: [ ]
Large: [  ]

Think of it like this... if you are running age-off... you probably have lots of little buckets of rows at the beginning and larger buckets at the end:

[][][][][][][][][]...[ ][ ][ ][ ][ ][  ][  ][    ][    ][    ][    ][    ][    ]

What you probably want is something like this:

[               ][       ][       ][       ][       ][       ][       ][       ]

Some big bucket at the start, with old data, and some larger buckets for everything afterwards.  But... this would probably work:

[       ][       ][       ][       ][       ][       ][       ][       ][       ]

Just a bunch of larger tablets throughout.

So you need to set your merge size to "[      ]" (4G), and you can always keep creating smaller tablets for future rows with manual splits:

[       ][       ][       ][       ][       ][       ][       ][       ][       ][  ][  ][  ][  ][  ]

So increase the split threshold to 4G, and merge on 4G, but continue to make manual splits for your current days, as necessary.  Merge them away later.
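
The effect of merging by size, as drawn in the diagrams above, can be sketched with a toy model. This is not Accumulo's actual merge algorithm; it only assumes that adjacent tablets are combined greedily while the running total stays within the target size. The function name merge_by_size and the sample sizes are made up for illustration.

```python
from typing import List


def merge_by_size(sizes_gb: List[float], target_gb: float) -> List[float]:
    """Toy model (an assumption, not Accumulo's implementation):
    combine adjacent tablets while the combined size stays <= target_gb."""
    merged: List[float] = []
    current = 0.0   # size of the group being built
    count = 0       # number of tablets in that group
    for size in sizes_gb:
        # Close the current group if adding this tablet would exceed the target.
        if count > 0 and current + size > target_gb:
            merged.append(current)
            current, count = 0.0, 0
        current += size
        count += 1
    if count > 0:
        merged.append(current)
    return merged


# Hypothetical table: many small old tablets, larger recent ones.
sizes = [0.5] * 8 + [1.0] * 4 + [4.0, 4.0]
result = merge_by_size(sizes, target_gb=4.0)
print(len(sizes), "tablets ->", len(result), "tablets:", result)
```

In this model the fourteen uneven tablets collapse into four tablets of 4G each, which matches the "bunch of larger tablets throughout" picture; tablets that are already at the target size are left alone.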
On Wed, Oct 2, 2013 at 6:35 PM, Dickson, Matt MR <[EMAIL PROTECTED]> wrote:


Thanks Eric,

If I do the merge with a size of 4G, does the split threshold also need to be increased to 4G?

From: Eric Newton [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 2 October 2013 23:05
Subject: Re: Efficient Tablet Merging [SEC=UNOFFICIAL]

The most efficient way is kind of scary.  If this is a production system, I would not recommend it.

First, find out the size of your 10x tablets.  Let's say it's 10G.  Set your split threshold to 10G.  Then merge all old tablets.... all of them into one tablet.  This will dump thousands of files into a single tablet, but it will soon split out again into the nice 10G tablets you are looking for.  The system will probably be unusable during this operation.

The more conservative way is to specify the merge in single steps (the master will only coordinate a single merge on a table at a time anyhow).  You can do it by range or by size... I would do it by size, especially if you are aging off your old data.

Compacting the data won't have any effect on the speed of the merge.


On Tue, Oct 1, 2013 at 11:58 PM, Dickson, Matt MR <[EMAIL PROTECTED]> wrote:


I have a table for which we create splits of the form yyyymmdd-nnnn, where nnnn ranges from 0000 to 0840.  The bulk of our data is loaded for the current date, and no data is loaded for days more than 3 days old, so from my understanding it would be wise to merge splits older than 3 days in order to reduce the overall tablet count.  It would still be optimal to maintain some distribution of each day's tablets across the cluster, so I'm looking at merging splits in groups of 10, e.g. merge -b 20130901-0000 -e 20130901-0009, thereby reducing 840 splits per day to 84.
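
As a rough sketch, the per-day merge ranges described above could be generated with a small script. It assumes 840 splits per day numbered 0000 through 0839 (whether the real table ends at 0839 or 0840 would need checking), and it only builds command strings in the same form as the example in this message; the helper name merge_commands is hypothetical.

```python
from typing import List


def merge_commands(day: str, num_splits: int = 840, step: int = 10) -> List[str]:
    """Build one 'merge -b ... -e ...' command string per group of `step` splits.

    Assumes split suffixes run from 0000 to num_splits - 1 (an assumption).
    """
    commands = []
    for start in range(0, num_splits, step):
        # Last suffix in this group, capped at the final split of the day.
        end = min(start + step - 1, num_splits - 1)
        commands.append(f"merge -b {day}-{start:04d} -e {day}-{end:04d}")
    return commands


cmds = merge_commands("20130901")
print(len(cmds))   # number of merge commands for the day
print(cmds[0])
print(cmds[-1])
```

With the defaults this yields 84 commands for one day, the first covering 0000 to 0009 and the last covering 0830 to 0839. The strings are only printed here; running them in the Accumulo shell is left to the operator.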

Currently we have 120K tablets (size 1G) on a cluster of 56 nodes, and our ingest has slowed as the data quantity and tablet count have grown.  Initially we were achieving 200-300K, now 50-100K.
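
To put those numbers in perspective, here is a back-of-the-envelope calculation of tablets per node. It assumes tablets are spread evenly across the 56 nodes and, as a best case, that every tablet takes part in a 10-to-1 merge; in practice only the older tablets would be merged, so the real reduction would be smaller.

```python
def tablets_per_node(tablet_count: int, node_count: int) -> float:
    """Average tablets hosted per node, assuming an even spread (an assumption)."""
    return tablet_count / node_count


current = tablets_per_node(120_000, 56)
# Best case: every group of 10 tablets is merged into one.
after_merge = tablets_per_node(120_000 // 10, 56)

print(f"now: about {current:.0f} tablets per node")
print(f"after a 10-to-1 merge of everything: about {after_merge:.0f} per node")
```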

My question is, what is the best way to do this merge?  Should we use the merge command with the size option set at something like 5G, or maybe use the compaction command?
Thanks in advance,
Matt Dickson
Eric Newton 2013-10-03, 13:51
Dickson, Matt MR 2013-10-04, 00:43
Eric Newton 2013-10-04, 01:20
Dickson, Matt MR 2013-10-04, 03:20
Eric Newton 2013-10-04, 03:27
Kristopher Kane 2013-10-04, 02:02
Dickson, Matt MR 2013-10-03, 03:45