Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
I'll use ASCII graphics to demonstrate the size of a tablet.

Small: []
Medium: [ ]
Large: [  ]

Think of it like this... if you are running age-off... you probably have
lots of little buckets of rows at the beginning and larger buckets at the
end:

[][][][][][][][][]...[ ][ ][ ][ ][ ][  ][  ][    ][    ][    ][    ][    ][    ]

What you probably want is something like this:

[               ][       ][       ][       ][       ][       ][       ][       ]

One big bucket at the start, holding the old data, and evenly-sized larger
buckets for everything after it.  But... this would probably work, too:

[       ][       ][       ][       ][       ][       ][       ][       ][       ]

Just a bunch of larger tablets throughout.

So you need to set your merge size to "[      ]" (4G), and you can always
keep creating smaller tablets for future rows with manual splits:

[       ][       ][       ][       ][       ][       ][       ][       ][       ][  ][  ][  ][  ][  ]
So increase the split threshold to 4G, and merge on 4G, but continue to
make manual splits for your current days, as necessary.  Merge them away
later.
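
In the shell that works out to something like this (just a sketch; "mytable"
and the split points are placeholders for your own table and current-day rows):

  config -t mytable -s table.split.threshold=4G
  merge -t mytable -s 4G
  addsplits -t mytable 20131002-0100 20131002-0200 20131002-0300

Raise the threshold first so the merged tablets don't immediately split back
apart, let the size-based merge chew through the old days, and keep adding
manual splits for today's rows; those get merged away later.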
-Eric
On Wed, Oct 2, 2013 at 6:35 PM, Dickson, Matt MR <[EMAIL PROTECTED]> wrote:

>
> *UNOFFICIAL*
> Thanks Eric,
>
> If I do the merge with size of 4G does the split threshold need to be
> increased to the 4G also?
>
>  ------------------------------
> *From:* Eric Newton [mailto:[EMAIL PROTECTED]]
> *Sent:* Wednesday, 2 October 2013 23:05
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
>
>  The most efficient way is kind of scary.  If this is a production
> system, I would not recommend it.
>
> First, find out how big your tablets would be after the 10-to-1 merge.  Let's
> say it's 10G.  Set your split threshold to 10G.  Then merge all the old
> tablets... all of them into one tablet.  This will dump thousands of files
> into a single tablet, but it will soon split out again into the nice 10G
> tablets you are looking for.  The system will probably be unusable during
> this operation.
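>
> In the shell that's roughly (a sketch only; "mytable" and the row range are
> placeholders for your table and the old days):
>
>   config -t mytable -s table.split.threshold=10G
>   merge -t mytable -b 20130101-0000 -e 20130929-0840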
>
> The more conservative way is to specify the merge in single steps (the
> master will only coordinate a single merge on a table at a time anyhow).
>  You can do it by range or by size... I would do it by size, especially if
> you are aging off your old data.
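>
> By size, that would be a single command along the lines of (again, just a
> sketch):
>
>   merge -t mytable -s 10G
>
> which merges away sections of small tablets up toward the 10G target, one
> section at a time.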
>
> Compacting the data won't have any effect on the speed of the merge.
>
> -Eric
>
>
>
> On Tue, Oct 1, 2013 at 11:58 PM, Dickson, Matt MR <[EMAIL PROTECTED]> wrote:
>
>>
>> *UNOFFICIAL*
>> I have a table for which we create splits of the form yyyymmdd-*nnnn*,
>> where nnnn ranges from 0000 to 0840.  The bulk of our data is loaded for
>> the current date, and no data is loaded for days older than 3 days, so
>> from my understanding it would be wise to merge the splits older than 3
>> days in order to reduce the overall tablet count.  It would still be
>> optimal to maintain some distribution of each day's tablets across the
>> cluster, so I'm looking at merging splits in increments of 10, e.g.
>> merge -b 20130901-0000 -e 20130901-0009, thereby reducing 840 splits per
>> day to 84.
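>>
>> For one day that would be something like the following (untested bash
>> sketch; the table name and credentials are placeholders):
>>
>>   for i in $(seq 0 10 830); do
>>     b=$(printf "20130901-%04d" $i)
>>     e=$(printf "20130901-%04d" $((i + 9)))
>>     accumulo shell -u user -p pass -e "merge -t mytable -b $b -e $e"
>>   done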
>>
>> Currently we have 120K tablets (size 1G) on a cluster of 56 nodes, and our
>> ingest has slowed as the data quantity and tablet count have grown.
>> Initially we were achieving 200-300K, now 50-100K.
>>
>> My question is, what is the best way to do this merge?  Should we use the
>> merge command with the size option set at something like 5G, or maybe use
>> the compaction command?
>>
>> From my tests this process could take some time so I'm keen to understand
>> the most efficient approach.
>>
>> Thanks in advance,
>> Matt Dickson
>>
>
>