Accumulo, mail # user - Efficient Tablet Merging [SEC=UNOFFICIAL]


Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
Adam Fuchs 2013-10-03, 12:07
Never underestimate the power of ascii art!

Adam
On Oct 2, 2013 11:28 PM, "Eric Newton" <[EMAIL PROTECTED]> wrote:

> I'll use ASCII graphics to demonstrate the size of a tablet.
>
> Small: []
> Medium: [ ]
> Large: [  ]
>
> Think of it like this... if you are running age-off... you probably have
> lots of little buckets of rows at the beginning and larger buckets at the
> end:
>
> [][][][][][][][][]...[ ][ ][ ][ ][ ][  ][  ][    ][    ][    ][    ][    ][    ]
>
> What you probably want is something like this:
>
> [               ][       ][       ][       ][       ][       ][       ][       ]
>
> Some big bucket at the start, with old data, and some larger buckets for
> everything afterwards.  But... this would probably work:
>
> [       ][       ][       ][       ][       ][       ][       ][       ][       ]
>
> Just a bunch of larger tablets throughout.
>
> So you need to set your merge size to "[      ]" (4G), and you can always
> keep creating smaller tablets for future rows with manual splits:
>
> [       ][       ][       ][       ][       ][       ][       ][       ][       ][  ][  ][  ][  ][  ]
>
>
> So increase the split threshold to 4G, and merge on 4G, but continue to
> make manual splits for your current days, as necessary.  Merge them away
> later.
>
>
> -Eric
>
>
>
>
> On Wed, Oct 2, 2013 at 6:35 PM, Dickson, Matt MR <
> [EMAIL PROTECTED]> wrote:
>
>> *UNOFFICIAL*
>> Thanks Eric,
>>
>> If I do the merge with size of 4G does the split threshold need to be
>> increased to the 4G also?
>>
>>  ------------------------------
>> *From:* Eric Newton [mailto:[EMAIL PROTECTED]]
>> *Sent:* Wednesday, 2 October 2013 23:05
>> *To:* [EMAIL PROTECTED]
>> *Subject:* Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
>>
>>  The most efficient way is kind of scary.  If this is a production
>> system, I would not recommend it.
>>
>> First, find out the size of your 10x tablets.  Let's say it's 10G.  Set
>> your split threshold to 10G.  Then merge all old tablets.... all of them
>> into one tablet.  This will dump thousands of files into a single tablet,
>> but it will soon split out again into the nice 10G tablets you are looking
>> for.  The system will probably be unusable during this operation.
>>
>> The more conservative way is to specify the merge in single steps (the
>> master will only coordinate a single merge on a table at a time anyhow).
>>  You can do it by range or by size... I would do it by size, especially if
>> you are aging off your old data.
>>
>> Compacting the data won't have any effect on the speed of the merge.
>>
>> -Eric
>>
>>
>>
>> On Tue, Oct 1, 2013 at 11:58 PM, Dickson, Matt MR <
>> [EMAIL PROTECTED]> wrote:
>>
>>> *UNOFFICIAL*
>>> I have a table for which we create splits of the form yyyymmdd-nnnn, where
>>> nnnn ranges from 0000 to 0840.  The bulk of our data is loaded for the
>>> current date, with no data loaded for days older than 3 days, so from my
>>> understanding it would be wise to merge splits older than 3 days in order
>>> to reduce the overall tablet count.  It would still be optimal to
>>> maintain some distribution of tablets for a day across the cluster, so I'm
>>> looking at merging splits in groups of 10, e.g. merge -b 20130901-0000 -e
>>> 20130901-0009, therefore reducing 840 splits per day to 84.
>>>
>>> Currently we have 120K tablets (size 1G) on a cluster of 56 nodes and
>>> our ingest has slowed as the data quantity and tablet count have grown.
>>> Initially we were achieving 200-300K, now 50-100K.
>>>
>>> My question is, what is the best way to do this merge?  Should we use
>>> the merge command with the size option set at something like 5G, or maybe
>>> use the compaction command?
>>>
>>> From my tests this process could take some time so I'm keen to
>>> understand the most efficient approach.
>>>
>>> Thanks in advance,
>>> Matt Dickson
>>>
>>
>>
>
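
For anyone who wants to play with the pictures above, here is a rough Python
sketch of what merging by size does to a run of adjacent tablets. It is only
a simplified model written for this thread, not Accumulo's merge code: the
function name merge_by_size, the greedy rule, and the example tablet sizes
are all assumptions made for illustration. It walks the tablets in row order
and folds each one into its neighbour for as long as the combined size stays
at or under the goal (4G in Eric's example).

```python
from typing import List


def merge_by_size(tablet_sizes_gb: List[float], goal_gb: float) -> List[float]:
    """Greedily combine adjacent tablets while the running total is <= goal_gb.

    Hypothetical helper for illustration only; this is a simplified model
    and not how Accumulo itself implements the merge.
    """
    merged: List[float] = []
    current = 0.0
    started = False  # True once the current merged tablet holds something
    for size in tablet_sizes_gb:
        # Close the current merged tablet if adding this one would overshoot.
        if started and current + size > goal_gb:
            merged.append(current)
            current = 0.0
        current += size
        started = True
    if started:
        merged.append(current)
    return merged


if __name__ == "__main__":
    # Age-off shape from the diagrams: many small old tablets, larger new ones.
    # The sizes here are invented for the example.
    aged = [0.25] * 16 + [1.0] * 8 + [2.0] * 4
    print(merge_by_size(aged, goal_gb=4.0))

    # Tablet count from the original question, assuming every one of the
    # 120K tablets is exactly 1G (real tablets will vary).
    print(len(merge_by_size([1.0] * 120_000, goal_gb=4.0)))
```

A tablet that is already bigger than the goal is left as it is in this model,
and the newest tablets can still be split by hand afterwards, as Eric
suggests, since the sketch only describes the merge step.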