Fwd: Tuning & Compactions
Eric Newton 2012-12-06, 21:07
Keith noted that my response didn't go back to the whole list.

-Eric

---------- Forwarded message ----------
From: Eric Newton <[EMAIL PROTECTED]>
Date: Tue, Dec 4, 2012 at 2:25 PM
Subject: Re: Tuning & Compactions
To: [EMAIL PROTECTED]
By "small indexes"... I mean they are small to read off disk.  If you write
a gigabyte of indexes, it's going to take some time to read them into RAM.
The index is a sub-set of all the keys in the RFile.  If you have lots of
keys in the index, the lookups can be faster, but it takes more time to
load those keys into RAM.  Keep your keys small, and try to keep the
sub-set of keys in the index small so that first lookup is fast.  A million
index keys for a billion key/values is not unreasonable.  We have used even
smaller ratios, especially when the files to be imported are constructed to
fit the current split points.
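As a hedged illustration (the property name, table name, and the conn
handle are assumptions against a 1.4/1.5-era client API): RFile keeps
roughly one index entry per data block, so raising the data block size per
table yields a coarser, smaller index.

    import org.apache.accumulo.core.client.Connector;

    // Sketch, not a recommendation: larger data blocks -> fewer index
    // entries -> less index data to read into RAM on the first lookup.
    void shrinkIndex(Connector conn) throws Exception {
        conn.tableOperations().setProperty("mytable",
            "table.file.compress.blocksize", "512K");
    }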

You can have an effectively unlimited number of families and qualifiers.
However, if you ever want to put families into locality groups, they are
easier to configure when the number of families in each group is small.  A
group separates families by name.

Using the example from the Google BigTable paper: you can store small
indexed items, like URLs, separately from large value items, like whole web
pages, which gives you faster searches over the small items while logically
keeping them in the same sorted index.  URLs would go into one group, which
would be stored separately from another group containing the whole web page
and perhaps image data.  A search on URLs would not need to decompress and
skip over large values while scanning.  Further, URLs are more similar to
each other than they are to images, and so are likely to compress better
when stored together.
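A minimal sketch of that layout with the Accumulo client API (the table
name, group names, and family names are illustrative assumptions):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    // Put the small "url" family in one group and the bulky families in
    // another; scans restricted to "url" then skip the large sections.
    void configureGroups(Connector conn) throws Exception {
        Map<String, Set<Text>> groups = new HashMap<String, Set<Text>>();
        groups.put("small", new HashSet<Text>(
                Arrays.asList(new Text("url"))));
        groups.put("large", new HashSet<Text>(
                Arrays.asList(new Text("page"), new Text("image"))));
        conn.tableOperations().setLocalityGroups("webtable", groups);
    }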

To complicate things further, Accumulo does not create separate files for
each locality group, as implied in the BigTable paper.  The groups are
stored in separate sections of the same RFile.  They are also populated
lazily: as the data is re-written by compactions, it will gradually be
organized according to the locality group specifications.  You can force a
re-write, if you like.
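One way to force that re-write is a full compaction; a hedged sketch with
the client API (the table name is an assumption, and null start/end rows
cover the whole table):

    import org.apache.accumulo.core.client.Connector;

    // Flush in-memory data and re-write every RFile so the configured
    // locality groups take effect; wait=true blocks until it finishes.
    void forceRewrite(Connector conn) throws Exception {
        conn.tableOperations().compact("webtable", null, null, true, true);
    }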

If you find yourself wanting to put extensions in the column family that
have nothing to do with locality groups, just move them over to the column
qualifier.  We put carefully structured, binary data in the column
qualifier all the time.
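For instance (a hedged sketch; the row, family, and packed-field layout are
purely illustrative), a binary qualifier is written like any other:

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    // Pack pre-encoded binary fields into the qualifier; Text carries
    // arbitrary bytes, and keys still sort byte-wise.
    Mutation buildMutation(byte[] encodedFields) {
        Mutation m = new Mutation(new Text("row1"));
        m.put(new Text("meta"), new Text(encodedFields),
                new Value("v".getBytes()));
        return m;
    }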

-Eric

On Tue, Dec 4, 2012 at 1:06 PM, Chris Burrell <[EMAIL PROTECTED]> wrote:

> Thanks for all the comments below. Very helpful!
>
> On the last point, around "small indexes", do you mean a small set of keys
> but many column families and column qualifiers? What order of magnitude
> would you consider to be small? A few million keys/billion keys? Or, put
> another way, keys with 10s/100s of column families/qualifiers?
>
> I have another question around the use of column families and qualifiers.
> Would it be good or bad practice to have many column families/qualifiers
> per row?  I was just wondering if there would be any point in using these
> almost as extensions to the keys, i.e. the column family/qualifier would
> end up being the last part of the key. I understand column families can
> also be used to control how the data gets stored to optimize scanning too.
> I was just wondering if there would be any drawbacks to having many of
> these.
>
> Chris
>
>
>
> On 28 November 2012 20:31, Eric Newton <[EMAIL PROTECTED]> wrote:
>
>> Some comments inlined below:
>>
>> On Wed, Nov 28, 2012 at 2:49 PM, Chris Burrell <[EMAIL PROTECTED]> wrote:
>>
>>> Hi
>>>
>>> I am trialling Accumulo on a small (tiny) cluster and wondering how best
>>> to tune it. I have 1 master + 2 tservers. The master has 8Gb of RAM and
>>> the tservers each have 16Gb.
>>>
>>> I have set the walogs size to be 2Gb with an external memory map of 9Gb.
>>> The ratio is still the default of 3. I've also upped each tserver's heap
>>> to 2Gb.
>>>
>>> I'm trying to achieve high-speed ingest via batch writers held on