Accumulo, mail # user - How to pre-split a table for UUID rowkeys


Re: How to pre-split a table for UUID rowkeys
Terry P. 2013-08-05, 15:17
Thanks Eric. I came into work today after kicking off a 100 million test
data load and was pleasantly surprised to find the following distribution:

server1: 33.6 million docs
server2: 32.8 million docs
server3: 33.6 million docs

So it looks like my 5-million-record load just didn't get big enough to
need a split (and now I recall that the 5M load used 500-byte records
rather than the 1000-byte size I was later told is closer to reality).

With a replication factor of 3 on 3 nodes, total consumed space is 248GB
versus roughly 300GB of raw data (100M docs x 1KB x 3 replicas), so
compression saves about 17% on this random data.  Real data will compress
better I'm sure, as this test data is just the RandomBatchWriter example
tweaked to use a UUID RowKey (to better match our app) instead of the
monotonically increasing number.  Hopefully the full test suite will be
ready later today so I can ingest real data.
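
For anyone curious, the tweak amounts to swapping in a dash-stripped UUID
as the row ID.  A rough sketch against the Accumulo Java client API -- the
table name, column family/qualifier, and payload are placeholders, and it
assumes an existing Connector:

    import java.util.UUID;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class UuidRowWriter {
        // Writes one document under a dash-stripped UUID row key,
        // e.g. "f47ac10b58cc4372a5670e02b2c3d479".
        static void writeDoc(Connector conn, byte[] payload) throws Exception {
            String rowId = UUID.randomUUID().toString().replace("-", "");
            Mutation m = new Mutation(new Text(rowId));
            m.put(new Text("doc"), new Text("content"), new Value(payload));
            // In real ingest you'd reuse one BatchWriter across many mutations.
            BatchWriter bw = conn.createBatchWriter("testtable", new BatchWriterConfig());
            bw.addMutation(m);
            bw.close();
        }
    }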

Thanks for the addsplits syntax example -- I like that approach more than
working with a splits file, as it's easier to script and is one less
dependency.  I'll presplit with that, re-test, and see if the distribution
evens out sooner than it did last week.
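
For anyone else scripting this, the same presplit should also be doable
through the Java client instead of the shell -- a sketch, again assuming an
existing Connector and a placeholder table name:

    import java.util.SortedSet;
    import java.util.TreeSet;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    public class HexPresplit {
        // Mirror of "addsplits -t table 1 2 3 4 5 6 7 8 9 a b c d e f":
        // one split point per leading hex digit of the UUID row keys.
        static void presplit(Connector conn, String table) throws Exception {
            SortedSet<Text> splits = new TreeSet<Text>();
            for (char c : "123456789abcdef".toCharArray()) {
                splits.add(new Text(String.valueOf(c)));
            }
            conn.tableOperations().addSplits(table, splits);
        }
    }

With those 15 split points the table starts out as 16 tablets, which the
master can then balance across the 3 TabletServers.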

Thanks again Eric, the info you and the other folks on this list give out
every week is invaluable.

On Fri, Aug 2, 2013 at 5:35 PM, Eric Newton <[EMAIL PROTECTED]> wrote:

> Apparently 5M 1K documents isn't enough to split the tablet.  I'm guessing
> that your documents are compressing well, or you are able to fit them all
> in memory.  You could try flushing the table and see if it splits.
>
> shell > flush -t table -w
>
> Or, you could just add splits if you know the UUIDs are uniformly
> distributed:
>
> shell > addsplits -t table 1 2 3 4 5 6 7 8 9 a b c d e f
>
> Or, if you just want Accumulo to split at a certain size below the 1G
> default:
>
> shell > config -t table -s table.split.threshold=10M
>
> -Eric
>
>
>
> On Fri, Aug 2, 2013 at 5:41 PM, Terry P. <[EMAIL PROTECTED]> wrote:
>
>> Greetings folks,
>> I have a bit of a non-typical use case: Accumulo is the backend data
>> store for a search index, providing fault tolerance should the index get
>> corrupted.  Max docs stored in Accumulo will be under 1 billion at full
>> volume.
>>
>> The search index is used to "find" the data a user is interested in, and
>> the document is then retrieved from Accumulo using its RowKey, which comes
>> from the search index.  The RowKey is a java.util.UUID string with the
>> '-' dashes stripped out.
>>
>> I have a 3-node cluster, and as a quick test ingested 5 million 1K
>> documents into it, yet they all went to a single TabletServer.  I was kind
>> of surprised -- I knew this would be the case for a rowkey using a
>> monotonically increasing number, but I thought with a UUID-type rowkey the
>> entries would have been spread across the TabletServers at least somewhat,
>> even without pre-splitting the table.
>>
>> Clearly my understanding of how Accumulo spreads the data out is lacking.
>>  Can anyone shed more light on it?  And possibly recommend a table split
>> strategy for a 3-node cluster such as I have described?
>>
>> Many thanks in advance,
>> Terry
>>
>
>
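
P.S. For anyone scripting Eric's other two suggestions outside the shell,
they also map onto the Java client -- a sketch, assuming an existing
Connector and a placeholder table name:

    import org.apache.accumulo.core.client.Connector;

    public class SplitTuning {
        static void flushAndLowerThreshold(Connector conn, String table) throws Exception {
            // Equivalent of "flush -t table -w": flush in-memory entries
            // and wait for the flush to complete.
            conn.tableOperations().flush(table, null, null, true);
            // Equivalent of "config -t table -s table.split.threshold=10M":
            // tablets then split at ~10MB instead of the 1G default.
            conn.tableOperations().setProperty(table, "table.split.threshold", "10M");
        }
    }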