Re: How to pre-split a table for UUID rowkeys
Thanks Eric. I came into work today after kicking off a 100 million test
data load and was pleasantly surprised to find the following distribution:

server1: 33.6 million docs
server2: 32.8 million docs
server3: 33.6 million docs

So it looks like my 5-million-record load just didn't get big enough to need
to split (and now I recall that load was using 500-byte records rather than
the 1000-byte records I was later told are closer to reality).

With a replication factor of 3 and 3 nodes, total consumed space is 248 GB
(100 million ~1 KB documents times 3 replicas is roughly 300 GB raw), so
compression looks to be about 18% for this random data.  I'm sure real data
will compress better, as this test data is just the RandomBatchWriter tweaked
to use a UUID as the RowKey instead of the monotonically increasing number,
to better match our app.  Hopefully the full test suite will be ready later
today so I can ingest real data.
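
For anyone curious, the tweak to the writer boils down to roughly the
following (just a sketch of the idea, not the exact code from my test -- the
instance, table, and column names here are made-up placeholders):

import java.util.UUID;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class UuidRowKeyWriter {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- adjust for the actual cluster.
        Connector conn = new ZooKeeperInstance("myInstance", "zk1,zk2,zk3")
                .getConnector("user", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("docs", new BatchWriterConfig());
        byte[] payload = new byte[1000]; // 1K placeholder document body

        for (int i = 0; i < 5000000; i++) {
            // Row key is a java.util.UUID with the '-' dashes stripped out,
            // instead of RandomBatchWriter's monotonically increasing number.
            String rowId = UUID.randomUUID().toString().replace("-", "");
            Mutation m = new Mutation(new Text(rowId));
            m.put(new Text("doc"), new Text("body"), new Value(payload));
            writer.addMutation(m);
        }
        writer.close();
    }
}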

Thanks for the addsplits syntax example -- I like that idea better than
working with a splits file, as it's easier to script and one less dependency,
if you will.  I'll pre-split with that, re-test, and see if the data
distributes sooner than it did last week.
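
If we end up wanting to do the pre-split from code rather than the shell, my
understanding is the same hex split points can be added through the
TableOperations API, roughly like this (again just a sketch; connection
details and table name are placeholders):

import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- adjust for the actual cluster.
        Connector conn = new ZooKeeperInstance("myInstance", "zk1,zk2,zk3")
                .getConnector("user", new PasswordToken("secret"));

        // One split point per leading hex digit, matching the shell example:
        //   addsplits -t table 1 2 3 4 5 6 7 8 9 a b c d e f
        SortedSet<Text> splits = new TreeSet<Text>();
        for (char c : "123456789abcdef".toCharArray()) {
            splits.add(new Text(String.valueOf(c)));
        }
        conn.tableOperations().addSplits("table", splits);
    }
}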

Thanks again, Eric; the info you and the other folks on this list share every
week is invaluable.

On Fri, Aug 2, 2013 at 5:35 PM, Eric Newton <[EMAIL PROTECTED]> wrote:

> Apparently 5M 1K documents isn't enough to split the tablet.  I'm guessing
> that your documents are compressing well, or you are able to fit them all
> in memory.  You could try flushing the table and see if it splits.
>
> shell > flush -t table -w
>
> Or, you could just add splits if you know the UUIDs are uniformly
> distributed:
>
> shell > addsplits -t table 1 2 3 4 5 6 7 8 9 a b c d e f
>
> Or, if you just want accumulo to split at a certain size under the 1G
> default:
>
> shell > config -t table -s table.split.threshold=10M
>
> -Eric
>
>
>
> On Fri, Aug 2, 2013 at 5:41 PM, Terry P. <[EMAIL PROTECTED]> wrote:
>
>> Greetings folks,
>> Have a bit of a non-typical Accumulo use case using Accumulo as a backend
>> data store for a search index to provide fault tolerance should the index
>> get corrupted.  Max docs stored in Accumulo will be under 1 billion at full
>> volume.
>>
>> The search index is used to "find" the data a user is interested in, and
>> the search index then retrieves the document from Accumulo using its RowKey
>> which was gotten from the search index.  The RowKey is a java.util.UUID
>> string that has had the '-' dashes stripped out.
>>
>> I have a 3 node cluster and as a quick test have ingested 5 million 1K
>> documents into it, yet they all went to a single TabletServer.  I was kind
>> of surprised -- I knew this would be the case for a row key using a
>> monotonically increasing number, but I thought with a UUID type rowkey the
>> entries would have been spread across the TabletServers at least some, even
>> without pre-splitting the table.
>>
>> Clearly my understanding of how Accumulo spreads the data out is lacking.
>>  Can anyone shed more light on it?  And possibly recommend a table split
>> strategy for a 3-node cluster such as I have described?
>>
>> Many thanks in advance,
>> Terry
>>
>
>