Terry P. 2013-08-02, 21:41
Eric Newton 2013-08-02, 22:35
Thanks Eric. I came into work today after kicking off a 100 million record
test data load and was pleasantly surprised to find the following distribution:
server1: 33.6 million docs
server2: 32.8 million docs
server3: 33.6 million docs
So it looks like my 5 million record load just didn't get big enough to
need to split. I now recall that the 5M load used 500-byte records rather
than the 1000-byte records that I was later told are closer to reality.
With a replication factor of 3 and 3 nodes, total consumed space is 248GB,
so compression looks to be saving about 18% for this random data. Real data will
compress better I'm sure, as this test data is just the RandomBatchWriter
tweaked to use a UUID as the RowKey to better match our app instead of the
monotonically increasing number. Hopefully later today the full test suite
will be ready so I can ingest real data.
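For anyone wanting to check that compression figure, here is a rough sketch of the arithmetic. It assumes 100 million records of 1000 bytes each, decimal gigabytes, and no key or metadata overhead; the function name is just illustrative.

```python
# Back-of-the-envelope estimate of the space saved by compression.
# Assumptions: 1000-byte records, decimal GB, key/metadata overhead ignored.
RECORDS = 100_000_000
RECORD_BYTES = 1000      # assumed average record size
REPLICATION = 3          # replication factor from the test cluster
CONSUMED_GB = 248        # total consumed space reported above


def space_saved_fraction(records, record_bytes, replication, consumed_gb):
    """Fraction of the raw, replicated data size that is not on disk."""
    raw_gb = records * record_bytes * replication / 1e9
    return 1 - consumed_gb / raw_gb


saved = space_saved_fraction(RECORDS, RECORD_BYTES, REPLICATION, CONSUMED_GB)
print(f"raw: 300 GB expected, consumed: {CONSUMED_GB} GB, saved: {saved:.1%}")
```

Under those assumptions the saving comes out a little over 17%, in the same ballpark as the estimate above. Any per-entry overhead would shift the number somewhat.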
Thanks for the addsplits syntax example -- I like that idea more than
working with a splits file, as it's easier to script and one less
dependency if you will. I'll presplit with that and re-test and see if the
distribution occurs sooner than I was seeing last week.
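Since scripting is the appeal here, a minimal sketch of generating the addsplits line for hex row keys might look like the following. The helper names and the "table" placeholder are hypothetical, and it assumes the row keys are lowercase hex strings with a roughly uniform first character, as with dash-stripped UUIDs.

```python
HEX_DIGITS = "0123456789abcdef"


def hex_split_points(prefix_len=1):
    """Return sorted split points covering every hex prefix of the given length.

    The all-zero prefix is skipped, since splitting there would only
    create a first tablet that holds almost nothing.
    """
    points = [""]
    for _ in range(prefix_len):
        points = [p + d for p in points for d in HEX_DIGITS]
    return points[1:]


def addsplits_command(table, prefix_len=1):
    """Build the Accumulo shell addsplits line for the given table name."""
    return "addsplits -t {} {}".format(table, " ".join(hex_split_points(prefix_len)))


print(addsplits_command("table"))
```

With prefix_len=1 this reproduces the 15 split points from Eric's example (16 tablets). prefix_len=2 would give 255 split points, which is likely more than a 3-node cluster needs at this data volume.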
Thanks again Eric, the info you and the other folks on this list give out
every week is invaluable.
On Fri, Aug 2, 2013 at 5:35 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
> Apparently 5M 1K documents isn't enough to split the tablet. I'm guessing
> that your documents are compressing well, or you are able to fit them all
> in memory. You could try flushing the table and see if it splits.
> shell > flush -t table -w
> Or, you could just add splits if you know the UUIDs are uniformly distributed:
> shell > addsplits -t table 1 2 3 4 5 6 7 8 9 a b c d e f
> Or, if you just want accumulo to split at a certain size under the 1G default:
> shell > config -t table -s table.split.threshold=10M
> On Fri, Aug 2, 2013 at 5:41 PM, Terry P. <[EMAIL PROTECTED]> wrote:
>> Greetings folks,
>> Have a bit of a non-typical Accumulo use case using Accumulo as a backend
>> data store for a search index to provide fault tolerance should the index
>> get corrupted. Max docs stored in Accumulo will be under 1 billion at full
>> The search index is used to "find" the data a user is interested in, and
>> the search index then retrieves the document from Accumulo using its RowKey
>> which was gotten from the search index. The RowKey is a java.util.UUID
>> string that has had the '-' dashes stripped out.
>> I have a 3 node cluster and as a quick test have ingested 5 million 1K
>> documents into it, yet they all went to a single TabletServer. I was kind
>> of surprised -- I knew this would be the case for a row key using a
>> monotonically increasing number, but I thought with a UUID type rowkey the
>> entries would have been spread across the TabletServers at least some, even
>> without pre-splitting the table.
>> Clearly my understanding of how Accumulo spreads the data out is lacking.
>> Can anyone shed more light on it? And possibly recommend a table split
>> strategy for a 3-node cluster such as I have described?
>> Many thanks in advance,