Accumulo >> mail # user >> Performance of table with large number of column families

Re: Performance of table with large number of column families
So that means you have roughly 312.5k rows per tablet, which means about
725k column families in any given tablet. The intersecting iterator will
work one row at a time, so I think at any given moment, it will be working
through 32 at a time and doing a linear scan through the RFile blocks. With
RFile indices, that check is usually pretty fast, but you're having to go
through 4 orders of magnitude more data sequentially than you can work on.
If you can experiment and re-ingest with a smaller number of tablets,
anywhere between 15 and 45, I think you will see better performance.
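
For anyone following the numbers, here is a rough sketch of the arithmetic, using only the figures quoted in this thread (10M row partitions, 20M records, 32 tablets). It assumes a perfectly even split across tablets and one row in flight per tablet, which is a simplification. The "4 orders of magnitude" figure is read here as rows per tablet versus the 32 rows in flight; that reading is an assumption. The data section alone gives about 625k column families per tablet, and index terms add to that.

```python
import math

# Figures quoted in the thread.
TOTAL_ROWS = 10_000_000     # partitions 0000000 .. 9999999
TOTAL_RECORDS = 20_000_000  # one unique column family per record
TABLETS = 32

# Assumption: rows and records are spread perfectly evenly across tablets.
rows_per_tablet = TOTAL_ROWS / TABLETS
data_cfs_per_tablet = TOTAL_RECORDS / TABLETS
records_per_row = TOTAL_RECORDS / TOTAL_ROWS

# Assumption: one row is in flight per tablet, so TABLETS rows at a time.
# Compare the rows each tablet walks sequentially with the rows in flight.
sequential_ratio = rows_per_tablet / TABLETS
orders_of_magnitude = math.log10(sequential_ratio)

print(f"rows per tablet:          {rows_per_tablet:,.0f}")
print(f"data-section CFs/tablet:  {data_cfs_per_tablet:,.0f}")
print(f"records per row (avg):    {records_per_row:.1f}")
print(f"sequential vs in flight:  10^{orders_of_magnitude:.2f}")
```

The point of the ratio is that each tablet has to step through far more rows in sequence than the scan can cover concurrently, which is why the row and tablet layout matters here.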

On Fri, Nov 9, 2012 at 11:53 AM, Anthony Fox <[EMAIL PROTECTED]> wrote:

> Failed to answer the original question - 15 tablet servers, 32
> tablets/splits.
> On Fri, Nov 9, 2012 at 11:52 AM, Anthony Fox <[EMAIL PROTECTED]> wrote:
>> I've tried a number of different settings of table.split.threshold.  I
>> started at 1G and bumped it down to 128M and the cf scan is still ~30
>> seconds for both.  I've also used fewer rows - 00000 to 99999 and still see
>> similar performance numbers.  I thought the column family bloom filter
>> would help deal with a large row space but a sparsely populated column
>> space.  Is that correct?
>> On Fri, Nov 9, 2012 at 11:49 AM, William Slacum <
>> [EMAIL PROTECTED]> wrote:
>>> I'm more inclined to believe it's because you have to search across 10M
>>> different rows to find any given column family, since they're randomly, and
>>> possibly uniformly, distributed. How many tablets are you searching across?
>>> On Fri, Nov 9, 2012 at 11:45 AM, Anthony Fox <[EMAIL PROTECTED]> wrote:
>>>> Yes, there are 10M possible partitions.  I do not have a hash from
>>>> value to partition, the data is essentially randomly balanced across all
>>>> the tablets.  Unlike the bloom filter and intersecting iterator examples, I
>>>> do not have locality groups turned on and I have data in the cq and the
>>>> value for both index entries and record entries.  Could this be the issue?
>>>>  Each record entry has approximately 30 column qualifiers with data in the
>>>> value for each.
>>>> On Fri, Nov 9, 2012 at 11:41 AM, William Slacum <
>>>> [EMAIL PROTECTED]> wrote:
>>>>> I guess assuming you have 10M possible partitions, if you're using a
>>>>> relatively uniform hash to generate your IDs, you'll average about 2 per
>>>>> partition. Do you have any index for term/value to partition? This will
>>>>> help you narrow down your search space to a subset of your partitions.
>>>>> On Fri, Nov 9, 2012 at 11:39 AM, William Slacum <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>>> That shouldn't be a huge issue. How many rows/partitions do you have?
>>>>>> How many do you have to scan to find the specific column family/doc id you
>>>>>> want?
>>>>>> On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox <[EMAIL PROTECTED]> wrote:
>>>>>>> I have a table set up to use the intersecting iterator pattern.  The
>>>>>>> table has about 20M records which leads to 20M column families for the
>>>>>>> data section - 1 unique column family per record.  The index section of
>>>>>>> the table is not quite as large as the data section.  The rowkey is a
>>>>>>> random padded integer partition between 0000000 and 9999999.  I turned
>>>>>>> bloom filters on and used the ColumnFamilyFunctor to get performant
>>>>>>> column family scans without specifying a range like in the bloom filter
>>>>>>> examples in the README.  However, my column family scans (without any
>>>>>>> custom iterator) are still fairly slow - ~30 seconds for a column family
>>>>>>> batch scan of one record. I've also tried RowFunctor but I see similar
>>>>>>> performance.  Can anyone shed any light on the performance metrics I'm
>>>>>>> seeing?
>>>>>>> Thanks,
>>>>>>> Anthony