Accumulo >> mail # user >> Performance of table with large number of column families


Anthony Fox 2012-11-09, 16:26
William Slacum 2012-11-09, 16:39
William Slacum 2012-11-09, 16:41
Anthony Fox 2012-11-09, 16:45
William Slacum 2012-11-09, 16:49
Anthony Fox 2012-11-09, 16:52
Anthony Fox 2012-11-09, 16:53
William Slacum 2012-11-09, 17:02
Anthony Fox 2012-11-09, 17:11
William Slacum 2012-11-09, 17:15
Anthony Fox 2012-11-09, 17:18
Re: Performance of table with large number of column families
I'll ask someone to verify this comment for me (looking at you, John W Vines),
but the bloom filter helps when you have a limited, distinct set of column
families that each appear across many rows.
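
As a rough illustration of the sequential scan described in the quoted reply
below (a `scan -c <cf>` with no row to seek to), here is a toy Python sketch.
It is not Accumulo code: the table is modeled as a plain sorted list of
(row, column family) keys, and `make_table`, `scan_for_cf`, and `seek_row_cf`
are hypothetical names introduced only for this example. It also assumes one
unique column family per row, which is a simplification of the data layout
discussed in this thread.

```python
import bisect


def make_table(num_rows):
    """Toy table: one key per row, sorted by row, each with a unique cf.

    The multiplier 7919 is an arbitrary prime used to scatter cf values
    across rows; it is an assumption of this sketch, not part of Accumulo.
    """
    return [(f"{r:07d}", f"cf{(r * 7919) % num_rows:07d}") for r in range(num_rows)]


def scan_for_cf(table, cf):
    """No row is known, so keys are checked in sorted order until cf appears."""
    visited = 0
    for row, family in table:
        visited += 1
        if family == cf:
            return row, visited
    return None, visited


def seek_row_cf(table, row, cf):
    """The row is known, so a binary search lands directly on the key."""
    i = bisect.bisect_left(table, (row, cf))
    if i < len(table) and table[i] == (row, cf):
        return i
    return None


table = make_table(1000)
target_row, target_cf = table[900]

found_row, visited = scan_for_cf(table, target_cf)
print("full scan visited", visited, "keys to find row", found_row)
print("seek with known row found index", seek_row_cf(table, target_row, target_cf))
```

In this model the cost of the cf-only scan grows with the number of rows that
precede the match, while the lookup that also knows the row does not. Whether
a column family bloom filter can shortcut the cf-only case in Accumulo is the
open question in this thread, so the sketch makes no claim about that.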

On Fri, Nov 9, 2012 at 12:18 PM, Anthony Fox <[EMAIL PROTECTED]> wrote:

> Ah, ok, I was under the impression that this would be really fast since I
> have a column family bloom filter turned on.  Is this not correct?
>
>
> On Fri, Nov 9, 2012 at 12:15 PM, William Slacum <
> [EMAIL PROTECTED]> wrote:
>
>> When I said a smaller number of tablets, I really meant a smaller number
>> of rows :) My apologies.
>>
>> So if you're searching for a random column family in a table, like with a
>> `scan -c <cf>` in the shell, it will start at row 0 and work sequentially
>> up to row 10000000 until it finds the cf.
>>
>>
>> On Fri, Nov 9, 2012 at 12:11 PM, Anthony Fox <[EMAIL PROTECTED]> wrote:
>>
>>> This scan is without the intersecting iterator.  I'm just trying to pull
>>> back a single data record at the moment which corresponds to scanning for
>>> one column family.  I'll try with a smaller number of tablets, but is the
>>> computation effort the same for the scan I am doing?
>>>
>>>
>>> On Fri, Nov 9, 2012 at 12:02 PM, William Slacum <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> So that means you have roughly 312.5k rows per tablet, which means
>>>> about 725k column families in any given tablet. The intersecting iterator
>>>> works one row at a time, so I think at any given moment, it will be
>>>> working through 32 at a time and doing a linear scan through the RFile
>>>> blocks. With RFile indices, that check is usually pretty fast, but you're
>>>> having to go through 4 orders of magnitude more data sequentially than you can
>>>> work on. If you can experiment and re-ingest with a smaller number of
>>>> tablets, anywhere between 15 and 45, I think you will see better
>>>> performance.
>>>>
>>>> On Fri, Nov 9, 2012 at 11:53 AM, Anthony Fox <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Failed to answer the original question - 15 tablet servers, 32
>>>>> tablets/splits.
>>>>>
>>>>>
>>>>> On Fri, Nov 9, 2012 at 11:52 AM, Anthony Fox <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I've tried a number of different settings of table.split.threshold.
>>>>>> I started at 1G and bumped it down to 128M and the cf scan is still ~30
>>>>>> seconds for both.  I've also used fewer rows - 00000 to 99999 - and still see
>>>>>> similar performance numbers.  I thought the column family bloom filter
>>>>>> would help deal with a large row space but sparsely populated column space.
>>>>>> Is that correct?
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 9, 2012 at 11:49 AM, William Slacum <
>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> I'm more inclined to believe it's because you have to search across
>>>>>>> 10M different rows to find any given column family, since they're randomly,
>>>>>>> and possibly uniformly, distributed. How many tablets are you searching
>>>>>>> across?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 9, 2012 at 11:45 AM, Anthony Fox <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Yes, there are 10M possible partitions.  I do not have a hash from
>>>>>>>> value to partition; the data is essentially randomly balanced across all
>>>>>>>> the tablets.  Unlike the bloom filter and intersecting iterator examples, I
>>>>>>>> do not have locality groups turned on and I have data in the cq and the
>>>>>>>> value for both index entries and record entries.  Could this be the issue?
>>>>>>>> Each record entry has approximately 30 column qualifiers with data in the
>>>>>>>> value for each.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 9, 2012 at 11:41 AM, William Slacum <
>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> I guess assuming you have 10M possible partitions, if you're using
>>>>>>>>> a relatively uniform hash to generate your IDs, you'll average about 2 per
>>>>>>>>> partition. Do you have any index for term/value to partition? This will
>>>>
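
For reference, the figures quoted in this thread can be checked with a few
lines of Python. This is only a back-of-the-envelope sketch: the inputs (10M
rows, 32 tablets, 15 tablet servers, roughly 725k column families per tablet)
are taken from the messages above, and the variable names are made up for
this example. It assumes rows are spread evenly across tablets.

```python
# Figures quoted in the thread.
TOTAL_ROWS = 10_000_000        # "10M possible partitions"
TABLETS = 32                   # "32 tablets/splits"
TABLET_SERVERS = 15            # "15 tablet servers"
CFS_PER_TABLET_QUOTED = 725_000  # "about 725k column families in any given tablet"

# Assumption: rows are distributed evenly across tablets.
rows_per_tablet = TOTAL_ROWS / TABLETS
tablets_per_server = TABLETS / TABLET_SERVERS

# Average column families per row implied by the quoted per-tablet figure.
implied_cfs_per_row = CFS_PER_TABLET_QUOTED / rows_per_tablet

print(f"rows per tablet:       {rows_per_tablet:,.0f}")
print(f"tablets per server:    {tablets_per_server:.2f}")
print(f"implied cfs per row:   {implied_cfs_per_row:.2f}")
```

The rows-per-tablet result matches the "roughly 312.5k" quoted above, and the
implied per-row average is in line with the "about 2 per partition" estimate
from earlier in the thread.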
John Vines 2012-11-09, 17:41
Anthony Fox 2012-11-09, 18:02
John Vines 2012-11-09, 18:09
Anthony Fox 2012-11-09, 18:29
Eric Newton 2012-11-09, 18:32