Accumulo >> mail # user >> Performance of table with large number of column families


Anthony Fox 2012-11-09, 16:26
William Slacum 2012-11-09, 16:39
William Slacum 2012-11-09, 16:41
Anthony Fox 2012-11-09, 16:45
William Slacum 2012-11-09, 16:49
Anthony Fox 2012-11-09, 16:52
Anthony Fox 2012-11-09, 16:53
William Slacum 2012-11-09, 17:02
Anthony Fox 2012-11-09, 17:11
William Slacum 2012-11-09, 17:15
Anthony Fox 2012-11-09, 17:18
William Slacum 2012-11-09, 17:23
John Vines 2012-11-09, 17:41
Anthony Fox 2012-11-09, 18:02
John Vines 2012-11-09, 18:09
Re: Performance of table with large number of column families
Do you mean two partitions per server?  In my case, that would correspond
to 30 total rows which would make each row very large ... >1G/row.  Should
I increase the table.split.threshold in a corresponding way?
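For reference, `table.split.threshold` can be raised per table from the Accumulo shell; a sketch of how that could look (the table name "shardtable" is hypothetical):

```shell
# Raise the split threshold so large (~1G) rows are not forced across
# many tablets. "shardtable" is a hypothetical table name.
config -t shardtable -s table.split.threshold=2G
# Inspect the current value:
config -t shardtable -f table.split.threshold
```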
On Fri, Nov 9, 2012 at 1:09 PM, John Vines <[EMAIL PROTECTED]> wrote:

> Glad to hear. I typically advise a minimum of 2 shards per tserver. I
> would say the maximum is actually based on the tablet size. Others in the
> community may disagree/provide better reasoning.
>
> Sent from my phone, pardon the typos and brevity.
> On Nov 9, 2012 1:03 PM, "Anthony Fox" <[EMAIL PROTECTED]> wrote:
>
>> Ok, I reingested with 1000 rows and performance for both single record
>> scans and index scans is much better.  I'm going to experiment a bit with
>> the optimal number of rows.  Thanks for the help, everyone.
>>
>>
>> On Fri, Nov 9, 2012 at 12:41 PM, John Vines <[EMAIL PROTECTED]> wrote:
>>
>>> The bloom filter checks only occur on a seek, and the way the column
>>> family filter works is that it seeks and then does a few scans to see if the
>>> appropriate families pop up in the short term. A bloom filter on the column
>>> family would be better if you had larger rows, to encourage more
>>> seeks/minimize the number of rows to do bloom checks on.
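For reference, the column-family bloom filter discussed here is configured per table; a sketch in the Accumulo shell (the table name "shardtable" is hypothetical, and existing RFiles only gain bloom filters once rewritten):

```shell
# Key the bloom filter on column family instead of row.
# "shardtable" is a hypothetical table name.
config -t shardtable -s table.bloom.enabled=true
config -t shardtable -s table.bloom.key.functor=org.apache.accumulo.core.file.keyfunctor.ColumnFamilyFunctor
# Bloom filters are written into new RFiles, so rewrite existing data:
compact -t shardtable
```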
>>>
>>> The issue is that you are ultimately checking every single row for a
>>> column, which is sparse. It's not that different than doing a full table
>>> regex. If you had locality groups set up it would be more performant, until
>>> you create locality groups for everything.
>>>
>>> The intersecting iterators get their performance by being able to
>>> operate on large rows to avoid the penalty of checking each row. Minimize
>>> the number of partitions you have and it should clear up your issues.
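The partitioning scheme being discussed maps each document to one of a fixed set of partition rows, so all of a document's entries land in the same row and the intersecting iterator can merge term lists row by row. A minimal Python sketch, with hypothetical names and the 30-partition count taken from the numbers in this thread:

```python
import hashlib

# Hypothetical sketch of the sharded-table layout discussed in the thread.
NUM_PARTITIONS = 30  # e.g. ~2 partitions per tserver on a 15-node cluster

def partition_row(doc_id: str) -> str:
    """Map a document ID to a stable partition row key, e.g. '0017'."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return f"{int(digest, 16) % NUM_PARTITIONS:04d}"
```

Because the mapping is deterministic, every term written for a given document goes under the same partition row key.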
>>>
>>> John
>>>
>>> Sent from my phone, pardon the typos and brevity.
>>> On Nov 9, 2012 12:24 PM, "William Slacum" <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> I'll ask for someone to verify this comment for me (look @ u John W
>>>> Vines), but the bloom filter helps when you have a discrete number of
>>>> column families that will appear across many rows.
>>>>
>>>> On Fri, Nov 9, 2012 at 12:18 PM, Anthony Fox <[EMAIL PROTECTED]>wrote:
>>>>
>>>>> Ah, ok, I was under the impression that this would be really fast
>>>>> since I have a column family bloom filter turned on.  Is this not correct?
>>>>>
>>>>>
>>>>> On Fri, Nov 9, 2012 at 12:15 PM, William Slacum <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> When I said smaller number of tablets, I really meant smaller number of
>>>>>> rows :) My apologies.
>>>>>>
>>>>>> So if you're searching for a random column family in a table, like
>>>>>> with a `scan -c <cf>` in the shell, it will start at row 0 and work
>>>>>> sequentially up to row 10000000 until it finds the cf.
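A toy Python model of the behavior described above (illustrative only): with no row range, a column-family scan examines every row in sorted order even when only one row contains the family:

```python
# Toy model: an unbounded column-family scan must examine every row,
# even if only one row actually contains the family being sought.
table = {
    f"row{i:07d}": {"cf_common", "cf_rare"} if i == 0 else {"cf_common"}
    for i in range(10_000)
}

def scan_for_family(table, cf):
    examined, hits = 0, []
    for row in sorted(table):        # rows come back in sorted order
        examined += 1
        if cf in table[row]:
            hits.append(row)
    return examined, hits

examined, hits = scan_for_family(table, "cf_rare")
# all 10,000 rows are examined for a single matching row
```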
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 9, 2012 at 12:11 PM, Anthony Fox <[EMAIL PROTECTED]>wrote:
>>>>>>
>>>>>>> This scan is without the intersecting iterator.  I'm just trying to
>>>>>>> pull back a single data record at the moment which corresponds to scanning
>>>>>>> for one column family.  I'll try with a smaller number of tablets, but is
>>>>>>> the computation effort the same for the scan I am doing?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 9, 2012 at 12:02 PM, William Slacum <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> So that means you have roughly 312.5k rows per tablet, which means
>>>>>>>> about 725k column families in any given tablet. The intersecting iterator
>>>>>>>> will work a row at a time, so I think at any given moment, it will be
>>>>>>>> working through 32 at a time and doing a linear scan through the RFile
>>>>>>>> blocks. With RFile indices, that check is usually pretty fast, but you're
>>>>>>>> having to go through 4 orders of magnitude more data sequentially than you can
>>>>>>>> work on. If you can experiment and re-ingest with a smaller number of
>>>>>>>> tablets, anywhere between 15 and 45, I think you will see better
>>>>>>>> performance.
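The figures quoted above work out as follows (a quick check in Python; the 10M-row total is inferred from the thread):

```python
# Quick check of the figures quoted above (10M rows inferred from the thread).
total_rows = 10_000_000
tablets = 32
rows_per_tablet = total_rows // tablets   # 312,500 -> "roughly 312.5k rows per tablet"

# The iterator works one row at a time in each of the 32 tablets, so data
# scanned sequentially per unit of parallel work is roughly:
ratio = rows_per_tablet / tablets         # ~9,766, i.e. the "4 orders of magnitude"
```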
Eric Newton 2012-11-09, 18:32