Accumulo >> mail # user >> Re: Reverse Index Timestamp


Re: Reverse Index Timestamp
Roshan,

Depending on your cluster setup and the resolution of the timestamp, you
could spread the data around with a row ID like this:

<timestamp-LSBs>-<string>-<reverse timestamp>

The LSBs of the timestamp act as a uniform hash, and splitting on all
possible hash values would spread things around a bit. The trade-off is
that all scans must then check every hash value for data.
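
To make the layout concrete, here is a rough Python sketch of how such a row
ID could be built and scanned. It is only an illustration under stated
assumptions: timestamps are non-negative 64-bit milliseconds, 16 buckets
(the low 4 bits) are used, and the names row_key, split_points, and
scan_prefixes are hypothetical helpers, not part of any Accumulo API.

```python
MAX_LONG = 2**63 - 1   # assumption: timestamps fit in a signed 64-bit long
NUM_BUCKETS = 16       # assumption: a power of two, so the modulo keeps the low 4 bits


def _bucket_width(num_buckets):
    # Zero-pad buckets so they sort lexicographically in numeric order.
    return len(str(num_buckets - 1))


def row_key(string, ts_millis, num_buckets=NUM_BUCKETS):
    """Build <timestamp-LSBs>-<string>-<reverse timestamp> (hypothetical helper)."""
    bucket = ts_millis % num_buckets      # low bits of the timestamp as a hash
    reverse_ts = MAX_LONG - ts_millis     # newer timestamps give smaller values
    width = _bucket_width(num_buckets)
    # 19 digits is the width of MAX_LONG, so string order matches numeric order.
    return f"{bucket:0{width}d}-{string}-{reverse_ts:019d}"


def split_points(num_buckets=NUM_BUCKETS):
    """One split per bucket boundary, so each bucket can live in its own tablet."""
    width = _bucket_width(num_buckets)
    return [f"{b:0{width}d}" for b in range(1, num_buckets)]


def scan_prefixes(string, num_buckets=NUM_BUCKETS):
    """Every prefix a reader has to scan to see all data for one <string>."""
    width = _bucket_width(num_buckets)
    return [f"{b:0{width}d}-{string}-" for b in range(num_buckets)]


if __name__ == "__main__":
    print(row_key("foo1", 1000))
    print(scan_prefixes("foo1")[:3])
```

Within one bucket the newest entry for a given <string> sorts first, and a
reader has to issue one scan per prefix returned by scan_prefixes, which is
the cost mentioned above.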

On Tue, Nov 27, 2012 at 1:25 PM, Keith Turner <[EMAIL PROTECTED]> wrote:

>
>
> On Tue, Nov 27, 2012 at 1:22 PM, Roshan Punnoose <[EMAIL PROTECTED]> wrote:
>
>> Thanks!
>>
>> The fact that you are using a binary tree behind the scenes makes perfect
>> sense. Btw, what do you use in the standalone (non native) implementation?
>> Does it use a TreeMap?
>>
>
> When not using native code, ConcurrentSkipListMap is used.
>
>
>>
>>
>> On Tue, Nov 27, 2012 at 12:57 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
>>
>>>
>>>
>>> On Tue, Nov 27, 2012 at 12:21 PM, Roshan Punnoose <[EMAIL PROTECTED]> wrote:
>>>
>>>> The <string> would most likely be a fixed set of strings that do not
>>>> change over time.
>>>>
>>>> My question is if it is bad to use a reverse index timestamp in the row
>>>> id? Will it cause problems with the tablet splitting, compaction, and
>>>> performance if the data is always being sent to the top of the tablet? If I
>>>> define a split as everything prefixed with <string>, then the ingest will
>>>> go to one tablet, but then I add a reverse timestamp in the row, and that
>>>> would mean I am always copying data to the top of the tablet. Will this
>>>> cause performance issues? Or is it better to append to a tablet?
>>>>
>>>
>>> I do not think it should matter. Inserts go into a C++ STL map on the
>>> tablet server if using the native map. I think the implementation of that
>>> is a balanced binary tree, so I do not think inserting at the beginning vs.
>>> the end would make a difference. That being said, I do not think I have
>>> tried this, so I do not know if there would be any surprises. I would be
>>> interested in hearing about your experiences.
>>>
>>>
>>>>
>>>>
>>>> On Tue, Nov 27, 2012 at 11:51 AM, Keith Turner <[EMAIL PROTECTED]> wrote:
>>>>
>>>>>
>>>>>
>>>>> Keith
>>>>>
>>>>> On Tue, Nov 27, 2012 at 10:41 AM, Roshan Punnoose <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I want to have a table where the row will consist of
>>>>>> "<string>-<reverse index timestamp>". But this means that the data is
>>>>>> always being prefixed to the beginning of the row (or tablet if the row is
>>>>>> large). Will this be a problem for compaction or performance?
>>>>>
>>>>>
>>>>> Can you tell me more about what <string> is? For example, is it a hash,
>>>>> or does it come from a set such as "foo1", "foo2", "foo3"? How does it
>>>>> change over time? I think the answer to your question depends on what
>>>>> <string> is.
>>>>>
>>>>>
>>>>>>
>>>>>> I don't know if I heard this correctly, but someone once mentioned
>>>>>> that making the row id the direct timestamp could cause performance issues
>>>>>> because data is always going to one tablet, and also because there is
>>>>>> trouble splitting since it always appends to the tablet. Is this true, and
>>>>>> is it similar to what could happen if I am always prefixing to a tablet?
>>>>>>
>>>>>
>>>>> Yes, using a timestamp for a row could cause data from many clients to
>>>>> always go to the same tablet, which would be bad for performance on a
>>>>> cluster.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> Roshan
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>