HBase >> mail # user >> Splits and MapReduce


Re: Splits and MapReduce
Leon, have a look at HBaseWD to solve this row key hotspotting problem: https://github.com/sematext/HBaseWD#readme

Here is a post about it that includes figures, performance graphs, and code:

http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ 
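The core idea HBaseWD implements can be sketched in a few lines of plain Java (this is not the library's API; the class, method names, and bucket count below are illustrative assumptions): derive a small deterministic salt from the original key, prepend it so sequential timestamps spread across regions, and strip it again on read.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch of one-byte-prefix key salting, in the spirit of HBaseWD.
// Names and the bucket count are ours, not the library's.
public class SaltedKeys {
    static final int BUCKETS = 16; // assumed; tune to your region count

    // Prepend a deterministic one-byte salt derived from the original key,
    // so monotonically increasing keys spread across BUCKETS regions.
    static byte[] toDistributed(byte[] originalKey) {
        byte salt = (byte) ((Arrays.hashCode(originalKey) & 0x7fffffff) % BUCKETS);
        byte[] out = new byte[originalKey.length + 1];
        out[0] = salt;
        System.arraycopy(originalKey, 0, out, 1, originalKey.length);
        return out;
    }

    // Strip the salt byte to recover the original key on read.
    static byte[] toOriginal(byte[] distributedKey) {
        return Arrays.copyOfRange(distributedKey, 1, distributedKey.length);
    }

    public static void main(String[] args) {
        byte[] ts = ByteBuffer.allocate(8).putLong(1337042000000L).array();
        byte[] salted = toDistributed(ts);
        System.out.println("salt bucket: " + salted[0]);
        System.out.println(Arrays.equals(toOriginal(salted), ts)); // true
    }
}
```

Because the salt is derived from the key itself, the same key always lands in the same bucket, so point reads still work without extra bookkeeping.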
Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 

>________________________________
> From: Leon Mergen <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Tuesday, May 15, 2012 10:53 AM
>Subject: Re: Splits and MapReduce
>
>Hello Himanish,
>
>Thanks for the advice. It looks like they are using a compound key of a
>"metric id" in addition to the timestamp:
>
>http://opentsdb.net/schema.html
>
>This sounds like a good solution for their use case, but unfortunately we
>have a lot of MapReduce jobs which filter *only* on the timestamp, and
>those would turn into a full table scan. However, I did find this little
>gem:
>
>https://bugzilla.mozilla.org/show_bug.cgi?id=566340
>
>It looks like the Mozilla Socorro project ran into a similar issue, and
>they chose to use a salt for their row keys: prepend the timestamp with
>the first digit of an OOID to ensure a certain amount of parallelism
>when writing.
>
>What are the thoughts of the experts here about this solution?
>
>
>Regards,
>
>Leon Mergen
>
>
>
>
>On Tue, May 15, 2012 at 4:28 PM, Himanish Kushary <[EMAIL PROTECTED]>wrote:
>
>> Hi,
>>
>> You could take a look at *OpenTSDB*. I think they are addressing some
>> of the issues that you mention here.
>>
>> Thanks
>>
>>
>> On Tue, May 15, 2012 at 10:09 AM, Leon Mergen <[EMAIL PROTECTED]> wrote:
>>
>> > Hello all,
>> >
>> > We are currently evaluating HBase as a possible way to store our log
>> > data in a structured way, and I want to verify a few things I was not
>> > able to find online. Specifically, what we are trying to achieve:
>> >
>> >  * be able to quickly search for logs within a specific time range;
>> >  * limit the number of maps in our MapReduce jobs to only those areas
>> > we're interested in.
>> >
>> > As I understand it, there is a tradeoff:
>> >
>> > * if you use the timestamp as the split key, a single region server can
>> > become a hotspot, which is bad when writing data at a high load;
>> > * if the timestamp is not the first part of the split keys, a MapReduce
>> > job has to do a full table scan and launches a huge number of maps.
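For reference, the read side of the salting approach discussed in this thread can be sketched as follows: a time-range query fans out into one scan range per salt bucket, so a job still touches only the requested window rather than the whole table. This is a self-contained illustration; the bucket count, key layout (1 salt byte + 8 timestamp bytes), and names are assumptions, not anyone's actual schema.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Sketch: with salted keys, a time-range query becomes one scan per
// salt bucket instead of a single contiguous scan. Bucket count and
// the 1-byte-salt + 8-byte-timestamp layout are illustrative.
public class SaltedScanRanges {
    static final int BUCKETS = 16;

    // One [start, stop) row-key pair per bucket for the given time window.
    static List<byte[][]> rangesFor(long startTs, long stopTs) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            ranges.add(new byte[][] { key(b, startTs), key(b, stopTs) });
        }
        return ranges;
    }

    // Row key = salt byte followed by a big-endian timestamp.
    static byte[] key(int bucket, long ts) {
        return ByteBuffer.allocate(9).put((byte) bucket).putLong(ts).array();
    }

    public static void main(String[] args) {
        List<byte[][]> r = rangesFor(1337000000000L, 1337086400000L);
        System.out.println(r.size() + " parallel scan ranges");
    }
}
```

Each range can back one map task, which is how the scheme keeps the number of maps bounded by the time window and the bucket count.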
>> >
>> > Is there a known solution / workaround for this problem that people
>> > have used? Since our timespan queries are usually bounded by day, we
>> > were considering adding a new table for each day, but that looked like
>> > a bit of an ugly hack.
>> >
>> > Any ideas / suggestions about this?
>> >
>> > Regards,
>> >
>> > Leon Mergen
>> >
>>
>>
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>
>