HBase, mail # user - Splits and MapReduce


Re: Splits and MapReduce
Otis Gospodnetic 2012-05-17, 04:12
Leon, have a look at HBaseWD, which addresses exactly this row-key hotspotting problem: https://github.com/sematext/HBaseWD#readme

Here is a post about it that includes performance graphs and code:

http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ 
Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 
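
For illustration, here is a minimal sketch of the salted-key idea that HBaseWD is built around: a short bucket prefix, derived deterministically from the key, spreads sequential timestamps across regions at write time, and a time-range read fans out into one scan per bucket. The function names and the bucket count below are hypothetical, not HBaseWD's actual API:

```python
import hashlib

N_BUCKETS = 8  # number of salt buckets; a tuning choice, not a fixed value


def salted_key(timestamp: int) -> bytes:
    """Prepend a deterministic one-byte salt so monotonically increasing
    timestamps are distributed across N_BUCKETS key ranges instead of
    all landing on a single region server."""
    raw = timestamp.to_bytes(8, "big")
    bucket = int(hashlib.md5(raw).hexdigest(), 16) % N_BUCKETS
    return bytes([bucket]) + raw


def scan_ranges(start_ts: int, stop_ts: int) -> list[tuple[bytes, bytes]]:
    """A query over [start_ts, stop_ts) becomes N_BUCKETS parallel scans,
    one per salt prefix, whose results are merged by the client."""
    start = start_ts.to_bytes(8, "big")
    stop = stop_ts.to_bytes(8, "big")
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(N_BUCKETS)]
```

The salt is derived from the key itself (rather than chosen randomly), so the same timestamp always maps to the same bucket and point reads remain possible without scanning every bucket.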

>________________________________
> From: Leon Mergen <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Tuesday, May 15, 2012 10:53 AM
>Subject: Re: Splits and MapReduce
>
>Hello Himanish,
>
>Thanks for the advice. It looks like they are using a compound key of a
>"metric id" in addition to the timestamp:
>
>http://opentsdb.net/schema.html
>
>This sounds like a good solution for their use case, but unfortunately we
>have a lot of MapReduce jobs which filter *only* on the timestamp, and
>which would therefore degenerate into a full table scan. However, I did
>find this little gem:
>
>https://bugzilla.mozilla.org/show_bug.cgi?id=566340
>
>It looks like the Mozilla Socorro project ran into a similar issue, and
>they chose to salt their row keys: the first digit of an OOID is
>prepended to the timestamp to ensure a certain amount of write
>parallelism.
>
>What do the experts here think about this solution?
>
>
>Regards,
>
>Leon Mergen
>
>
>
>
>On Tue, May 15, 2012 at 4:28 PM, Himanish Kushary <[EMAIL PROTECTED]>wrote:
>
>> Hi,
>>
>> You could take a look at *OpenTSDB*. I think they are addressing some
>> of the issues that you mention here.
>>
>> Thanks
>>
>>
>> On Tue, May 15, 2012 at 10:09 AM, Leon Mergen <[EMAIL PROTECTED]> wrote:
>>
>> > Hello all,
>> >
>> > We are currently orienting on HBase as a possible way to store our log
>> data
>> > in a structured way, and I want to verify a few things I was not able to
>> > find online. Specifically, what we are trying to achieve:
>> >
>> >  * be able to quickly search for logs within a specific time range;
>> >  * limit the amount of maps in our mapreduce jobs to only those areas
>> we're
>> > interested in.
>> >
>> > As I understand it, there is a tradeoff:
>> >
>> > * if we use the timestamp as the leading part of the row key, a single
>> > region server can become a hotspot, which is bad when writing data at
>> > a high rate;
>> > * if the timestamp is not the leading part of the row key, a MapReduce
>> > job over a time range will have to do a full table scan and spawn a
>> > huge number of maps.
>> >
>> > Is there a known solution / workaround for this problem that people have
>> > used? Since our timespan queries are usually limited based on days, we
>> were
>> > considering adding a new table for each day, but that looked like a bit
>> of
>> > an ugly hack.
>> >
>> > Any ideas / suggestions about this?
>> >
>> > Regards,
>> >
>> > Leon Mergen
>> >
>>
>>
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>
>