HBase user mailing list - performance of Get from MR Job


Re: performance of Get from MR Job
One thing we observed with a similar setup was that if we added a reducer
and then used something like HRegionPartitioner to partition the data, our
GET performance improved dramatically. While you take a hit for adding the
reducer, it was worth it in our case. We never quite figured out why that
helped improve GET performance, so your mileage may vary. My personal
theory is that while it doesn't do anything for data locality, it does
wind up making the accesses more sequential, which HBase apparently likes.
You could also look at batch GETs if you don't mind complicating your code
a bit. I have also been told that
https://issues.apache.org/jira/browse/HDFS-2246 will make a big difference
in this scenario if your setup supports it. We are on HBase 0.90.4.
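
A minimal sketch of the batched variant, reusing the htable, "cf" family
and timestamp from your snippet (bufferedKeys is just an illustrative name
for a group of row keys the mapper has collected; Get, Result and Bytes
come from the usual org.apache.hadoop.hbase.client / util packages):

List<Get> gets = new ArrayList<Get>();
for (String key : bufferedKeys) {
    Get get = new Get(Bytes.toBytes(key));
    get.setMaxVersions(1);
    get.setTimeRange(0, timestamp);
    get.setCacheBlocks(false);
    get.addFamily(Bytes.toBytes("cf"));
    gets.add(get);
}
// One multi-get call instead of one RPC round trip per row.
Result[] results = htable.get(gets);

For the reducer route, the rough idea is to make the HBase row key the map
output key and let HRegionPartitioner route it, so each reduce task ends up
doing its GETs against a contiguous slice of a single region. Something
along these lines (MyReducer is a placeholder, and I believe
HRegionPartitioner picks up the table name from TableOutputFormat.OUTPUT_TABLE,
so double-check that against your HBase version):

job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(MyReducer.class);
job.setPartitionerClass(org.apache.hadoop.hbase.mapreduce.HRegionPartitioner.class);
// Tell the partitioner which table's region boundaries to use.
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "XYZ");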

On 6/19/12 4:37 AM, "Marcin Cylke" <[EMAIL PROTECTED]> wrote:

>Hi
>
>I've run into some performance issues with my hadoop MapReduce Job.
>Basically what I'm doing with it is:
>
>- read data from HDFS file
>- the output also goes to an HDFS file (multiple ones in my scenario)
>- in my mapper I process each line and enrich it with some data read
>from HBase table (I do Get each time)
>- I don't use reducer
>
>The Get performance seems not that good. On average it is ~17.5
>gets/second, with peaks of 100 gets/sec (which would be the desirable
>speed :)). The logs are from one node only, and so are the performance counts.
>
>My schema is nothing special - one ColumnFamily with 3 columns. But I
>heavily use timestamps. My table looks like this:
>
>{NAME => 'XYZ', FAMILIES => [{NAME => 'cf', BLOOMFILTER => 'NONE',
>REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '2147483646',
>TTL => '2147483647', MIN_VERSIONS => '0', BLOCKSIZE => '65536',
>IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>
>Look at the number of VERSIONS.
>
>And my GETs are like this:
>Get get = new Get(Bytes.toBytes(key));
>get.setMaxVersions(1);
>get.setTimeRange(0, timestamp);
>get.setCacheBlocks(false);
>get.addFamily(Bytes.toBytes("cf"));
>Result res = htable.get(get);
>
>I init that HTable like this:
>htable = new HTable(config, QUERY_TABLE_NAME);
>htable.setAutoFlush(false);
>htable.setWriteBufferSize(1024 * 1024 * 12);
>
>
>I've attached a sample of the Get performance - the first column is the
>number of GETs, the second is a date.
>
>Could you suggest where I'm getting that performance penalty? What should
>I look at to check whether I'm doing something stupid here, what kind of
>statistics?
>
>Regards
>Marcin