HBase, mail # user - Read thruput


Re: Read thruput
lars hofhansl 2013-04-02, 04:30
If you are concerned about latencies < 50ms you should disable Nagle's algorithm.

In hbase-site.xml:

  <property>
    <name>hbase.ipc.client.tcpnodelay</name>
    <value>true</value>
  </property>
  <property>
    <name>ipc.server.tcpnodelay</name>
    <value>true</value>
  </property>
You might get a further latency improvement if you do the same for HDFS:
In hdfs-site.xml:
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
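For context, these properties map to the standard TCP_NODELAY socket option, which the IPC layer sets via `Socket.setTcpNoDelay`. A minimal, self-contained sketch of the option itself (plain `java.net` sockets on loopback, no HBase involved), just to show what the config toggles:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class NagleDemo {
    public static void main(String[] args) throws IOException {
        // Bind a server socket on an ephemeral loopback port.
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                            server.getLocalPort())) {
                // Nagle's algorithm is on by default (TCP_NODELAY = false):
                // small writes can be held back waiting for ACKs, which is
                // exactly the latency hit described above.
                System.out.println("default TCP_NODELAY = " + client.getTcpNoDelay());

                // Disabling Nagle flushes small packets immediately -- this
                // is what hbase.ipc.client.tcpnodelay=true requests for RPCs.
                client.setTcpNoDelay(true);
                System.out.println("after TCP_NODELAY = " + client.getTcpNoDelay());
            }
        }
    }
}
```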

Also (as others have pointed out) you need to carefully control your garbage collections.
Watch the HDFS replication count (3 by default, which does not make any sense with only 2 DNs); but since you're reading, that should make no difference.
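On the GC point: a first step is usually to make pauses visible. A hedged sketch of GC-logging flags one might add to HBASE_OPTS in hbase-env.sh (standard HotSpot flags for Java 6/7-era JVMs; the log path is hypothetical):

```shell
# Hypothetical additions to hbase-env.sh to observe GC pause lengths;
# these are standard HotSpot logging flags, not HBase-specific settings.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc.log"
```

Long CMS or full-GC pauses in that log will show up directly as request timeouts at a 50 ms budget.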
-- Lars

________________________________
 From: Vibhav Mundra <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, April 1, 2013 3:09 AM
Subject: Read thruput
 
Hi All,

I am trying to use Hbase for real-time data retrieval with a timeout of 50
ms.

I am using 2 machines as datanode and regionservers,
and one machine as a master for hadoop and Hbase.

But I am able to fire only 3000 queries per sec and 10% of them are timing
out.
The database has 60 million rows.

Are these figures okay, or am I missing something?
I have set scanner caching to one, because each request fetches a single
row only.
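Since each request returns exactly one row, a point `Get` avoids scanner setup entirely and makes the scanner-caching setting moot. A sketch against the 0.94-era client API (table and family names taken from the schema below; the row key is hypothetical, and this needs a live cluster and its config on the classpath to run):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowLookup {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // table name from the schema
        try {
            Get get = new Get(Bytes.toBytes("some-row-key"));  // hypothetical key
            get.addFamily(Bytes.toBytes("cf"));       // family name from the schema
            Result result = table.get(get);           // single-row point read, no scanner
            System.out.println("empty result: " + result.isEmpty());
        } finally {
            table.close();
        }
    }
}
```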

Here are the various configurations:

*Our schema*
{NAME => 'mytable', FAMILIES => [{NAME => 'cf', DATA_BLOCK_ENCODING =>
'NONE', BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0', COMPRESSION =>
'GZ', VERSIONS => '1', TTL => '2147483647', MIN_VERSIONS => '0',
KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '8192', ENCODE_ON_DISK => 'true',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

*Configuration*
1 Machine having both hbase and hadoop master
2 machines having both region server node and datanode
285 regions in total

*Machine Level Optimizations:*
a) Number of file descriptors is 1000000 (ulimit -n gives 1000000)
b) Increased the read-ahead value to 4096
c) Added noatime,nodiratime mount options to the disks

*Hadoop Optimizations:*
dfs.datanode.max.xcievers = 4096
dfs.block.size = 33554432
dfs.datanode.handler.count = 256
io.file.buffer.size = 65536
hadoop data is split on 4 directories, so that different disks are being
accessed

*Hbase Optimizations*:

hbase.client.scanner.caching=1  # We have specifically added this, as we
always return one row.
hbase.regionserver.handler.count=3200
hfile.block.cache.size=0.35
hbase.hregion.memstore.mslab.enabled=true
hfile.min.blocksize.size=16384
hfile.min.blocksize.size=4
hbase.hstore.blockingStoreFiles=200
hbase.regionserver.optionallogflushinterval=60000
hbase.hregion.majorcompaction=0
hbase.hstore.compaction.max=100
hbase.hstore.compactionThreshold=100

*Hbase-GC*
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=20 -XX:ParallelGCThreads=16
*Hadoop-GC*
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC

-Vibhav