HBase, mail # user - Hbase import Tsv performance (slow import)


Re: Hbase import Tsv performance (slow import)
Kevin O'Dell 2012-10-23, 17:47
You will want to make sure your table is pre-split.  Also, ImportTsv
does puts, so you will want to avoid flushing and blocking by raising
your memstore size, HLog size, and storefile blocking count.  This can
greatly improve your write speeds.  I usually use a 256MB memstore (you
can lower it later if the table is not write-heavy), a 512MB HLog (same
thing, you can lower it back to the default afterwards), and a
storefile blocking count of about 100.
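
For reference, here is a rough sketch of how those three values might
look in hbase-site.xml.  The property names are my assumption about
which settings are meant; the message above does not name them.  In
particular, mapping "512MB Hlog" to hbase.regionserver.hlog.blocksize
is a guess, so check the names against the documentation for your HBase
version before using this.

```xml
<!-- hbase-site.xml fragment: a sketch only, property names assumed -->
<configuration>
  <!-- Memstore flush size: 256 MB = 268435456 bytes -->
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>268435456</value>
  </property>
  <!-- HLog (write-ahead log) size: 512 MB = 536870912 bytes.
       Assumption: this is the setting meant by "512MB Hlog". -->
  <property>
    <name>hbase.regionserver.hlog.blocksize</name>
    <value>536870912</value>
  </property>
  <!-- Number of storefiles per store before updates are blocked -->
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>100</value>
  </property>
</configuration>
```

The region servers would need a restart to pick these up, and the
values can be lowered again once the import is done.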

On Tue, Oct 23, 2012 at 1:32 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
> Thanks, checking the schema itself is still interesting (cf. the link sent).
> Also, with 3 machines and a replication factor of 3, all the machines
> are used during a write. As HBase writes all entries into a write-ahead log
> for safety, the number of writes is also doubled. So maybe your machine is
> just dying under the load. Anyway, here your cluster is going at the speed
> of the least powerful machine, and this machine has a workload multiplied
> by 6 compared to a single machine config (i.e. just writing a file locally).
>
> On Tue, Oct 23, 2012 at 7:13 PM, Nick maillard <
> [EMAIL PROTECTED]> wrote:
>
>> Thanks for the help!
>>
>> My conf files are:
>>
>> Hadoop hdfs-site.xml
>>
>> <configuration>
>>  <property>
>>   <name>dfs.replication</name>
>>   <value>3</value>
>>   <description>Default block replication.
>>   The actual number of replications can be specified when the file is
>> created.
>>   The default is used if replication is not specified in create time.
>>   </description>
>> </property>
>> <property>
>>   <name>dfs.data.dir</name>
>>   <value>/home/runner/app/hadoop/dfs/data</value>
>>   <description>Comma-separated list of local directories where the
>>   DataNode stores its blocks.
>>   </description>
>> </property>
>> <property>
>>   <name>dfs.datanode.max.xcievers</name>
>>   <value>4096</value>
>> </property>
>> </configuration>
>>
>>
>> mapred-site.xml
>>
>> <configuration>
>>  <property>
>>   <name>mapred.job.tracker</name>
>>   <value>master:54311</value>
>>   <description>The host and port that the MapReduce job tracker runs
>>   at.  If "local", then jobs are run in-process as a single map
>>   and reduce task.
>>   </description>
>> </property>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>14</value>
>>   <description>The maximum number of map tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>   <value>14</value>
>>   <description>The maximum number of reduce tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>> <property>
>> <name>mapred.child.java.opts</name>
>>   <value>-Xmx400m</value>
>>   <description>Java opts for the task tracker child processes.
>>   The following symbol, if present, will be interpolated: @taskid@ is
>> replaced
>>   by current TaskID. Any other occurrences of '@' will go unchanged.
>>   For example, to enable verbose gc logging to a file named for the taskid
>> in
>>   /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
>>         -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
>>
>>   The configuration variable mapred.child.ulimit can be used to control the
>>   maximum virtual memory of the child processes.
>>   </description>
>> </property>
>> </configuration>
>>
>>
>> core-site.xml
>>
>> <configuration>
>>  <property>
>>   <name>hadoop.tmp.dir</name>
>>   <value>/home/runner/app/hadoop/tmp</value>
>>   <description>A base for other temporary directories.</description>
>> </property>
>>
>> <property>
>>   <name>fs.default.name</name>
>>   <value>hdfs://master:54310</value>
>>   <description>The name of the default file system.  A URI whose
>>   scheme and authority determine the FileSystem implementation.  The

Kevin O'Dell
Customer Operations Engineer, Cloudera