HBase >> mail # user >> Re: Hbase import Tsv performance (slow import)


Re: Hbase import Tsv performance (slow import)
This is good advice, Kevin. We should add this to the HBase Reference Guide.

________________________________
 From: Kevin O'dell <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, October 23, 2012 10:47 AM
Subject: Re: Hbase import Tsv performance (slow import)
 
You will want to make sure your table is pre-split.  Also, ImportTsv does
puts, so you will want to make sure you are not flushing and blocking.
You can avoid that by raising your memstore size, HLog size, and storefile
blocking count, which can greatly improve your write speeds.  I usually use
a 256MB memstore (you can lower it later if it is not a write-heavy table),
a 512MB HLog (same thing, you can lower it back to the default), and then
raise the storefile blocking count to about 100.
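
For anyone landing on this thread later, the settings above could look roughly like the following in hbase-site.xml on the region servers. This is only a sketch: the property names are the standard HBase ones, but mapping "512MB HLog" to hbase.regionserver.hlog.blocksize is my assumption, not something stated in the message above, so check it against the defaults for your HBase version before using it.

```xml
<configuration>
  <!-- Memstore flush threshold: 256MB (268435456 bytes), as suggested above -->
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>268435456</value>
  </property>
  <!-- Assumption: "512MB HLog" is read here as the HLog block size (536870912 bytes) -->
  <property>
    <name>hbase.regionserver.hlog.blocksize</name>
    <value>536870912</value>
  </property>
  <!-- Number of storefiles per store before updates are blocked: about 100, as suggested above -->
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>100</value>
  </property>
</configuration>
```

These are server-side settings, so the region servers would need a restart to pick them up, and they can be lowered back toward the defaults once the import is done.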

On Tue, Oct 23, 2012 at 1:32 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
> Thanks, checking the schema itself is still interesting (cf. the link sent).
> Also, with 3 machines and a replication factor of 3, all the machines
> are used during a write. As HBase writes all entries into a write-ahead log
> for safety, the number of writes is also doubled. So maybe your machine is
> just dying under the load. Anyway, here your cluster is going at the speed
> of the least powerful machine, and this machine has a workload multiplied
> by 6 compared to a single-machine config (i.e. just writing a file locally).
>
> On Tue, Oct 23, 2012 at 7:13 PM, Nick maillard <
> [EMAIL PROTECTED]> wrote:
>
>> Thanks for the help!
>>
>> My conf files are: Hadoop:
>> hdfs-site.xml
>>
>> <configuration>
>>  <property>
>>   <name>dfs.replication</name>
>>   <value>3</value>
>>   <description>Default block replication.
>>   The actual number of replications can be specified when the file is
>> created.
>>   The default is used if replication is not specified in create time.
>>   </description>
>> </property>
>> <property>
>>   <name>dfs.data.dir</name>
>>   <value>/home/runner/app/hadoop/dfs/data</value>
>>   <description>Default block replication.
>>   The actual number of replications can be specified when the file is
>> created.
>>   The default is used if replication is not specified in create time.
>>   </description>
>> </property>
>> <property>
>>         <name>dfs.datanode.max.xcievers</name>
>>         <value>4096</value>
>>       </property>
>> </configuration>
>>
>>
>> mapred-site.xml
>>
>> <configuration>
>>  <property>
>>   <name>mapred.job.tracker</name>
>>   <value>master:54311</value>
>>   <description>The host and port that the MapReduce job tracker runs
>>   at.  If "local", then jobs are run in-process as a single map
>>   and reduce task.
>>   </description>
>> </property>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>14</value>
>>   <description>The maximum number of map tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>   <value>14</value>
>>   <description>The maximum number of reduce tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>> <property>
>> <name>mapred.child.java.opts</name>
>>   <value>-Xmx400m</value>
>>   <description>Java opts for the task tracker child processes.
>>   The following symbol, if present, will be interpolated: @taskid@ is
>> replaced
>>   by current TaskID. Any other occurrences of '@' will go unchanged.
>>   For example, to enable verbose gc logging to a file named for the taskid
>> in
>>   /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
>>         -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
>>
>>   The configuration variable mapred.child.ulimit can be used to control the
>>   maximum virtual memory of the child processes.
>>   </description>
>> </property>
>> </configuration>
>>
>>
>> core-site.xml
>>
>> <configuration>
>>  <property>
>>   <name>hadoop.tmp.dir</name>
>>   <value>/home/runner/app/hadoop/tmp</value>
>>   <description>A base for other temporary directories.</description>

Kevin O'Dell
Customer Operations Engineer, Cloudera