-
Re: Hbase import Tsv performance (slow import)
Nick maillard 2012-10-24, 11:40
Looking at my task logs, there is a big gap in time that I do not understand.
The task connects to ZooKeeper, creates the entries, and then logs nothing from 2012-10-24 12:25:24 to 2012-10-24 13:08:03. It is doing the MapReduce work during that time, I guess.

2012-10-24 12:25:23,323 INFO org.apache.zookeeper.ClientCnxn: Sessionestablishment complete on server
2012-10-24 12:25:24,266 INFO org.apache.hadoop.hbase.mapreduce.TableOutputFormat: Created table instance for conf2_events
2012-10-24 12:25:24,361 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2012-10-24 12:25:24,461 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin org.apache.hadoop.util.LinuxResourceCalculatorPlugin@13394344
2012-10-24 12:25:24,615 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy Snappy native library not loaded
2012-10-24 13:08:03,738 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x13a91f1e41000c0
2012-10-24 13:08:03,751 INFO org.apache.zookeeper.ZooKeeper: Session:0x13a91f1e41000c0 closed
2012-10-24 13:08:03,751 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2012-10-24 13:08:03,751 INFO org.apache.hadoop.mapred.Task: Task:attempt_201210241044_0005_m_000000_0 is done. And is in the process of commiting

On the MapReduce side, the job is being run:

2012-10-24 12:25:19,212 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_.. given task: attempt_201210241044_0005_m_000002_0
2012-10-24 12:25:19,308 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_.. given task: attempt_201210241044_0005_m_000012_0
2012-10-24 12:25:19,347 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_.. given task: attempt_201210241044_0005_m_000003_0
2012-10-24 12:25:19,510 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_.. given task: attempt_201210241044_0005_m_000010_0
2012-10-24 12:25:19,525 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_899418193 given task: attempt_201210241044_0005_m_000007_0
2012-10-24 12:25:19,526 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-1509383641 given task: attempt_201210241044_0005_m_000001_0
2012-10-24 12:25:19,708 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-19778997 given task: attempt_201210241044_0005_m_000004_0
2012-10-24 12:25:19,822 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-4189743 given task: attempt_201210241044_0005_m_000009_0
2012-10-24 12:25:19,980 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-661677671 given task: attempt_201210241044_0005_m_000005_0
2012-10-24 12:25:20,044 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_1898916331 given task: attempt_201210241044_0005_m_000000_0
2012-10-24 12:25:20,167 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_1123667416 given task: attempt_201210241044_0005_m_000008_0
2012-10-24 12:25:20,392 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_1621934208 given task: attempt_201210241044_0005_m_000006_0
2012-10-24 12:25:20,500 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-538140840 given task: attempt_201210241044_0005_m_000013_0
2012-10-24 12:25:20,602 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-1673565310 given task: attempt_201210241044_0005_m_000011_0
2012-10-24 12:25:27,566 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210241044_0005_m_000012_0 0.005804179%
2012-10-24 12:25:27,719 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210241044_0005_m_000002_0 0.005184336%
2012-10-24 12:25:27,745 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210241044_0005_m_000003_0 0.008510194%

Datanode logs:

2012-10-24 12:26:31,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_8158410273681796837_11398 at file /home/runner/app/hadoop/dfs/data/current/subdir41/blk_8158410273681796837
2012-10-24 12:26:32,576 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /37.59.44.188:36421, dest: /91.121.69.14:50010, bytes: 6543254, op: HDFS_WRITE, cliID: DFSClient_hb_rs_slave2,60020,1351068281823, offset: 0, srvID: DS-747375281-91.121.69.14-50010-1350487134487, blockid: blk_-6002773137274160991_11407, duration: 1749006835
2012-10-24 12:26:32,576 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-6002773137274160991_11407 terminating
2012-10-24 12:26:33,807 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /94.23.225.32:54226, dest: /91.121.69.14:50010, bytes: 25497785, op: HDFS_WRITE, cliID: DFSClient_hb_rs_slave2,60020,1351068281823, offset: 0, srvID: DS-747375281-91.121.69.14-50010-1350487134487, blockid: blk_-869989770332149129_11406, duration: 3505041135
2012-10-24 12:26:33,807 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-869989770332149129_11406 terminating
2012-10-24 12:26:34,165 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_3644089242601024939_11408 src: /37.59.44.188:36433 dest: /91.121.69.14:50010
2012-10-24 12:26:34,347 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_-5745640874810358842_11401 file /home/runner/app/hadoop/dfs/data/current/subdir41/blk_-5745640874810358842 for deletion
2012-10-24 12:26:34,347 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_1293587588119122949_11402 file /home/runner/app/hadoop/dfs/data/current/subdir41/blk_1293587588119122949 for deletion
2012-10-24 12:26:34,347 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_2164482121237588223_11404 file /home/runner/app/hadoop/dfs/data/current/subdir41/blk_2164482121237588223 for deletion
2012-10-24 12:26:34,347 INFO org.apache.hadoo
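The silent period in the task log above can be measured rather than eyeballed. The following is a minimal sketch, assuming every log line starts with the 23-character `YYYY-MM-DD HH:MM:SS,mmm` timestamp seen in the excerpts; the helper names `parse_ts` and `largest_gap` are made up for this illustration and are not part of Hadoop or HBase.

```python
from datetime import datetime

# Timestamp layout used by the log excerpts, e.g. 2012-10-24 12:25:24,615
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(line):
    """Parse the 23-character timestamp at the start of a log line."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest silence
    between two consecutive log lines."""
    stamped = [(parse_ts(line), line) for line in lines]
    (t0, before), (t1, after) = max(
        zip(stamped, stamped[1:]),
        key=lambda pair: pair[1][0] - pair[0][0],
    )
    return t1 - t0, before, after


# Three consecutive lines copied from the task log in the message above.
task_log = [
    "2012-10-24 12:25:24,461 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin org.apache.hadoop.util.LinuxResourceCalculatorPlugin@13394344",
    "2012-10-24 12:25:24,615 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy Snappy native library not loaded",
    "2012-10-24 13:08:03,738 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x13a91f1e41000c0",
]

gap, before, after = largest_gap(task_log)
print(gap)  # length of the silent period (a bit under 43 minutes here)
```

A silence of this length between the Snappy warning and the connection close suggests the mapper spent that time inside the write path without logging, which fits the later reports of tasks failing to report status.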
-
Re: Hbase import Tsv performance (slow import)
ramkrishna vasudevan 2012-10-24, 13:47
'Yeah, we never used HBase client api(puts) for loading a batch of millions
of records. Can you tell me by default where the o/p HFile(s) from MR job are stored in HDFS?'

Hi Anil

The o/p HFiles are stored in the path created for the corresponding HBase table:
/table_name/store_name/region_name/file_name.
The location is the same one that is used when a normal flush through HBase happens.

Hope this helps.

Regards
Ram

On Wed, Oct 24, 2012 at 5:10 PM, Nick maillard <[EMAIL PROTECTED]> wrote:
> Looking my task logs there is a big gap in time I do not understand.
> The task connects to zookeeper creates the entries and from:
> 2012-10-24 12:25:24 to 2012-10-24 13:08:03 logs nothing.
> Doing map reduce I guess.
>
> 2012-10-24 12:25:23,323 INFO org.apache.zookeeper.ClientCnxn: Sessionestablishment complete on server
> 2012-10-24 12:25:24,266 INFO org.apache.hadoop.hbase.mapreduce.TableOutputFormat: Created table instance for conf2_events
> 2012-10-24 12:25:24,361 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
> 2012-10-24 12:25:24,461 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin org.apache.hadoop.util.LinuxResourceCalculatorPlugin@13394344
> 2012-10-24 12:25:24,615 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy Snappy native library not loaded
> 2012-10-24 13:08:03,738 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x13a91f1e41000c0
> 2012-10-24 13:08:03,751 INFO org.apache.zookeeper.ZooKeeper: Session:0x13a91f1e41000c0 closed
> 2012-10-24 13:08:03,751 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2012-10-24 13:08:03,751 INFO org.apache.hadoop.mapred.Task: Task:attempt_201210241044_0005_m_000000_0 is done. And is in the process of commiting
>
> Map reduce side the job is being run
>
> 2012-10-24 12:25:19,212 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_.. given task: attempt_201210241044_0005_m_000002_0
> 2012-10-24 12:25:19,308 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_.. given task: attempt_201210241044_0005_m_000012_0
> 2012-10-24 12:25:19,347 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_.. given task: attempt_201210241044_0005_m_000003_0
>
> 2012-10-24 12:25:19,510 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_.. given task: attempt_201210241044_0005_m_000010_0
> 2012-10-24 12:25:19,525 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_899418193 given task: attempt_201210241044_0005_m_000007_0
> 2012-10-24 12:25:19,526 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-1509383641 given task: attempt_201210241044_0005_m_000001_0
> 2012-10-24 12:25:19,708 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-19778997 given task: attempt_201210241044_0005_m_000004_0
> 2012-10-24 12:25:19,822 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-4189743 given task: attempt_201210241044_0005_m_000009_0
> 2012-10-24 12:25:19,980 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-661677671 given task: attempt_201210241044_0005_m_000005_0
> 2012-10-24 12:25:20,044 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_1898916331 given task: attempt_201210241044_0005_m_000000_0
>
> 2012-10-24 12:25:20,167 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_1123667416 given task: attempt_201210241044_0005_m_000008_0
> 2012-10-24 12:25:20,392 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_1621934208 given task: attempt_201210241044_0005_m_000006_0
> 2012-10-24 12:25:20,500 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201210241044_0005_m_-538140840 given task: attempt_201210241044_0005_m_000013_0
> 2012-10-24 12:25:20,602 INFO org.apache.hadoop.mapred.TaskTracker:
-
Re: Hbase import Tsv performance (slow import)
Nick maillard 2012-10-24, 10:15
As I wrote in a reply above that got somewhat lost in the thread:
I have set dfs.replication to 2, but the processing time has not changed at all. How could I change my configuration to avoid the hotspot issue you have talked about?

As Kevin advised, I have also raised:
hbase.hstore.blockingStoreFiles to 100
hbase.hregion.memstore.block.multiplier to 7
hbase.hregion.memstore.flush.size to 256 MB
hbase.regionserver.optionallogflushinterval to 30s

These changes did not bring any significant improvement in speed.

My cluster is 3 Ubuntu machines: 2 cores, 4 threads, 3.4+ GHz, with 16 GB of RAM.

Thanks for everyone's help.
-
Re: Hbase import Tsv performance (slow import)
Sonal Goyal 2012-10-24, 11:18
Hi Nick,
Do you see anything in your tasktracker or datanode logs?

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Wed, Oct 24, 2012 at 3:45 PM, Nick maillard <[EMAIL PROTECTED]> wrote:
> As I have written in a reply above but that is kind of lost in the tread:
>
> I have set dfs.replication at 2 but this process time has not changed at all.
> How could I change my configuration to avoid this hotspot issue you have talked about.
>
> As Kevin has advised I have also upped:
> hbase.hstore.blockingStoreFiles to 100
> hbase.hregion.memstore.block.multiplier to 7
> hbase.hregion.memstore.flush.size to 256 MB
> hbase.regionserver.optionallogflushinterval to 30s
>
> These changes did not bring any significant evolution in speed.
>
> My cluster is 3 ubuntu machines:
> 2 cores 4 threads 3.4+ GHz with 16gb ram
>
> thanks for everyones help
-
Re: Hbase import Tsv performance (slow import)
Nick maillard 2012-10-24, 10:05
Hi John
I have a capacity of 42 map tasks and an average of 28 tasks per node. When I check the map job details, there are 80 tasks to complete. As I drill down into the individual map tasks in the task details, they all take a very long time (26 minutes) to complete. A lot of them fail as well. The failure info is "failed to report status for 601 seconds", so a timeout.

It does feel like an M/R-related issue. I have tried running the Hadoop wordcount example on the same 5 GB HDFS file. The point was to get a feel for something that is Hadoop only, with no HBase involved. The process took a couple of minutes.

I guess something in the ImportTsv call through HBase hangs the map tasks. I don't really know where to look anymore to understand it. Any idea of where, how, or what to look for would be appreciated. Any idea of a different configuration I could try would be great as well.

Thanks in advance.
-
Re: Hbase import Tsv performance (slow import)
Nick maillard 2012-10-24, 09:23
Thanks for your help
I have taken my replication down to 2, but if I am not mistaken, replication also has the benefit of making the cluster more fault-tolerant by duplicating data on different nodes, so that if one goes down, data is not necessarily lost. In that case I would like to keep it at least at 2.

I have set dfs.replication to 2, but the processing time has not changed at all. How could I change my configuration to avoid the hotspot issue you talked about?

As Kevin advised, I have also raised:
hbase.hstore.blockingStoreFiles to 100
hbase.hregion.memstore.block.multiplier to 7
hbase.hregion.memstore.flush.size to 256 MB
hbase.regionserver.optionallogflushinterval to 30s

However, the ImportTsv map phase still takes around 1 minute per 1% of map progress, so over an hour in total.

Currently I have 42 running map tasks and an average of 28 tasks per node. A lot of my map tasks end up in "failed to report status for 601 seconds".

My cluster is 3 Ubuntu machines: 2 cores, 4 threads, 3.4+ GHz, with 16 GB of RAM.

With bulk load the process finishes in around 20 minutes. But I am surprised that it takes more than an hour to insert 5 GB of data into HBase without bulk load. I feel there is something I'm not getting.
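One way to sanity-check the slot configuration described above is to compare it with the hardware. This is a rough sketch only: it assumes the quoted machine spec (4 hardware threads, 16 GB RAM) is per node, that the 42 map slots are spread evenly over the 3 nodes, and it takes the 14 reduce slots and the -Xmx400m child heap from the mapred-site.xml posted on 2012-10-23 in this thread. The function `slot_pressure` is a made-up helper, not a Hadoop API.

```python
def slot_pressure(map_slots, reduce_slots, hw_threads, child_heap_mb, ram_mb):
    """Summarise how far the configured task slots oversubscribe one node."""
    total_slots = map_slots + reduce_slots
    worst_case_heap_mb = total_slots * child_heap_mb
    return {
        # concurrent map JVMs competing for each hardware thread
        "map_tasks_per_hw_thread": map_slots / hw_threads,
        # heap if every map and reduce slot is busy at once
        "worst_case_child_heap_mb": worst_case_heap_mb,
        "heap_fraction_of_ram": worst_case_heap_mb / ram_mb,
    }


pressure = slot_pressure(
    map_slots=42 // 3,     # 42 map slots over 3 nodes (assumed even split)
    reduce_slots=14,       # from the posted mapred-site.xml
    hw_threads=4,          # assumed to be per node
    child_heap_mb=400,     # -Xmx400m
    ram_mb=16 * 1024,      # assumed to be per node
)
for name, value in pressure.items():
    print(f"{name}: {value}")
```

The same nodes also host the DataNode, TaskTracker and RegionServer processes, which this estimate ignores, so the real pressure on CPU and memory is likely higher than these figures.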
-
Re: Hbase import Tsv performance (slow import)
Nick maillard 2012-10-24, 14:35
Hello everyone
I am still looking into the issue. I have tried different tests and the results are surprising.

If I set mapred.tasktracker.map.tasks.maximum to 28, I get a total of 84 tasks on my cluster and the process takes about 1h15, each task taking 1h10. The whole file is split into 80 tasks.

If I set mapred.tasktracker.map.tasks.maximum to 3, I get a total of 6 tasks on my cluster and the process takes about the same amount of time, 1h20, still splitting the whole file into 80 tasks, but now of course each individual task only takes a couple of minutes.

It is as if the overall ImportTsv must take a bit over an hour, and the duration of the map tasks varies accordingly.

There is definitely something I am doing wrong.
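The two runs above can be compared as aggregate throughput. A small sketch, assuming the input is the 5 GB file mentioned earlier in the thread and counting 1 GB as 1024 MB; `throughput_mb_per_s` is a made-up helper for this illustration.

```python
def throughput_mb_per_s(size_gb, minutes):
    """Average ingest rate for a job that loads size_gb in the given wall time."""
    return size_gb * 1024 / (minutes * 60)


# Wall-clock times reported in the message above.
runs = {
    "28 map slots per node (1h15)": 75,
    "3 map slots per node (1h20)": 80,
}
for label, minutes in runs.items():
    print(f"{label}: {throughput_mb_per_s(5, minutes):.2f} MB/s")
```

Both runs land at roughly 1 MB/s for the whole cluster. A rate that barely moves when map concurrency changes by an order of magnitude points to a bottleneck shared by all mappers (for example the write path on the HBase side) rather than to the MapReduce settings.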
-
Re: Hbase import Tsv performance (slow import)
Kevin O'dell 2012-10-24, 16:18
Nick,
What versions are you using:

HDFS
HBase
OS

On Oct 24, 2012 10:36 AM, "Nick maillard" <[EMAIL PROTECTED]> wrote:
> Hello everyone
>
> Still looking in the issue.
> I have tried different tests and the results are surprising.
> If I put mapred.tasktracker.map.tasks.maximum: 28
> I get a total of 84 tasks on my cluster and the process takes about 1h15 min
> each task taking up 1h10 minutes. The whole file being cut down in 80 tasks.
>
> If I put mapred.tasktracker.map.tasks.maximum: 3
> I get a total of 6 tasks on my cluster and the process takes about the same
> amount of time 1h20 still cutting down the whole file in 80 tasks, but now of
> course each individual task only takes up a couple of minutes.
>
> It's like the overall importTSv must take 1h something and the duration of the
> map tasks vary accordingly.
>
> There is definitly something I am doing wrong.
-
Re: Hbase import Tsv performance (slow import)
anil gupta 2012-10-24, 16:30
Hi Nick,
How many hard drives does each of your slaves have, and what is their RPM? How many mappers run concurrently on a node? Did you turn off speculative execution?

Have a look at disk I/O to see whether it is a bottleneck. MR is disk I/O bound, so if you have only one disk per slave and you are running 5 mappers concurrently, the job will slow down.

Thanks,
Anil

On Wed, Oct 24, 2012 at 9:18 AM, Kevin O'dell <[EMAIL PROTECTED]> wrote:
> Nick,
>
> What versions are you using:
>
> HDFS
> HBase
> OS
> On Oct 24, 2012 10:36 AM, "Nick maillard" <[EMAIL PROTECTED]> wrote:
>
> > Hello everyone
> >
> > Still looking in the issue.
> > I have tried different tests and the results are surprising.
> > If I put mapred.tasktracker.map.tasks.maximum: 28
> > I get a total of 84 tasks on my cluster and the process takes about 1h15 min
> > each task taking up 1h10 minutes. The whole file being cut down in 80 tasks.
> >
> > If I put mapred.tasktracker.map.tasks.maximum: 3
> > I get a total of 6 tasks on my cluster and the process takes about the same
> > amount of time 1h20 still cutting down the whole file in 80 tasks, but now of
> > course each individual task only takes up a couple of minutes.
> >
> > It's like the overall importTSv must take 1h something and the duration of the
> > map tasks vary accordingly.
> >
> > There is definitly something I am doing wrong.

--
Thanks & Regards,
Anil Gupta
-
Re: Hbase import Tsv performance (slow import)
Nick maillard 2012-10-24, 16:29
Hello Kevin
I'm using:
Hadoop 1.0.3
HBase 0.94.2
OS: Ubuntu 12.04
-
Re: Hbase import Tsv performance (slow import)
nick maillard 2012-10-24, 19:08
Hi Anil
I have one hard drive per slave. I have tested with 3 concurrent mappers and with 28 concurrent mappers per slave. Both times the total time was about 1 hour; the only difference was the time each map took, respectively 40 min and 1h10. I have turned off speculative execution.

I'll run a process tomorrow and look at disk I/O to check whether it is the bottleneck. But the test I ran this afternoon with 3 or 28 max map tasks per node makes me doubt it.

When I run 28 maps per node, I can load the whole file into the available maps in one pass, so all maps take 1h to complete and the whole process takes 1h and some minutes.

When I run 3 maps per node, the whole file is imported through 7 full passes of the available maps. In this case each map takes around 8-9 minutes to complete. So with 7 passes of 9 minutes each, the process takes about 1 hour to complete, same as before.

I don't understand this situation, and it leads me to believe I have missed a step somewhere. If someone has an idea, I'll gladly look into anything.
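The observation above fits a very simple model in which the wall time is the number of passes times the duration of one map task. The sketch below only restates the numbers given in the message (7 passes of about 9 minutes versus 1 pass of about 1 hour); `total_minutes` is a made-up helper for this illustration.

```python
def total_minutes(passes, minutes_per_task):
    """Wall time if the map tasks run in successive full passes (waves)."""
    return passes * minutes_per_task


few_slots = total_minutes(passes=7, minutes_per_task=9)     # 3 maps per node
many_slots = total_minutes(passes=1, minutes_per_task=60)   # 28 maps per node

# How much slower one task gets when everything runs in a single pass.
per_task_slowdown = 60 / 9

print(few_slots, many_slots, round(per_task_slowdown, 1))
```

Each task slows down by almost exactly the factor by which the number of passes shrinks. That is what one would expect if all mappers share a single fixed-rate resource, so adding concurrent mappers divides that rate among them instead of adding capacity.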
-
Re: Hbase import Tsv performance (slow import)
Nick maillard 2012-10-23, 17:13
Thanks for the help!
My conf files are:

Hadoop:
hdfs-site

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/runner/app/hadoop/dfs/data</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>

Mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>14</value>
    <description>The maximum number of map tasks that will be run
    simultaneously by a task tracker.
    </description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>14</value>
    <description>The maximum number of reduce tasks that will be run
    simultaneously by a task tracker.
    </description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <description>Java opts for the task tracker child processes.
    The following symbol, if present, will be interpolated: @taskid@ is replaced
    by current TaskID. Any other occurrences of '@' will go unchanged.
    For example, to enable verbose gc logging to a file named for the taskid in
    /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
    -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc

    The configuration variable mapred.child.ulimit can be used to control the
    maximum virtual memory of the child processes.
    </description>
  </property>
</configuration>

core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/runner/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>

For Hbase:
hbase-site:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:54310/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed Zookeeper
    true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2222</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ks25937.kimsufi.com</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/runner/hbase/hbase-0.94.2/tmp</value>
  </property>
</configuration>

I am currently running the import and looking at the logs to try to understand. This definitely seems fishy:

2012-10-23 18:39:49,107 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210231145_0010_m_000041_0 0.21332978%
2012-10-23 18:39:50,363 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210231145_0010_m_000028_0 0.20936884%
2012-10-23 18:49:38,098 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210231145_0010_m_000030_0: Task attempt_201210231145_0010_m_000030_0 failed to report status for 602 seconds. Killing!
2012-10-23 18:49:38,116 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
90 active threads
Thread 742 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$200(UNIXProcess.java:54)
    java.lang.UNIXProcess$3.run(UNIXProcess.java:174)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    java.lang.Thread.run(Thread.java:722)
Thread 740 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$200(UNIXProcess.java:54)
    java.lang.UNIXProcess$3.run(UNIXProcess.java:174)
-
Re: Hbase import Tsv performance (slow import)
Nicolas Liochon 2012-10-23, 17:32
Thanks, checking the schema itself is still interesting (cf. the link sent).
Also, with 3 machines and a replication factor of 3, all the machines are used during a write. As HBase writes all entries into a write-ahead log for safety, the number of writes is also doubled. So maybe your machine is just dying under the load. Anyway, here your cluster is going at the speed of the least powerful machine, and this machine has a workload multiplied by 6 compared to a single-machine config (i.e. just writing a file locally).

On Tue, Oct 23, 2012 at 7:13 PM, Nick maillard <[EMAIL PROTECTED]> wrote:
> Thanks for the help!
>
> My conf files are : Hadoop:
> hdfs-site
>
> <configuration>
> <property>
> <name>dfs.replication</name>
> <value>3</value>
> <description>Default block replication.
> The actual number of replications can be specified when the file is created.
> The default is used if replication is not specified in create time.
> </description>
> </property>
> <property>
> <name>dfs.data.dir</name>
> <value>/home/runner/app/hadoop/dfs/data</value>
> <description>Default block replication.
> The actual number of replications can be specified when the file is created.
> The default is used if replication is not specified in create time.
> </description>
> </property>
> <property>
> <name>dfs.datanode.max.xcievers</name>
> <value>4096</value>
> </property>
> </configuration>
>
>
> Mapred-site.xml
>
> <configuration>
> <property>
> <name>mapred.job.tracker</name>
> <value>master:54311</value>
> <description>The host and port that the MapReduce job tracker runs
> at. If "local", then jobs are run in-process as a single map
> and reduce task.
> </description>
> </property>
> <property>
> <name>mapred.tasktracker.map.tasks.maximum</name>
> <value>14</value>
> <description>The maximum number of map tasks that will be run
> simultaneously by a task tracker.
> </description>
> </property>
>
> <property>
> <name>mapred.tasktracker.reduce.tasks.maximum</name>
> <value>14</value>
> <description>The maximum number of reduce tasks that will be run
> simultaneously by a task tracker.
> </description>
> </property>
> <property>
> <name>mapred.child.java.opts</name>
> <value>-Xmx400m</value>
> <description>Java opts for the task tracker child processes.
> The following symbol, if present, will be interpolated: @taskid@ is replaced
> by current TaskID. Any other occurrences of '@' will go unchanged.
> For example, to enable verbose gc logging to a file named for the taskid in
> /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
> -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
>
> The configuration variable mapred.child.ulimit can be used to control the
> maximum virtual memory of the child processes.
> </description>
> </property>
> </configuration>
>
>
> core-site.xml
>
> <configuration>
> <property>
> <name>hadoop.tmp.dir</name>
> <value>/home/runner/app/hadoop/tmp</value>
> <description>A base for other temporary directories.</description>
> </property>
>
> <property>
> <name>fs.default.name</name>
> <value>hdfs://master:54310</value>
> <description>The name of the default file system. A URI whose
> scheme and authority determine the FileSystem implementation. The
> uri's scheme determines the config property (fs.SCHEME.impl) naming
> the FileSystem implementation class. The uri's authority is used to
> determine the host, port, etc. for a filesystem.</description>
> </property>
>
>
> For Hbase:
> hbase-site:
> <configuration>
> <property>
> <name>hbase.rootdir</name>
> <value>hdfs://master:54310/hbase</value>
> </property>
> <property>
> <name>hbase.cluster.distributed</name>
> <value>true</value>
> <description>The mode the cluster will be in. Possible values are
> false: standalone and pseudo-distributed setups with managed Zookeeper
> true: fully-distributed with unmanaged Zookeeper Quorum (see
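The factor of 6 in Nicolas's reply (replication 3, doubled by the write-ahead log) can be written out as a back-of-the-envelope calculation. This is a deliberately simplified sketch: it assumes each entry is written once to the WAL and once more when the memstore is flushed to a store file, and it ignores compactions, compression and MapReduce intermediate data. `physical_write_gb` is a made-up helper, and the 5 GB input size is the figure given elsewhere in the thread.

```python
def physical_write_gb(logical_gb, replication, wal=True):
    """Rough volume written to disk across the cluster for a puts-based load."""
    copies_per_entry = 2 if wal else 1   # WAL write plus the later flush
    return logical_gb * replication * copies_per_entry


nodes = 3
for replication in (3, 2):
    total = physical_write_gb(5, replication)
    print(f"replication={replication}: {total} GB in total, "
          f"{total / nodes:.1f} GB per node")
```

Under these assumptions, dropping dfs.replication from 3 to 2 lowers the written volume from 30 GB to 20 GB, a reduction of one third. That may help explain why the change alone did not produce a dramatic speed-up in the later tests.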
-
Re: Hbase import Tsv performance (slow import)Kevin O'dell 2012-10-23, 17:47
You will want to make sure your table is pre-split. Also Import does
puts, so you will want to make sure you are not flushing and blocking by raising your memstore, HLog, and blocking count. This can greatly improve your write speeds. I usually do a 256MB memstore (you can lower it later if it is not a write-heavy table), a 512MB HLog (same thing, you can lower it back to the default), and then raise the storefile blocking count to about 100.

On Tue, Oct 23, 2012 at 1:32 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:

> Thanks, checking the schema itself is still interesting (cf. the link sent)
> As well, with 3 machines and a replication factor of 3, all the machines
> are used during a write. As HBase writes all entries into a write-ahead-log
> for safety, the number of writes is also doubled. So maybe your machine is
> just dying under the load. Anyway, here your cluster is going at the speed
> of the least powerful machine, and this machine has a workload multiplied
> by 6 compared to a single machine config (i.e. just writing a file locally).
>
> On Tue, Oct 23, 2012 at 7:13 PM, Nick maillard <
> [EMAIL PROTECTED]> wrote:
>
>> Thanks for the help!
>>
>> My conf files are : Hadoop:
>> hdfs-site
>>
>> <configuration>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>3</value>
>>     <description>Default block replication.
>>     The actual number of replications can be specified when the file is created.
>>     The default is used if replication is not specified in create time.
>>     </description>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/home/runner/app/hadoop/dfs/data</value>
>>     <description>Default block replication.
>>     The actual number of replications can be specified when the file is created.
>>     The default is used if replication is not specified in create time.
>>     </description>
>>   </property>
>>   <property>
>>     <name>dfs.datanode.max.xcievers</name>
>>     <value>4096</value>
>>   </property>
>> </configuration>
>>
>>
>> Mapred-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>master:54311</value>
>>     <description>The host and port that the MapReduce job tracker runs
>>     at. If "local", then jobs are run in-process as a single map
>>     and reduce task.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>     <value>14</value>
>>     <description>The maximum number of map tasks that will be run
>>     simultaneously by a task tracker.
>>     </description>
>>   </property>
>>
>>   <property>
>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>     <value>14</value>
>>     <description>The maximum number of reduce tasks that will be run
>>     simultaneously by a task tracker.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.child.java.opts</name>
>>     <value>-Xmx400m</value>
>>     <description>Java opts for the task tracker child processes.
>>     The following symbol, if present, will be interpolated: @taskid@ is replaced
>>     by current TaskID. Any other occurrences of '@' will go unchanged.
>>     For example, to enable verbose gc logging to a file named for the taskid in
>>     /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
>>     -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
>>
>>     The configuration variable mapred.child.ulimit can be used to control the
>>     maximum virtual memory of the child processes.
>>     </description>
>>   </property>
>> </configuration>
>>
>>
>> core-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>hadoop.tmp.dir</name>
>>     <value>/home/runner/app/hadoop/tmp</value>
>>     <description>A base for other temporary directories.</description>
>>   </property>
>>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://master:54310</value>
>>     <description>The name of the default file system. A URI whose
>>     scheme and authority determine the FileSystem implementation. The

Kevin O'Dell
Customer Operations Engineer, Cloudera
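Nicolas's "multiplied by 6" estimate quoted above can be worked through as simple arithmetic. The following is only a rough sketch of that reasoning, not a measurement; the function name and the idea of modelling the WAL as a plain doubling are illustrative assumptions.

```python
def write_amplification(replication_factor: int, wal_enabled: bool = True) -> int:
    """Rough count of physical writes per logical write, relative to one local file write."""
    # With the WAL on, each entry is written twice: once to the log and
    # once more when the memstore is flushed to a store file.
    logical_writes = 2 if wal_enabled else 1
    # HDFS then stores each of those writes on replication_factor machines.
    return logical_writes * replication_factor


if __name__ == "__main__":
    # 3 machines, dfs.replication = 3, WAL on: the case described in the thread.
    print(write_amplification(3))
    # Same cluster if the WAL were not involved.
    print(write_amplification(3, wal_enabled=False))
```

With three machines and a replication factor of 3, every machine takes part in every write, which is why the slowest machine sets the pace.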
Kevin O'dell 2012-10-23, 17:47
-
Re: Hbase import Tsv performance (slow import)lars hofhansl 2012-10-25, 04:10
This is good advice, Kevin. We should add this to the HBase Reference Guide.

________________________________
From: Kevin O'dell <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, October 23, 2012 10:47 AM
Subject: Re: Hbase import Tsv performance (slow import)

You will want to make sure your table is pre-split. Also Import does puts, so you will want to make sure you are not flushing and blocking by raising your memstore, HLog, and blocking count. This can greatly improve your write speeds. I usually do a 256MB memstore (you can lower it later if it is not a write-heavy table), a 512MB HLog (same thing, you can lower it back to the default), and then raise the storefile blocking count to about 100.

Kevin O'Dell
Customer Operations Engineer, Cloudera
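For readers who want to try Kevin's numbers, here is a small sketch that converts them to bytes and renders an hbase-site.xml fragment. The thread does not name the properties, so the three property names below are my assumption about which settings are meant (they are the names I know from HBase releases of that era); please check them against the hbase-default.xml of your version before using them.

```python
MB = 1024 * 1024

# Assumed mapping from Kevin's wording to property names; not confirmed in the thread.
SUGGESTED = {
    "hbase.hregion.memstore.flush.size": 256 * MB,   # "256MB memstore"
    "hbase.regionserver.hlog.blocksize": 512 * MB,   # "512MB Hlog"
    "hbase.hstore.blockingStoreFiles": 100,          # "storefile blocking count"
}


def to_hbase_site_xml(props: dict) -> str:
    """Render a name/value mapping as an hbase-site.xml style fragment."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{name}</name>")
        lines.append(f"    <value>{value}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_hbase_site_xml(SUGGESTED))
```

As Kevin notes, these are values for a heavy import; they can be lowered again afterwards.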
lars hofhansl 2012-10-25, 04:10
-
Hbase import Tsv performance (slow import)Nick maillard 2012-10-23, 15:48
Hi everyone,

I'm starting with HBase and testing it for our needs. I have set up a Hadoop cluster of three machines and an HBase cluster on top of the same three machines: one master and two slaves.

I am testing the import of a 5GB CSV file with the importTsv tool. I copy the file into HDFS and then use the importTsv tool to import it into HBase.

Right now it takes a little over an hour to complete. It creates around 2 million entries in one table with a single family. If I use bulk uploading it goes down to 20 minutes.

My Hadoop job has 21 map tasks, but they all seem to take a very long time to finish, and many tasks end up timing out.

I am wondering what I have missed in my configuration. I have followed the different prerequisites in the documentation, but I am really unsure what is causing this slowdown. If I apply the wordcount example to the same file it takes only minutes to complete, so I am guessing the issue lies in my HBase configuration.

Any help or pointers would be appreciated.
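To put the figures in this message in perspective, they can be converted into rough throughput numbers. This is a back-of-the-envelope sketch only: it treats "a little over an hour" as exactly 60 minutes and "around 2 million" as exactly 2,000,000 rows, and the function name is made up for illustration.

```python
def throughput(size_gb: float, rows: int, minutes: float):
    """Return (MB per second, rows per second) for an import of the given size and duration."""
    seconds = minutes * 60
    mb_per_s = size_gb * 1024 / seconds
    rows_per_s = rows / seconds
    return mb_per_s, rows_per_s


if __name__ == "__main__":
    for label, minutes in (("puts", 60), ("bulk upload", 20)):
        mb_s, rows_s = throughput(5, 2_000_000, minutes)
        print(f"{label}: {mb_s:.1f} MB/s, {rows_s:.0f} rows/s")
```

Under those assumptions both paths come out at only a few MB per second across the whole cluster, which supports the suspicion that something in the setup is holding the job back.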
Nick maillard 2012-10-23, 15:48
-
Re: Hbase import Tsv performance (slow import)Anoop John 2012-10-24, 03:29
Hi

Using the ImportTsv tool you are trying to bulk load your data. Can you check how many mappers and reducers there were? Out of the total time, how much was taken by the map phase and how much by the reduce phase? This seems like an MR-related issue (maybe a configuration issue). In the bulk load case most of the work is done by the MR job: it reads the raw data, converts it into Puts, and writes them to HFiles. The MR output is the HFiles themselves. The next step in ImportTsv just places the HFiles under the table's region stores. There is no WAL usage in this bulk load.

-Anoop-
Anoop John 2012-10-24, 03:29
-
Re: Hbase import Tsv performance (slow import)ramkrishna vasudevan 2012-10-24, 04:55
As Kevin suggested, we can make use of bulk load that goes through the WAL and Memstore. Or the second option is to use the output of the mappers to create HFiles directly.

Regards
Ram
ramkrishna vasudevan 2012-10-24, 04:55
-
Re: Hbase import Tsv performance (slow import)anil gupta 2012-10-24, 05:09
Hi Anoop,

As per your last email, did you mean that the WAL is not used by the HBase Bulk Loader? If so, how do we ensure "no data loss" in case of a RegionServer failure?

Thanks,
Anil Gupta
anil gupta 2012-10-24, 05:09
-
Re: Hbase import Tsv performance (slow import)Anoop John 2012-10-24, 05:11
Hi Anil
Anoop John 2012-10-24, 05:11
-
Re: Hbase import Tsv performance (slow import)Anoop John 2012-10-24, 05:14
Hi Anil

In bulk loading, data is not put into HBase one row at a time. The MR job creates output in HFile form: it creates the KVs and writes them to file in the same order an HFile requires. Then the file is finally loaded into HBase. Only this final step uses the HBase RS, so there is no point in a WAL there. Am I making it clear for you? The data is already present as raw data in some txt or csv file :)

-Anoop-
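The point about writing KVs "in order as how HFile will look like" can be illustrated with a small sketch. This is not HBase code: the `KeyValue` tuple and `hfile_order` function are made-up names, and the ordering shown (row, family, qualifier ascending, then newest timestamp first) is my understanding of how HBase sorts cells, simplified to leave out the cell type.

```python
from typing import List, NamedTuple


class KeyValue(NamedTuple):
    """Simplified stand-in for an HBase cell; real cells also carry a type."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    value: bytes


def hfile_order(kvs: List[KeyValue]) -> List[KeyValue]:
    """Sort cells the way a bulk-load job must emit them before writing a file."""
    # Python compares bytes lexicographically by unsigned byte value,
    # and negating the timestamp puts the newest version first.
    return sorted(kvs, key=lambda kv: (kv.row, kv.family, kv.qualifier, -kv.timestamp))


if __name__ == "__main__":
    cells = [
        KeyValue(b"row2", b"f", b"a", 10, b"x"),
        KeyValue(b"row1", b"f", b"b", 10, b"y"),
        KeyValue(b"row1", b"f", b"a", 10, b"old"),
        KeyValue(b"row1", b"f", b"a", 20, b"new"),
    ]
    for kv in hfile_order(cells):
        print(kv)
```

Because the file is already sorted when it is handed to the region server, the load step only has to move it into place, which is why no per-row WAL entry is involved.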
Anoop John 2012-10-24, 05:14
-
Re: Hbase import Tsv performance (slow import)anil gupta 2012-10-24, 05:28
That's a very interesting fact. You made it clear, but my custom Bulk Loader generates a unique ID for every row in the map phase. So not all of my data is in csv or text. Is there a way that I can explicitly turn on WAL for bulk loading?

Thanks & Regards,
Anil Gupta
anil gupta 2012-10-24, 05:28
-
Re: Hbase import Tsv performance (slow import)Anoop John 2012-10-24, 06:07
> Is there a way that i can explicitly turn on WAL for bulk loading?

No. How do you generate the unique id? Remember that the initial steps don't need the HBase cluster at all. MR generates the HFiles and the output is only in files. Mappers also write their output to file. The only issue is if some mappers crash: the MR framework will then run that mapper again on the same data set. Will the unique id then be different? I think you need not worry about data loss from the HBase side, so the WAL is not required.

-Anoop-
Anoop John 2012-10-24, 06:07
-
Re: Hbase import Tsv performance (slow import)anil gupta 2012-10-24, 06:14
Anoop: The only issue is if some mappers crash: the MR framework will then run that mapper again on the same data set. Will the unique id then be different?

Anil: Yes, even for the same dataset the unique ID will be different. The unique ID does not depend on the data.

Thanks & Regards,
Anil Gupta
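The difference between an ID that changes on every mapper re-run and one that does not can be sketched as follows. This is purely illustrative and is not how Anil's loader works (the thread does not say how it generates IDs); both function names and the choice of input path, byte offset, and line as hash material are hypothetical.

```python
import hashlib
import uuid


def random_row_id() -> str:
    """ID independent of the data: a re-run of the same mapper yields a new value."""
    return uuid.uuid4().hex


def deterministic_row_id(input_path: str, byte_offset: int, line: str) -> str:
    """ID derived only from the input record: a re-run yields the same value."""
    material = f"{input_path}:{byte_offset}:{line}".encode("utf-8")
    return hashlib.sha1(material).hexdigest()


if __name__ == "__main__":
    record = ("/data/events.csv", 1024, "2012-10-24,click,42")
    print(random_row_id() == random_row_id())
    print(deterministic_row_id(*record) == deterministic_row_id(*record))
```

Either approach gives unique IDs within one successful run; the deterministic variant would additionally give the same IDs if the same input were processed again.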
anil gupta 2012-10-24, 06:14
-
Re: Hbase import Tsv performance (slow import)Anoop John 2012-10-24, 06:31
I think that, given your explanation of why you need the unique ID, it is okay. No need to worry about data loss. As long as you can make sure you generate a unique ID, things are fine. MR will make sure the job runs on the whole data set and that the output is persisted in files. Yes, these files are HFile(s) only. Then, finally, the HBase cluster is used for loading the HFiles into the region stores. Bulk loading huge data this way will be much faster than normal put()s.

-Anoop-
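A mapper that crashes is re-run by the MR framework on the same input, and with a unique ID that does not depend on the data the re-run produces different row keys. The toy Python model below (not HBase or Hadoop code) sketches why that is harmless on the file-based path. It assumes that the output of a failed task attempt is discarded and only the successful attempt's file is kept, while rows written directly to a table by a failed attempt stay in place. All names (`run_attempt`, `load_via_files`, `load_via_puts`, `crash_after`) are hypothetical.

```python
import itertools

def run_attempt(rows, attempt_no, crash_after=None):
    """Simulate one map task attempt.

    Each row gets an ID that does not depend on the data
    (here: attempt number + counter). Raises if the attempt "crashes".
    """
    counter = itertools.count()
    for i, value in enumerate(rows):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated mapper crash")
        yield (f"{value}-a{attempt_no}-{next(counter)}", value)

# Attempt 0 crashes before its third row, attempt 1 runs to completion.
ATTEMPTS = [(0, 2), (1, None)]

def load_via_files(rows):
    """File-based path: an attempt's output only counts if it completes."""
    committed = []
    for attempt_no, crash_after in ATTEMPTS:
        attempt_output = []              # private output of this attempt
        try:
            for kv in run_attempt(rows, attempt_no, crash_after):
                attempt_output.append(kv)
        except RuntimeError:
            continue                     # failed attempt: output thrown away
        committed = attempt_output       # successful attempt: output kept
        break
    return dict(committed)

def load_via_puts(rows):
    """Direct-write path: every emitted row lands in the table at once."""
    table = {}
    for attempt_no, crash_after in ATTEMPTS:
        try:
            for key, value in run_attempt(rows, attempt_no, crash_after):
                table[key] = value       # side effect survives a later crash
        except RuntimeError:
            continue
        break
    return table

rows = ["r1", "r2", "r3"]
files_result = load_via_files(rows)
puts_result = load_via_puts(rows)
print(len(files_result))  # 3: one copy of each input row
print(len(puts_result))   # 5: r1 and r2 were written by both attempts
```

In this model the file-based path ends up with exactly one key per input row, whereas direct writes with data-independent IDs leave the failed attempt's rows behind under different keys.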
-
Re: Hbase import Tsv performance (slow import) anil gupta 2012-10-24, 06:43
Yeah, we never used the HBase client API (puts) for loading a batch of millions of records. Can you tell me where, by default, the output HFile(s) from the MR job are stored in HDFS?

Thanks & Regards,
Anil Gupta
-
Re: Hbase import Tsv performance (slow import) ramkrishna vasudevan 2012-10-24, 05:52
Anil,
When you run ImportTSV, only the data present in the TSV file is parsed and loaded into HBase. How are you planning to generate the UniqueID? Your use case seems to be that your data is in a CSV file, but the unique ID that you need is not part of the TSV. Now you need the rows to be loaded into HBase through the WAL.

I would suggest first loading the existing TSV file into one HTable. Then, from that table, you can do a bulk load into another table using your custom mapper. There you can apply the logic that generates a unique ID for every row that comes out of the loaded table. The data can then be inserted into the new table through normal puts, which will use the WAL and memstore.

Regards
Ram
-
Re: Hbase import Tsv performance (slow import) anil gupta 2012-10-24, 06:11
Yes, the uniqueId is not part of the CSV file. In my bulk loader I use a combination of nodeId+processId+counter as the UniqueID for each row. I have to use the uniqueId since the remaining part of the rowkey is not unique. I think there are two approaches to solve this problem:
1. Generate HFiles through MR and then do an incremental load. I am fine with this approach, as we will have the entire trace of the data in HFiles.
2. Use prePut observers? I am already using the prePut hook for some other purpose.

Thanks & Regards,
Anil Gupta
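The nodeId+processId+counter scheme described in this message can be sketched roughly as follows. This is a minimal Python illustration, not the poster's actual bulk loader: the class name, the separators, the use of the host name as nodeId and the OS pid as processId, and the placement of the unique part at the end of the row key are all assumptions.

```python
import itertools
import os
import socket

class UniqueIdGenerator:
    """Builds IDs of the form <nodeId>-<processId>-<counter>."""

    def __init__(self, node_id=None, process_id=None):
        # Assumption: host name stands in for nodeId, OS pid for processId.
        self.node_id = node_id if node_id is not None else socket.gethostname()
        self.process_id = process_id if process_id is not None else os.getpid()
        self._counter = itertools.count()

    def next_id(self):
        # The counter makes IDs distinct within one process; node and
        # process IDs keep concurrently running processes apart.
        return f"{self.node_id}-{self.process_id}-{next(self._counter)}"

def make_row_key(non_unique_part, gen):
    # Hypothetical layout: non-unique part first, unique suffix last.
    return f"{non_unique_part}|{gen.next_id()}"

gen = UniqueIdGenerator(node_id="node1", process_id=4242)
keys = [make_row_key("2012-10-24", gen) for _ in range(3)]
print(keys)
# ['2012-10-24|node1-4242-0', '2012-10-24|node1-4242-1', '2012-10-24|node1-4242-2']
```

Because nothing in the ID is derived from the row itself, a second run over the same input in another process yields different keys, which matches the statement in the thread that the UniqueID does not depend on the data.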
-
Re: Hbase import Tsv performance (slow import) Jonathan Bishop 2012-10-25, 15:57
Nicolas,
I just went through the same exercise. There are many ways to make this go faster, but eventually I decided that bulk loading is the best solution, as run times scaled with the number of machines in my cluster when I used that approach.

One thing you can try is to turn off HBase's write-ahead log (WAL). But be aware that a regionserver failure will cause data loss if you do this.

Jon
-
Re: Hbase import Tsv performance (slow import) anil gupta 2012-10-25, 20:33
Hi Nicolas,
In my experience, you won't get good performance if you run 3 map tasks simultaneously on one hard drive. That seems like a lot of I/O on one disk.

HBase performs well when you have at least 5 nodes in the cluster. So running HBase on 3 nodes is not something you would do in prod.

Thanks,
Anil
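For a sense of scale, the figures from the original post in this thread (a 5 GB CSV file, around 2 million rows, a little over an hour for the importTsv run, 20 minutes with bulk uploading) translate into the aggregate rates computed below. This is only back-of-the-envelope arithmetic: it assumes 1 GB = 1024 MB and rounds "a little over an hour" down to 60 minutes, and the function name is hypothetical.

```python
def throughput(size_gb, rows, minutes):
    """Return (MB per second, rows per second) for a whole import run."""
    seconds = minutes * 60
    mb_per_s = size_gb * 1024 / seconds
    rows_per_s = rows / seconds
    return mb_per_s, rows_per_s

SIZE_GB = 5          # size of the CSV file, from the original post
ROWS = 2_000_000     # approximate number of entries created

for label, minutes in [("importTsv run", 60), ("bulk upload run", 20)]:
    mb_s, rows_s = throughput(SIZE_GB, ROWS, minutes)
    print(f"{label}: {mb_s:.2f} MB/s, {rows_s:.0f} rows/s")
# importTsv run: 1.42 MB/s, 556 rows/s
# bulk upload run: 4.27 MB/s, 1667 rows/s
```

These are end-to-end rates for the whole cluster, not disk measurements, so they say nothing by themselves about where the time is spent.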
-
Re: Hbase import Tsv performance (slow import) anil gupta 2012-10-25, 20:35
@Jonathan,
As per Anoop and Ram, the WAL is not used with bulk loading, so turning off the WAL won't have any impact on performance.

--
Thanks & Regards,
Anil Gupta
-
RE: Hbase import Tsv performance (slow import) Anoop Sam John 2012-10-26, 04:07
> As per Anoop and Ram, WAL is not used with bulk loading so turning off WAL won't have any impact on performance.

This is the case if HFileOutputFormat is being used. There is also a TableOutputFormat, which can be used as the OutputFormat for MR. There, writing to the WAL is applicable. Instead of writing to HFiles and uploading them in one shot, it puts the data into the HTable by calling the put() method.

-Anoop-
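The difference between the two output formats can be pictured with a small toy model. The Python sketch below is not HBase code and does not use any HBase API: `ToyRegion`, `put`, `bulk_load` and `write_sorted_file` are hypothetical names that only mirror the description in this message, where the put path logs every write and buffers it in memory, and the bulk-load path hands over a pre-sorted file without touching the log.

```python
class ToyRegion:
    """Very rough stand-in for a region's write-side state."""

    def __init__(self):
        self.wal = []          # write-ahead log entries, in arrival order
        self.memstore = {}     # in-memory buffer of recent writes
        self.store_files = []  # immutable, sorted files

    def put(self, key, value):
        # Put path: the write is logged first, then buffered in memory.
        self.wal.append((key, value))
        self.memstore[key] = value

    def bulk_load(self, sorted_file):
        # Bulk-load path: a finished, sorted file is adopted as-is.
        # Nothing is appended to the log.
        self.store_files.append(sorted_file)

def write_sorted_file(pairs):
    """Stand-in for the MR job that writes key/values in sorted order."""
    return sorted(pairs)

data = [("row3", "c"), ("row1", "a"), ("row2", "b")]

via_puts = ToyRegion()
for key, value in data:
    via_puts.put(key, value)

via_bulk = ToyRegion()
via_bulk.bulk_load(write_sorted_file(data))

print(len(via_puts.wal), len(via_bulk.wal))  # 3 0
print(via_bulk.store_files[0][0])            # ('row1', 'a')
```

In this picture, disabling the log only changes anything on the put path, which is the point being made about TableOutputFormat versus HFileOutputFormat.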
-
Re: Hbase import Tsv performance (slow import) Nicolas Liochon 2012-10-23, 16:46
Hi,
The schema design is important. There is at least this entry to look at: http://hbase.apache.org/book.html#rowkey.design

For the config, could you pastebin the HDFS and HBase config files you used?

N.
|