I have a question about expected run times for the importtsv program.
My cluster has four nodes. All four machines are datanodes / regionservers /
tasktrackers, with one node also acting as namenode, jobtracker, and
hmaster. I'm running on Red Hat 5.5, 64 GB RAM, 2.8 GHz, 8 CPUs. I'm using the
Hadoop 0.20 append branch and HBase 0.90.2.
I'm running the importtsv program with the bulk output option to import a
data file with 1 million rows. My table has about 200 columns and two column
families. Each line in the input file has about 2,000 bytes of data. I've
installed the HBASE-1861 patch as well.
When I run this program, it takes about 13 minutes to complete. I'm just
wondering if this time is expected or if we're possibly doing something wrong.
There are 22 map tasks created and these complete quickly (all map tasks
complete in about 30 seconds). There is a single reduce task, and it takes
12 or so minutes to complete. On the box where this reduce task is running I
notice that CPU usage is never more than 15-20%.
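To put the reduce time in perspective, here is a rough back-of-the-envelope
throughput estimate. It is only a sketch built from the approximate figures in
this post (1 million rows, about 2,000 bytes per line, about 12 minutes in the
reduce task); the function name is just for illustration.

```python
# Rough reduce-phase throughput from the approximate figures in this post.
ROWS = 1_000_000          # rows in the input file
BYTES_PER_ROW = 2_000     # approximate size of each input line
REDUCE_SECONDS = 12 * 60  # the single reduce task takes about 12 minutes


def reduce_throughput(rows, bytes_per_row, seconds):
    """Return (rows per second, megabytes per second) for the reduce phase."""
    total_bytes = rows * bytes_per_row
    return rows / seconds, total_bytes / seconds / 1e6


rows_per_s, mb_per_s = reduce_throughput(ROWS, BYTES_PER_ROW, REDUCE_SECONDS)
print(f"{rows_per_s:.0f} rows/s, {mb_per_s:.2f} MB/s")
# prints "1389 rows/s, 2.78 MB/s"
```

So the single reducer is handling roughly 2 GB of input at under 3 MB/s, which
seems low given that the CPU is mostly idle.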
I don't really see much time improvement over just writing directly to the
table. Writing directly to the table (no output option) using importtsv took
about 15-20 minutes on average with the same data file.
Any advice would be greatly appreciated.