FileNotFoundException in Reduce step when running importtsv program
Hi all,

I am attempting to bulk load data into HBase using the importtsv program. I
have a very wide table (about 200 columns across 2 column families), and right
now I'm trying to load data from a single file with 1 million rows.
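
For reference, this is the shape of the command I'm running (the table name,
column mapping, and paths here are placeholders, not my exact values):

    hadoop jar $HBASE_HOME/hbase-0.90.2.jar importtsv \
        -Dimporttsv.columns=HBASE_ROW_KEY,cf1:c001,cf2:c002 \
        awards /user/nichole/input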

Importtsv works fine for this data when I am writing directly to the table.
However, I would like the import to write to an output file instead, using the
'importtsv.bulk.output' option. I have applied the HBASE-1861 patch
(https://issues.apache.org/jira/browse/HBASE-1861) to allow bulk loads with
multiple column families.
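
A minimal sketch of the bulk-output variant, followed by the completebulkload
step that moves the generated HFiles into the table (same placeholder names as
above; the output path matches the one in the error below):

    hadoop jar $HBASE_HOME/hbase-0.90.2.jar importtsv \
        -Dimporttsv.columns=HBASE_ROW_KEY,cf1:c001,cf2:c002 \
        -Dimporttsv.bulk.output=hdfs://master:9000/awardsData \
        awards /user/nichole/input
    hadoop jar $HBASE_HOME/hbase-0.90.2.jar completebulkload /awardsData awards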

When I run the bulk upload with the output file option on my data, it always
fails in the reduce step. A large number of reduce tasks (2956) are created;
they all reach about 35% completion and then fail with the following error:
2011-03-17 11:52:48,095 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.FileNotFoundException: File does not exist: hdfs://master:9000/awardsData/_temporary/_attempt_201103151859_0066_r_000000_0
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:468)
    at org.apache.hadoop.hbase.regionserver.StoreFile.getUniqueFile(StoreFile.java:580)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.writeMetaData(HFileOutputFormat.java:186)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.close(HFileOutputFormat.java:247)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-03-17 11:52:48,100 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

I've put the full output of the reduce task attempt here:
http://pastebin.com/WMfqUwqC

I've tried running the program on a small table (3 column families, inserting
3 values each for 1 million rows) and it works fine, though only 1 reduce task
is created in that case.
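
For what it's worth, my understanding from reading HFileOutputFormat in 0.90.x
is that configureIncrementalLoad sets one reduce task per region of the target
table, which would explain the 2956 reducers here versus 1 for the freshly
created small table. A rough way to sanity-check the region count from the
shell (table name 'awards' is again a placeholder):

    # counts .META. rows for the table; one info:regioninfo line per region
    echo "scan '.META.', {COLUMNS => ['info:regioninfo']}" | hbase shell | grep -c ' awards,'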

Any idea what the problem could be?

FYI, my cluster has 4 nodes, all acting as datanodes/regionservers, running
64-bit Red Hat Linux. I'm running the hadoop-0.20-append branch and, for
HBase, the latest revision of the 0.90.2 branch.

Thanks for your help,
Nichole