Subject: Sqoop and S3 (Sqoop user mailing list)


Hi,

I recently tried to use Sqoop to export a Hive table that lives on S3 into my MySQL server:

sqoop export --options-file config.txt --table _universe --export-dir s3n://key:secret@mybucket/universe --input-fields-terminated-by '\0001' -m 1 --input-null-string '\\N' --input-null-non-string '\\N'

My Sqoop runs on a CDH4 cluster on EC2. I was getting errors such as the following:

13/02/11 17:37:15 ERROR security.UserGroupInformation: PriviledgedActionException as:XXX (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /universe/000000_0.snappy
13/02/11 17:37:15 ERROR tool.ExportTool: Encountered IOException running export job: java.io.FileNotFoundException: File does not exist: /universe/000000_0.snappy

Since the files do exist on S3, I was reminded of getting the same errors when running Hive queries against this table. Hive was failing back then because of a bug in CombineFileInputFormat when it is used against a non-default file system. These issues have since been fixed in Hadoop:
https://issues.apache.org/jira/browse/MAPREDUCE-1806
https://issues.apache.org/jira/browse/MAPREDUCE-2704
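
To illustrate the failure mode I suspect, here is a rough sketch in plain Python. It is not Hadoop or Sqoop code: the helper names and the default file system URI (hdfs://namenode:8020) are made up for the example. It only shows how dropping the scheme and bucket from an S3 URI leaves the same bare path that appears in the error above, which would then be looked up on the default file system.

```python
from urllib.parse import urlsplit


def strip_to_path(uri: str) -> str:
    """Keep only the path component, discarding scheme and authority.

    This mimics (as an assumption) what a buggy input format might do.
    """
    return urlsplit(uri).path


def resolve(path_or_uri: str, default_fs: str) -> str:
    """Resolve a path the way a client would: a path without a scheme
    is looked up on the default file system."""
    if urlsplit(path_or_uri).scheme:
        return path_or_uri
    return default_fs.rstrip("/") + path_or_uri


# File under the export directory from the command above.
export_file = "s3n://key:secret@mybucket/universe/000000_0.snappy"
# Hypothetical default file system of the cluster.
default_fs = "hdfs://namenode:8020"

bare_path = strip_to_path(export_file)
print(bare_path)                        # same bare path as in the error message
print(resolve(bare_path, default_fs))   # now points at HDFS, not S3
print(resolve(export_file, default_fs)) # a fully qualified URI is left alone
```

If this is what happens, the file is being searched for on HDFS, where it does not exist, even though it is present in the S3 bucket.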
I believe Sqoop uses a version of CombineFileInputFormat which, as far as I can tell from the latest sources on Git, has not incorporated the above fixes. My questions for the user group:
Am I completely off in my investigations?
Is there something I am missing in configuring Sqoop for exporting from S3?
Is there a way for me to bypass the CombineFileInputFormat so I can make my exports work?

Many thanks,

Jurgen
