I was recently tasked with looking into a problem with Sqoop's
incremental import on our installation: every import after the
first would report success, but the data would never appear. A
temporary file was created on HDFS with the data but deleted upon
completion rather than being moved into place.
It turned out to be a conflict between the "direct mode" database
manager (for PostgreSQL, in this case) and "incremental mode" import.
Ordinarily Sqoop ends up creating files named part-m-nnnnn where nnnnn
is an incrementing file partition number. However, the direct mode
importer creates files of the form data-nnnnn. This poses a problem
because AppendUtils, which is used to move files into place at the end
of an incremental import, only copies files which match that
part-m-nnnnn format and discards the rest.
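The effect of such a filter can be illustrated with a minimal Java
sketch. This is not the actual AppendUtils code: the class name, the
method names, and the regular expression are assumptions made up for
illustration, based only on the file naming described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PartFileFilterSketch {
    // Assumed pattern: "part-m-" followed by exactly five digits.
    private static final Pattern PART_FILE = Pattern.compile("part-m-\\d{5}");

    // True only for names in the part-m-nnnnn form.
    static boolean isPartFile(String fileName) {
        return PART_FILE.matcher(fileName).matches();
    }

    // Keeps the names that would be moved into place; all others are dropped.
    static List<String> filesToMove(List<String> fileNames) {
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            if (isPartFile(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> regular = Arrays.asList("part-m-00000", "part-m-00001");
        List<String> direct = Arrays.asList("data-00000", "data-00001");
        System.out.println(filesToMove(regular)); // [part-m-00000, part-m-00001]
        System.out.println(filesToMove(direct));  // []
    }
}
```

With a filter of this shape, the direct mode output yields an empty
list, which is consistent with the symptom: the job succeeds but
nothing is moved into the target directory.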
I've written a patch which causes direct imports to use the same naming
convention used elsewhere. Attached please also find some changes to
AppendUtils which improve resiliency especially if there happen to be
multiple concurrent operations on the same table. This patch is against
sqoop-1.3.0-cdh3u3 but seems to apply and build with minimal changes
across the whole 1.x series.
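As a rough illustration of the naming change only (not the patch
itself), the two conventions differ just in the prefix placed before
the zero-padded partition number. The class and method names below are
hypothetical.

```java
import java.util.Locale;

public class PartNameSketch {
    // Name in the part-m-nnnnn form for a given partition number.
    static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-m-%05d", partition);
    }

    // Name in the data-nnnnn form produced by the direct mode importer.
    static String directFileName(int partition) {
        return String.format(Locale.ROOT, "data-%05d", partition);
    }

    public static void main(String[] args) {
        System.out.println(partFileName(0));   // part-m-00000
        System.out.println(directFileName(0)); // data-00000
    }
}
```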
Please let me know if anyone finds this useful or if you have any
further suggestions. In particular, I am curious where the
"part-m-nnnnn" naming comes from and whether the "-m" signifies
anything. I hunted for the code which creates those files, but with
no luck.
Thanks and regards,
Jarek Jarcec Cecho 2013-06-13, 16:52
Tim Howe 2013-06-13, 18:22
Jarek Jarcec Cecho 2013-06-13, 18:31