Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # user >> DBOutputFormat Speed Issues

Copy link to this message
Re: DBOutputFormat Speed Issues

I'm afraid that right now the only available OutputFormat for JDBC is that
one. You'll note that DBOutputFormat doesn't really include much support for
special-casing to MySQL or other targets.

Your best bet is to probably copy the code from DBOutputFormat and
DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and modify
the code in the RecordWriter to generate PreparedStatements containing
batched insert statements.

If you arrive at a solution which is pretty general-purpose/robust, please
consider contributing it back to the Hadoop project :) If you do so, send me
an email off-list; I'm happy to help with advice on developing better DB
integration code, reviewing your work, etc.

Also on the input side, you should really be using DataDrivenDBInputFormat
instead of the older DBIF :) Sqoop (in src/contrib/sqoop on Apache 0.21 /
CDH 0.20) has pretty good support for parallel imports, and uses this
InputFormat instead.

- Aaron

On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones <[EMAIL PROTECTED]> wrote:

> Hi all,
> I have a use case for collecting several rows from MySQL of
> compressed/unstructured data (n rows), expanding the data set, and storing
> the expanded results back into a MySQL DB (100,000n rows). DBInputFormat
> seems to perform reasonably well but DBOutputFormat is inserting rows
> one-by-one.  How can I take advantage of MySQL's support of generating fewer
> insert statements with more values within each one?
> Thanks.
> --
> Nick Jones