|
|
-
DBOutputFormat Speed Issues
Nick Jones 2010-01-28, 19:39
Hi all, I have a use case for collecting several rows from MySQL of compressed/unstructured data (n rows), expanding the data set, and storing the expanded results back into a MySQL DB (100,000n rows). DBInputFormat seems to perform reasonably well but DBOutputFormat is inserting rows one-by-one. How can I take advantage of MySQL's support of generating fewer insert statements with more values within each one?
Thanks. -- Nick Jones
-
Re: DBOutputFormat Speed Issues
Aaron Kimball 2010-01-31, 22:03
Nick,
I'm afraid that right now the only available OutputFormat for JDBC is that one. You'll note that DBOutputFormat doesn't really include much support for special-casing to MySQL or other targets.
Your best bet is to probably copy the code from DBOutputFormat and DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and modify the code in the RecordWriter to generate PreparedStatements containing batched insert statements.
If you arrive at a solution which is pretty general-purpose/robust, please consider contributing it back to the Hadoop project :) If you do so, send me an email off-list; I'm happy to help with advice on developing better DB integration code, reviewing your work, etc.
Also on the input side, you should really be using DataDrivenDBInputFormat instead of the older DBIF :) Sqoop (in src/contrib/sqoop on Apache 0.21 / CDH 0.20) has pretty good support for parallel imports, and uses this InputFormat instead.
- Aaron
On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones <[EMAIL PROTECTED]> wrote:
> Hi all, > I have a use case for collecting several rows from MySQL of > compressed/unstructured data (n rows), expanding the data set, and storing > the expanded results back into a MySQL DB (100,000n rows). DBInputFormat > seems to perform reasonably well but DBOutputFormat is inserting rows > one-by-one. How can I take advantage of MySQL's support of generating fewer > insert statements with more values within each one? > > Thanks. > -- > Nick Jones > >
-
Re: DBOutputFormat Speed Issues
Sonal Goyal 2010-02-01, 16:03
Hi Nick,
If you dont mind, can you please share your performance benchmarks of using DataDrivenInputFormat/DBInputFormat and MySQL?
Thanks and Regards, Sonal On Mon, Feb 1, 2010 at 3:33 AM, Aaron Kimball <[EMAIL PROTECTED]> wrote:
> Nick, > > I'm afraid that right now the only available OutputFormat for JDBC is that > one. You'll note that DBOutputFormat doesn't really include much support > for > special-casing to MySQL or other targets. > > Your best bet is to probably copy the code from DBOutputFormat and > DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and modify > the code in the RecordWriter to generate PreparedStatements containing > batched insert statements. > > If you arrive at a solution which is pretty general-purpose/robust, please > consider contributing it back to the Hadoop project :) If you do so, send > me > an email off-list; I'm happy to help with advice on developing better DB > integration code, reviewing your work, etc. > > Also on the input side, you should really be using DataDrivenDBInputFormat > instead of the older DBIF :) Sqoop (in src/contrib/sqoop on Apache 0.21 / > CDH 0.20) has pretty good support for parallel imports, and uses this > InputFormat instead. > > - Aaron > > On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones <[EMAIL PROTECTED]> wrote: > > > Hi all, > > I have a use case for collecting several rows from MySQL of > > compressed/unstructured data (n rows), expanding the data set, and > storing > > the expanded results back into a MySQL DB (100,000n rows). DBInputFormat > > seems to perform reasonably well but DBOutputFormat is inserting rows > > one-by-one. How can I take advantage of MySQL's support of generating > fewer > > insert statements with more values within each one? > > > > Thanks. > > -- > > Nick Jones > > > > >
-
RE: DBOutputFormat Speed Issues
Jones, Nick 2010-02-01, 18:03
Sonal, I'll try and put together something soon. I haven't located documentation on DataDrivenInputFormat yet to implement it. :)
Nick Jones
-----Original Message----- From: Sonal Goyal [mailto:[EMAIL PROTECTED]] Sent: Monday, February 01, 2010 10:03 AM To: [EMAIL PROTECTED] Subject: Re: DBOutputFormat Speed Issues
Hi Nick,
If you dont mind, can you please share your performance benchmarks of using DataDrivenInputFormat/DBInputFormat and MySQL?
Thanks and Regards, Sonal On Mon, Feb 1, 2010 at 3:33 AM, Aaron Kimball <[EMAIL PROTECTED]> wrote:
> Nick, > > I'm afraid that right now the only available OutputFormat for JDBC is that > one. You'll note that DBOutputFormat doesn't really include much support > for > special-casing to MySQL or other targets. > > Your best bet is to probably copy the code from DBOutputFormat and > DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and modify > the code in the RecordWriter to generate PreparedStatements containing > batched insert statements. > > If you arrive at a solution which is pretty general-purpose/robust, please > consider contributing it back to the Hadoop project :) If you do so, send > me > an email off-list; I'm happy to help with advice on developing better DB > integration code, reviewing your work, etc. > > Also on the input side, you should really be using DataDrivenDBInputFormat > instead of the older DBIF :) Sqoop (in src/contrib/sqoop on Apache 0.21 / > CDH 0.20) has pretty good support for parallel imports, and uses this > InputFormat instead. > > - Aaron > > On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones <[EMAIL PROTECTED]> wrote: > > > Hi all, > > I have a use case for collecting several rows from MySQL of > > compressed/unstructured data (n rows), expanding the data set, and > storing > > the expanded results back into a MySQL DB (100,000n rows). DBInputFormat > > seems to perform reasonably well but DBOutputFormat is inserting rows > > one-by-one. How can I take advantage of MySQL's support of generating > fewer > > insert statements with more values within each one? > > > > Thanks. > > -- > > Nick Jones > > > > >
|
|