Re: DBInputFormat / DBWritable question
David Rosenstrauch 2010-08-09, 18:09
Thanks much for the info, and the additional tips.
Unfortunately we're doing a lot of transforming of the DB data as we're
bringing it into Hadoop, so I don't think Sqoop's an option.
On 08/06/2010 12:50 AM, Aaron Kimball wrote:
> The InputFormat instantiates a RecordReader (DBRecordReader) in the same
> process as the Mapper. The DBWritable instances are instantiated inside the
> RecordReader and fed directly to your mapper.
> If your mapper then processes the data and sends it directly to the
> OutputFormat (e.g., through TextOutputFormat which just calls
> key/val.toString()) then you do not need to implement the Writable interface.
> If you intend to serialize your data to SequenceFiles (through
> SequenceFileOutputFormat, or otherwise) or as intermediate data (to be
> consumed by a reducer) then you need to implement Writable.
> For that matter, if you don't intend to use DBOutputFormat with this data,
> then you don't even need to provide a body for the "void
> write(PreparedStatement)" method; just stub it.
> A couple other tips:
> * Consider using DataDrivenDBInputFormat. It's considerably
> * If you're using CDH (Cloudera's Distribution for Hadoop), rather than
> write your own DBWritable, use Sqoop's code generation capability (sqoop
> codegen --connect ... --table ...) to create your java class for you.
> * Related, if all you're doing is importing a copy of the data to HDFS,
> Sqoop can handle that for you pretty easily :)
> See github.com/cloudera/sqoop and archive.cloudera.com/cdh/3/sqoop for more information.
> - Aaron
> On Wed, Aug 4, 2010 at 7:41 PM, Harsh J<[EMAIL PROTECTED]> wrote:
>> AFAIK you don't really need serialization if your job is a map-only
>> one; the OutputFormat/RecWriter (if any) should take care of it.
>> On Thu, Aug 5, 2010 at 7:07 AM, David Rosenstrauch <[EMAIL PROTECTED]> wrote:
>>> I'm working on an M/R job which uses DBInputFormat. So I have to create my
>>> own DBWritable for this. I'm a little bit confused about how to implement
>>> this though.
>>> In the sample code in the Javadoc for the DBWritable class, the MyWritable
>>> implements both DBWritable and Writable - thereby forcing the author of the
>>> MyWritable class to implement the methods to serialize/deserialize it
>>> to/from DataInput & DataOutput. Without getting into too much detail,
>>> having to implement this serialization would add a good bit of complexity to
>>> my code.
>>> However, the DBWritable that I'm writing really doesn't need to exist outside
>>> the Mapper. I.e., it'll be input to the Mapper, but the Mapper won't write
>>> it out to the sort/reduce steps. And after doing some reading/digging
>>> through the code, it looks to me like the InputFormat and the Mapper always
>>> get run on the same host & JVM. If that's in fact the case, then there'd be
>>> no need for me to make my DBWritable implement Writable also and so I can
>>> avoid the whole serialization/deserialization issue.
>>> So my question is basically: have I got this correct? Do the InputFormat
>>> and the Mapper always run in the same VM? (In which case I can do what I'm
>>> planning and code the DBWritable without the serialization headaches from
>>> the Writable class.)
>> Harsh J
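
A minimal sketch of the approach discussed in this thread: a DBWritable that does not implement Writable and stubs out write(PreparedStatement). The UserRecord class, its two columns (id, name), and the accessor names are hypothetical and only for illustration. The small DBWritable interface declared here is a stand-in with the same two methods as Hadoop's org.apache.hadoop.mapreduce.lib.db.DBWritable, included only so the sketch compiles without Hadoop on the classpath. In a real job you would delete it and import the Hadoop interface instead.

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Stand-in mirroring the two methods of Hadoop's
// org.apache.hadoop.mapreduce.lib.db.DBWritable (assumption: replace this
// with the real import when building against Hadoop).
interface DBWritable {
    void write(PreparedStatement statement) throws SQLException;
    void readFields(ResultSet resultSet) throws SQLException;
}

// Hypothetical record for a two-column query: SELECT id, name FROM ...
// It deliberately does NOT implement Writable, on the assumption that the
// record is only consumed inside the Mapper and never serialized to a
// reducer or a SequenceFile.
class UserRecord implements DBWritable {
    private long id;
    private String name;

    @Override
    public void readFields(ResultSet resultSet) throws SQLException {
        // Populate the fields from the current row, by column position.
        id = resultSet.getLong(1);
        name = resultSet.getString(2);
    }

    @Override
    public void write(PreparedStatement statement) throws SQLException {
        // Intentionally empty: per the advice above, this is only needed
        // when the record is written back through DBOutputFormat.
    }

    long getId() { return id; }

    String getName() { return name; }
}
```

If the record later needs to reach a reducer or a SequenceFile, it would also have to implement Writable, as noted earlier in the thread.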