Re: DBInputFormat / DBWritable question
David Rosenstrauch 2010-08-09, 18:09
Tnx much for the info, and the additional tips.

Unfortunately we're doing a lot of transforming of the DB data as we're
bringing it into Hadoop, so I don't think Sqoop's an option.

Thanks again,

DR

On 08/06/2010 12:50 AM, Aaron Kimball wrote:
> The InputFormat instantiates a RecordReader (DBRecordReader) in the same
> process as the Mapper. The DBWritable instances are instantiated inside the
> RecordReader and fed directly to your mapper.
>
> If your mapper then processes the data and sends it directly to the
> OutputFormat (e.g., through TextOutputFormat, which just calls
> key/val.toString()), then you do not need to implement the Writable
> interface.
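
A minimal sketch of that map-only path (the mapper, record fields, and output
types here are made up for illustration, not taken from the thread): the
DBWritable value arrives already populated from the RecordReader, gets
transformed, and leaves the mapper as plain Text, so it never has to survive a
shuffle. UserRecord is the DBWritable-only class sketched a bit further down.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UserRecordMapper
    extends Mapper<LongWritable, UserRecord, Text, NullWritable> {

  private final Text out = new Text();

  @Override
  protected void map(LongWritable key, UserRecord value, Context context)
      throws IOException, InterruptedException {
    // Transform the row however needed and emit plain text; the UserRecord
    // itself is never serialized past this point.
    out.set(value.getId() + "\t" + value.getName());
    context.write(out, NullWritable.get());
  }
}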
>
> If you intend to serialize your data to SequenceFiles (through
> SequenceFileOutputFormat, or otherwise) or as intermediate data (to be
> consumed by a reducer) then you need to implement Writable.
>
> For that matter, if you don't intend to use DBOutputFormat with this data,
> then you don't even need to provide a body for the "void
> write(PreparedStatement)" method; just stub it.
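
Concretely, a DBWritable that only ever rides from the RecordReader into the
mapper could look like this (table and column names are hypothetical);
readFields(ResultSet) does the real work, and write(PreparedStatement) is the
stub Aaron mentions:

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Implements DBWritable only -- no Writable -- since the record is consumed
// by the mapper and never written to intermediate or SequenceFile output.
public class UserRecord implements DBWritable {
  private long id;
  private String name;

  @Override
  public void readFields(ResultSet resultSet) throws SQLException {
    // Populate fields from the current row of the input query.
    id = resultSet.getLong("id");
    name = resultSet.getString("name");
  }

  @Override
  public void write(PreparedStatement statement) throws SQLException {
    // Stub: this class is never used with DBOutputFormat.
  }

  public long getId() { return id; }
  public String getName() { return name; }
}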
>
> A couple other tips:
> * Consider using DataDrivenDBInputFormat. It's considerably
> higher-throughput.
> * If you're using CDH (Cloudera's Distribution for Hadoop), rather than
> write your own DBWritable, use Sqoop's code generation capability (sqoop
> codegen --connect ... --table ...) to create your Java class for you.
> * Related, if all you're doing is importing a copy of the data to HDFS,
> Sqoop can handle that for you pretty easily :)
>
> See github.com/cloudera/sqoop and archive.cloudera.com/cdh/3/sqoop for more
> info.
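
For reference, a rough driver along the lines of the DataDrivenDBInputFormat
tip above might look like the following sketch (the JDBC URL, credentials,
table, and the UserRecord/UserRecordMapper classes are all placeholders). The
bounding query lets the framework split the "id" range across map tasks, and
setNumReduceTasks(0) keeps the job map-only, as Harsh notes below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DbImportJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Driver class, connection URL, and credentials are placeholders.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");

    Job job = new Job(conf, "db-import");
    job.setJarByClass(DbImportJob.class);
    job.setMapperClass(UserRecordMapper.class);
    job.setNumReduceTasks(0);                    // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    // The input query carries the $CONDITIONS placeholder; the bounding
    // query gives DataDrivenDBInputFormat the min/max of the split column.
    DataDrivenDBInputFormat.setInput(job, UserRecord.class,
        "SELECT id, name FROM users WHERE "
            + DataDrivenDBInputFormat.SUBSTITUTE_TOKEN,
        "SELECT MIN(id), MAX(id) FROM users");

    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}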
>
> Cheers,
> - Aaron
>
> On Wed, Aug 4, 2010 at 7:41 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> AFAIK you don't really need serialization if your job is a map-only
>> one; the OutputFormat/RecWriter (if any) should take care of it.
>>
>> On Thu, Aug 5, 2010 at 7:07 AM, David Rosenstrauch <[EMAIL PROTECTED]> wrote:
>>> I'm working on a M/R job which uses DBInputFormat.  So I have to create
>>> my own DBWritable for this.  I'm a little bit confused about how to
>>> implement this though.
>>>
>>> In the sample code in the Javadoc for the DBWritable class, the
>>> MyWritable implements both DBWritable and Writable - thereby forcing the
>>> author of the MyWritable class to implement the methods to
>>> serialize/deserialize it to/from DataInput & DataOutput.  Without getting
>>> into too much detail, having to implement this serialization would add a
>>> good bit of complexity to my code.
>>>
>>> However, the DBWritable that I'm writing really doesn't need to exist
>>> beyond the Mapper.  I.e., it'll be input to the Mapper, but the Mapper
>>> won't emit it out to the sort/reduce steps.  And after doing some
>>> reading/digging through the code, it looks to me like the InputFormat and
>>> the Mapper always get run on the same host & JVM.  If that's in fact the
>>> case, then there'd be no need for me to make my DBWritable implement
>>> Writable also and so I could avoid the whole serialization/deserialization
>>> issue.
>>>
>>> So my question is basically:  have I got this correct?  Do the InputFormat
>>> and the Mapper always run in the same VM?  (In which case I can do what
>>> I'm planning and code the DBWritable without the serialization headaches
>>> from the Writable class.)
>>>
>>> TIA,
>>>
>>> DR
>>>
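
For contrast with the Javadoc sample David mentions, the dual-interface
version would additionally carry the Writable plumbing shown below (again with
made-up fields). This is exactly the extra serialization work that can be
skipped when the record never leaves the mapper:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// The "full" variant following the DBWritable Javadoc pattern: both
// interfaces, so the record can also travel through the shuffle or be
// written to SequenceFiles.
public class UserRecordFull implements DBWritable, Writable {
  private long id;
  private String name;

  public void readFields(ResultSet resultSet) throws SQLException {
    id = resultSet.getLong("id");
    name = resultSet.getString("name");
  }

  public void write(PreparedStatement statement) throws SQLException {
    statement.setLong(1, id);
    statement.setString(2, name);
  }

  // The extra Writable methods -- the serialization boilerplate that a
  // purely mapper-side record does not need.
  public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    name = in.readUTF();
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    out.writeUTF(name);
  }
}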
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>
>