MapReduce >> mail # user >> DBInputFormat / DBWritable question


Re: DBInputFormat / DBWritable question
Thanks much for the info, and the additional tips.

Unfortunately we're doing a lot of transforming of the DB data as we're
bringing it into Hadoop, so I don't think Sqoop's an option.

Thanks again,

DR

On 08/06/2010 12:50 AM, Aaron Kimball wrote:
> The InputFormat instantiates a RecordReader (DBRecordReader) in the same
> process as the Mapper. The DBWritable instances are instantiated inside the
> RecordReader and fed directly to your mapper.
>
> If your mapper then processes the data and sends it directly to the
> OutputFormat (e.g., through TextOutputFormat which just calls
> key/val.toString())  then you do not need to implement the Writable
> interface.
>
> If you intend to serialize your data to SequenceFiles (through
> SequenceFileOutputFormat, or otherwise) or as intermediate data (to be
> consumed by a reducer) then you need to implement Writable.
>
> For that matter, if you don't intend to use DBOutputFormat with this data,
> then you don't even need to provide a body for the "void
> write(PreparedStatement)" method; just stub it.
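
A minimal sketch of that stubbed approach (the class and column names are hypothetical, and a package-private stand-in interface is declared in place of org.apache.hadoop.mapreduce.lib.db.DBWritable so the snippet compiles without Hadoop jars on the classpath):

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Stand-in for org.apache.hadoop.mapreduce.lib.db.DBWritable, declared here
// only so this sketch compiles without Hadoop on the classpath.
interface DBWritable {
  void write(PreparedStatement statement) throws SQLException;
  void readFields(ResultSet resultSet) throws SQLException;
}

// Read via DBInputFormat but never written back to the database, so
// write(PreparedStatement) is left as an empty stub.
class UserRow implements DBWritable {
  private int id;
  private String name;

  public void readFields(ResultSet resultSet) throws SQLException {
    // Populate the record from the current row; column names are made up.
    id = resultSet.getInt("id");
    name = resultSet.getString("name");
  }

  public void write(PreparedStatement statement) throws SQLException {
    // Intentionally empty: this class is not used with DBOutputFormat.
  }

  public int getId() { return id; }
  public String getName() { return name; }
}
```
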
>
> A couple other tips:
> * Consider using DataDrivenDBInputFormat. It's considerably
> higher-throughput.
> * If you're using CDH (Cloudera's Distribution for Hadoop), rather than
> write your own DBWritable, use Sqoop's code generation capability (sqoop
> codegen --connect ... --table ...) to create your Java class for you.
> * Related, if all you're doing is importing a copy of the data to HDFS,
> Sqoop can handle that for you pretty easily :)
>
> See github.com/cloudera/sqoop and archive.cloudera.com/cdh/3/sqoop for more
> info.
>
> Cheers,
> - Aaron
>
> On Wed, Aug 4, 2010 at 7:41 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> AFAIK you don't really need serialization if your job is a map-only
>> one; the OutputFormat/RecWriter (if any) should take care of it.
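
A map-only job of the kind Harsh describes amounts to configuring zero reduce tasks. A rough sketch against the new (org.apache.hadoop.mapreduce) API; MyMapper, MyRow, the JDBC driver/URL, the table and column names, and the output path are all placeholders, not details from this thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

class MapOnlyDbJob {
  static Job configure(Configuration conf) throws Exception {
    // Placeholder connection details.
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/mydb");
    Job job = new Job(conf, "db-import");
    job.setMapperClass(MyMapper.class);           // placeholder mapper
    job.setInputFormatClass(DBInputFormat.class);
    // MyRow is the DBWritable; table/columns are placeholders.
    DBInputFormat.setInput(job, MyRow.class,
        "users", null /* conditions */, "id" /* orderBy */, "id", "name");
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/db-import"));
    // Zero reducers makes this map-only: mapper output goes straight to
    // the OutputFormat, so no Writable serialization is involved.
    job.setNumReduceTasks(0);
    return job;
  }
}
```
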
>>
>> On Thu, Aug 5, 2010 at 7:07 AM, David Rosenstrauch<[EMAIL PROTECTED]>
>> wrote:
>>> I'm working on a M/R job which uses DBInputFormat.  So I have to create
>>> my own DBWritable for this.  I'm a little bit confused about how to
>>> implement this though.
>>>
>>> In the sample code in the Javadoc for the DBWritable class, the
>>> MyWritable implements both DBWritable and Writable - thereby forcing
>>> the author of the MyWritable class to implement the methods to
>>> serialize/deserialize it to/from DataInput & DataOutput.  Without
>>> getting into too much detail, having to implement this serialization
>>> would add a good bit of complexity to my code.
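
The Javadoc sample being described looks roughly like this; the counter/timestamp fields follow the DBWritable Javadoc, while the interfaces here are package-private stand-ins (the real ones are org.apache.hadoop.io.Writable and org.apache.hadoop.mapreduce.lib.db.DBWritable) so the sketch compiles without Hadoop:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Minimal stand-ins so the sketch compiles without Hadoop on the classpath.
interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
interface DBWritable {
  void write(PreparedStatement statement) throws SQLException;
  void readFields(ResultSet resultSet) throws SQLException;
}

// Implements both interfaces, as in the DBWritable Javadoc sample; the
// Writable half is the extra serialization work in question.
class MyWritable implements Writable, DBWritable {
  private int counter;
  private long timestamp;

  // Writable: byte-level serialization (shuffle, SequenceFiles).
  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }
  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  // DBWritable: mapping to/from a JDBC row.
  public void write(PreparedStatement statement) throws SQLException {
    statement.setInt(1, counter);
    statement.setLong(2, timestamp);
  }
  public void readFields(ResultSet resultSet) throws SQLException {
    counter = resultSet.getInt(1);
    timestamp = resultSet.getLong(2);
  }

  public int getCounter() { return counter; }
  public void setCounter(int c) { counter = c; }
  public long getTimestamp() { return timestamp; }
  public void setTimestamp(long t) { timestamp = t; }
}
```
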
>>>
>>> However, the DBWritable that I'm writing really doesn't need to exist
>>> beyond the Mapper.  I.e., it'll be input to the Mapper, but the Mapper
>>> won't emit it out to the sort/reduce steps.  And after doing some
>>> reading/digging through the code, it looks to me like the InputFormat
>>> and the Mapper always get run on the same host & JVM.  If that's in
>>> fact the case, then there'd be no need for me to make my DBWritable
>>> implement Writable also and so I could avoid the whole
>>> serialization/deserialization issue.
>>>
>>> So my question is basically:  have I got this correct?  Do the
>>> InputFormat and the Mapper always run in the same VM?  (In which case I
>>> can do what I'm planning and code the DBWritable without the
>>> serialization headaches from the Writable class.)
>>>
>>> TIA,
>>>
>>> DR
>>>
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>
>