MapReduce, mail # user - Re: Some general questions about DBInputFormat


Nick Jones 2012-09-11, 21:35
Re: Some general questions about DBInputFormat
Yaron Gonen 2012-09-12, 13:54
Hi again Nick,
DBInputFormat does use Connection.TRANSACTION_SERIALIZABLE, but this is a
per-connection attribute. Since every mapper has its own connection, and
every connection is opened at a different time, every connection sees a
different snapshot of the DB. This can cause, for example, two mappers to
process the same record (if an insert was performed in between).
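
To make the duplicate-record scenario concrete, here is a minimal, database-free sketch. It assumes each mapper reads its split with an ordered LIMIT/OFFSET-style query against its own snapshot; the class name, the in-memory "table" and the id values are all made up for illustration and are not taken from the Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation: no JDBC, no Hadoop, just sorted sets standing in
// for the snapshot each mapper's connection would see.
public class SplitSnapshotDemo {

    // Stand-in for "SELECT id FROM t ORDER BY id LIMIT <limit> OFFSET <offset>"
    // evaluated against one connection's snapshot of the table.
    static List<Integer> readSplit(TreeSet<Integer> snapshot, int offset, int limit) {
        List<Integer> ordered = new ArrayList<>(snapshot);
        int from = Math.min(offset, ordered.size());
        int to = Math.min(offset + limit, ordered.size());
        return new ArrayList<>(ordered.subList(from, to));
    }

    // Returns the ids that both mappers end up processing.
    static List<Integer> duplicatesAfterInsert() {
        TreeSet<Integer> table = new TreeSet<>(List.of(10, 20, 30, 40, 50, 60));

        // Mapper 0 connects first and reads rows 0..2 of its snapshot.
        TreeSet<Integer> snapshot0 = new TreeSet<>(table);
        List<Integer> split0 = readSplit(snapshot0, 0, 3);

        // Another client inserts a row that sorts before the existing ones.
        table.add(5);

        // Mapper 1 connects later, so its snapshot already contains the new
        // row and every existing row has shifted by one position.
        TreeSet<Integer> snapshot1 = new TreeSet<>(table);
        List<Integer> split1 = readSplit(snapshot1, 3, 3);

        List<Integer> duplicates = new ArrayList<>(split0);
        duplicates.retainAll(split1);
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println("Processed twice: " + duplicatesAfterInsert());
    }
}
```

In this toy run the row with id 30 is read by both mappers, and the row with id 60 is read by neither. A serializable transaction on each connection does not help here, because each one is consistent only with its own snapshot.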

On Wed, Sep 12, 2012 at 12:35 AM, Nick Jones <[EMAIL PROTECTED]> wrote:

>  Hi Yaron,
>
> I haven't looked at/used it in a while but I seem to remember that each
> mapper's SQL request was wrapped in a transaction to prevent the number of
> rows changing.  DBInputFormat uses Connection.TRANSACTION_SERIALIZABLE from
> java.sql.Connection to prevent changes in the number of rows selected from
> a where clause.
>
> The locking behavior I observed may have also been related to how MySQL
> was set up at the time.
>
>
> On 09/11/2012 09:25 AM, Yaron Gonen wrote:
>
> Thanks for the fast response.
> Nick, regarding locking a table: as far as I understood from the code,
> each mapper opens its own connection to the DB. I didn't see any code such
> that the job creates a transaction and passes it to the mapper. Did I
> miss something?
> again, thanks!
>
>
> On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones <[EMAIL PROTECTED]> wrote:
>
>> Hi Yaron
>>
>> Replies inline below.
>>
>>
>> On 09/11/2012 07:41 AM, Yaron Gonen wrote:
>>
>>>  Hi,
>>> After reviewing the class's (not very complicated) code, I have some
>>> questions I hope someone can answer:
>>>
>>>   * (more general question) Are there many use-cases for using
>>>     DBInputFormat? Do most Hadoop jobs take their input from files or
>>>     DBs?
>>>
>> Bejoy's right, most jobs utilize data across HDFS or some other
>> distributed architecture to feed M/R at a sufficient rate. DBInputFormat
>> could be helpful in pulling pointers to other sources of data (e.g. file
>> paths for filers where actual binary content is stored).
>>
>>>
>>>   * What happens when the database is updated during the mappers' data
>>>     retrieval phase? Is there a way to lock the database before the
>>>     data retrieval phase and release it afterwards?
>>>
>> The whole job creates a transaction against the RDBMS that ensures
>> consistent state throughout the job.  Depending on the source and settings,
>> this might lock the entire table or only the rows selected by the query.
>>
>>>
>>>   * Since all mappers open a connection to the same DBS, one cannot
>>>     use hundreds of mappers. Is there a solution to this problem?
>>>
>> Depends on the connection limits and the number of rows requested.
>> I've found that the server ran into other problems before it hit any
>> connection count limits.
>>
>>>
>>> Thanks,
>>> Yaron
>>>
>>
>>
>>
>
>
Yaron Gonen 2012-09-11, 12:41
Nick Jones 2012-09-11, 13:00
Bejoy KS 2012-09-11, 12:48