Hadoop, mail # user - concurrency


Re: concurrency
Harsh J 2012-10-12, 15:35
Hey Koert,

Yes, you can check that the _SUCCESS file (created when a job commits
successfully) exists in the chosen input directory before firing the
new job. This is consistent with what Oozie does as well.

Since the listing of input files happens after the submit() call, doing
this will "just work" :)
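
A minimal sketch of the check described above, assuming the dataset sits
on a local filesystem so the example stays self-contained. The helper
names (committed_partitions, build_demo_dataset) and the demo layout are
hypothetical. On HDFS the same test would go through the Hadoop
FileSystem API (for example FileSystem.exists on the _SUCCESS path)
rather than pathlib, and the resulting list would be passed to the query
job as its input paths instead of the /data/* glob.

```python
import tempfile
from pathlib import Path

SUCCESS_MARKER = "_SUCCESS"  # written when a job commits successfully


def committed_partitions(data_root):
    """Return the partition directories under data_root that contain
    a _SUCCESS marker, sorted by path."""
    root = Path(data_root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and (p / SUCCESS_MARKER).is_file()
    )


def build_demo_dataset(root):
    """Create a hypothetical layout: one committed partition and one
    partition that a loader is still writing (no marker yet)."""
    done = root / "partition1"
    done.mkdir()
    (done / "part-00000").write_text("complete\n")
    (done / SUCCESS_MARKER).touch()

    in_progress = root / "partition2"
    in_progress.mkdir()
    (in_progress / "part-00000").write_text("partial")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_dataset(root)
        # Only partitions with a marker are handed to the query job.
        for partition in committed_partitions(root):
            print(partition.name)
```

The query then reads only the directories that were complete at the
moment the list was built. A partition that finishes later is picked up
by the next query run.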

On Fri, Oct 12, 2012 at 8:00 PM, Koert Kuipers <[EMAIL PROTECTED]> wrote:
> We have a dataset that is heavily partitioned, like this
> /data
>   partition1/
>     _SUCCESS
>     part-00000
>     part-00001
>     ...
>   partition2/
>     _SUCCESS
>     part-00000
>     part-00001
>     ....
>   ...
>
> We have loaders that use map-red jobs to add new partitions to this data
> set at a regular interval (so they write to new sub-directories).
>
> We also have map-red queries that read from the entire dataset (/data/*).
> My worry here is concurrency. It will happen that a query job runs while
> a loader job is adding a new partition at the same time. Is there a risk
> that the query could read incomplete or corrupt files? Is there a way to
> use the _SUCCESS files to prevent this from happening?
> Thanks for your time!
> Best,
> Koert

--
Harsh J