Hadoop >> mail # user >> concurrency


We have a dataset that is heavily partitioned, like this:
/data
  partition1/
    _SUCCESS
    part-00000
    part-00001
    ...
  partition2/
    _SUCCESS
    part-00000
    part-00001
    ...
  ...

We have loaders that use map-red jobs to add new partitions to this
dataset at a regular interval (so they write to new sub-directories).

We also have map-red queries that read from the entire dataset (/data/*).
My worry here is concurrency. It will happen that a query job runs while a
loader job is adding a new partition. Is there a risk that the query could
read incomplete or corrupt files? Is there a way to use the _SUCCESS files
to prevent this from happening?
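
To make the question concrete, here is a rough sketch of the kind of filtering I have in mind: instead of globbing /data/*, the query would first list the partition directories and keep only those that already contain a _SUCCESS marker. This is only an illustration in Python against a local directory standing in for HDFS. The function names (completed_partitions, input_files) are made up for the example, and it assumes the loader writes _SUCCESS only after all part files of a partition are in place.

```python
from pathlib import Path

# Marker file name used in the layout above.
SUCCESS_MARKER = "_SUCCESS"


def completed_partitions(data_root):
    """Return the partition directories under data_root that contain a
    _SUCCESS marker, sorted by path.

    Assumption: a partition without the marker is still being written
    (or its load failed) and should be skipped by queries.
    """
    root = Path(data_root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and (p / SUCCESS_MARKER).is_file()
    )


def input_files(data_root):
    """Return the part-* files of completed partitions only.

    A query would use this list as its input instead of /data/*.
    """
    files = []
    for partition in completed_partitions(data_root):
        files.extend(sorted(partition.glob("part-*")))
    return files
```

The list would be a snapshot taken when the query is set up, so a partition that completes afterwards would simply be picked up by the next query run.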
Thanks for your time!
Best,
Koert