We have a heavily partitioned dataset, with each partition in its own
sub-directory.
We have loaders that use MapReduce jobs to add new partitions to this dataset
at a regular interval (so they write to new sub-directories).
We also have MapReduce queries that read from the entire dataset (/data/*).
My worry here is concurrency. Sooner or later a query job will run while a
loader job is adding a new partition. Is there a risk that the query could
read incomplete or corrupt files? Is there a way to use the _SUCCESS files
to prevent this from happening?
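
To make the _SUCCESS idea concrete, here is a minimal sketch of the kind of
filtering in question. It assumes that each loader job writes an empty
_SUCCESS file into its partition directory only after the job has finished,
and it uses the local filesystem as a stand-in for HDFS. The helper name
completed_partitions and the directory names part-a and part-b are
hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path

# Marker file assumed to be written by the loader job once it has finished.
SUCCESS_MARKER = "_SUCCESS"


def completed_partitions(root):
    """Return the sub-directories of `root` that contain a _SUCCESS marker.

    Directories without the marker are treated as still being written
    and are skipped. The result is sorted by path for determinism.
    """
    root = Path(root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and (p / SUCCESS_MARKER).is_file()
    )


# Small demonstration on a temporary local directory (not HDFS).
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    done = data / "part-a"          # hypothetical finished partition
    in_progress = data / "part-b"   # hypothetical partition still loading
    done.mkdir(parents=True)
    in_progress.mkdir(parents=True)

    (done / "part-00000").write_text("complete\n")
    (done / SUCCESS_MARKER).touch()
    (in_progress / "part-00000").write_text("partial")

    demo_result = [p.name for p in completed_partitions(data)]
    print(demo_result)
```

The query job would then be given this explicit list of directories as its
input instead of the /data/* glob. Whether that fully closes the race
depends on how the loader commits its output, which this sketch does not
address.
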
Thanks for your time!
Harsh J 2012-10-12, 15:35
J. Rottinghuis 2012-10-12, 16:07
Harsh J 2012-10-12, 16:17
Koert Kuipers 2012-10-12, 17:05