Hadoop >> mail # user >> concurrency


We have a dataset that is heavily partitioned, like this:
/data
  partition1/
    _SUCCESS
    part-00000
    part-00001
    ...
  partition2/
    _SUCCESS
    part-00000
    part-00001
    ...
  ...

We have loaders that use MapReduce jobs to add new partitions to this
dataset at regular intervals (so they write to new sub-directories).

We also have MapReduce queries that read from the entire dataset (/data/*).
My worry here is concurrency. Sooner or later a query job will run while a
loader job is adding a new partition. Is there a risk that the query could
read incomplete or corrupt files? Is there a way to use the _SUCCESS files
to prevent this from happening?
Thanks for your time!
Best,
Koert
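
To make the _SUCCESS idea concrete, here is a minimal sketch of the filtering step: build the query's input list only from partitions that already contain a _SUCCESS marker. It assumes the loader job writes _SUCCESS only after all of its part files are complete. For illustration it walks a local directory tree with Python's pathlib; on HDFS the same check would go through the Hadoop FileSystem API instead. The function names completed_partitions and input_files are hypothetical.

```python
from pathlib import Path


def completed_partitions(data_root):
    """Return partition directories under data_root that contain a _SUCCESS marker.

    Assumption: the loader creates _SUCCESS only once the partition is fully written.
    """
    root = Path(data_root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and (p / "_SUCCESS").is_file()
    )


def input_files(data_root):
    """Collect part-* files, skipping partitions that are still being loaded."""
    files = []
    for partition in completed_partitions(data_root):
        files.extend(sorted(partition.glob("part-*")))
    return files
```

The query job would then be given the result of input_files("/data") as its explicit input paths rather than the /data/* glob, so a partition that appears mid-load is simply not part of that run.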