Hadoop >> mail # user >> Prepare input data for Hadoop


Re: Prepare input data for Hadoop
Use an external database (e.g., MySQL) or some other transactional
bookkeeping system to record the state of each of your datasets (STAGING,
UPLOADED, PROCESSED).
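
A minimal sketch of that kind of bookkeeping, under stated assumptions:
SQLite (from the Python standard library) stands in for MySQL so the example
runs as-is, and the table name `datasets` and the helpers `open_ledger`,
`register`, `advance`, and `pending_input` are hypothetical names made up for
illustration. Copying files to HDFS and launching the job are left out; only
the state tracking is shown.

```python
import sqlite3


def open_ledger(db_path=":memory:"):
    """Open (or create) the bookkeeping table. Hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets ("
        " path TEXT PRIMARY KEY,"
        " state TEXT NOT NULL"
        " CHECK (state IN ('STAGING', 'UPLOADED', 'PROCESSED')))"
    )
    conn.commit()
    return conn


def register(conn, path):
    """Record a new local file as STAGING; registering it again is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO datasets (path, state) VALUES (?, 'STAGING')",
            (path,),
        )


def advance(conn, path, old_state, new_state):
    """Move a dataset from old_state to new_state.

    Returns True only if this call performed the transition, so a second
    caller racing on the same file gets False.
    """
    with conn:
        cur = conn.execute(
            "UPDATE datasets SET state = ? WHERE path = ? AND state = ?",
            (new_state, path, old_state),
        )
    return cur.rowcount == 1


def pending_input(conn):
    """Paths that are uploaded but not yet processed by any job run."""
    rows = conn.execute(
        "SELECT path FROM datasets WHERE state = 'UPLOADED' ORDER BY path"
    )
    return [row[0] for row in rows]


conn = open_ledger()
register(conn, "logs/a.txt")
register(conn, "logs/b.txt")
advance(conn, "logs/a.txt", "STAGING", "UPLOADED")    # after the copy to HDFS
print(pending_input(conn))                            # ['logs/a.txt']
advance(conn, "logs/a.txt", "UPLOADED", "PROCESSED")  # after the job succeeds
print(pending_input(conn))                            # []
```

With this layout, each periodic run would take only the paths returned by
`pending_input` as its input and mark them PROCESSED afterwards, so the files
themselves never have to be deleted.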

- Aaron
On Thu, Sep 17, 2009 at 7:17 PM, Huy Phan <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I have a question about how to prepare input data for a Hadoop MapReduce
> job. We have to (somehow) copy input files from our local filesystem to
> HDFS. How can we make sure that an input file is not processed twice by
> different executions of the same MapReduce job (let's say the job runs
> once every 30 minutes)?
> I don't want to delete my input files after the MR job finishes, because I
> may want to re-use them later.