HDFS, mail # user - Best practices for working with files in Hadoop MR jobs


Ivan Ryndin 2012-12-11, 05:59
Re: Best practices for working with files in Hadoop MR jobs
Mahesh Balija 2012-12-11, 09:55
One more approach I would prefer is:

c) Once your job completes processing an input file, move the file to
another path (say /input/processed),
and then delete the files in that path after all the jobs have finished
execution.
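
A minimal sketch of that move-then-clean-up flow with the Hadoop FileSystem API
could look like the one below; the /user/logp/input path comes from your mail,
while the processed directory, class and method names are only placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProcessedFileMover {

    // Move a fully processed file out of the live input directory into
    // /user/logp/input/processed (the directory name is just an example).
    public static void archive(Configuration conf, String fileName) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path processedDir = new Path("/user/logp/input/processed");
        if (!fs.exists(processedDir)) {
            fs.mkdirs(processedDir);
        }
        // rename() is a metadata-only operation within the same HDFS namespace
        fs.rename(new Path("/user/logp/input", fileName),
                  new Path(processedDir, fileName));
    }

    // Call this from the driver once every job in the batch has finished.
    public static void cleanup(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path("/user/logp/input/processed"), true); // recursive
    }
}

Note that rename() will not overwrite an existing file with the same name in the
processed directory (it just returns false), so per-timestamp subdirectories may
be safer.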

If this solution doesn't work for you, just stick to the first one.

Best,
Mahesh Balija,
CalSoft Labs.

On Tue, Dec 11, 2012 at 11:29 AM, Ivan Ryndin <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I have following question:
> What are the best practices for working with files in Hadoop?
>
> I need to process a lot of log files that arrive in Hadoop every minute,
> and I have multiple jobs for each file.
> Each file has a unique name which includes the name of the front-end node and a
> timestamp.
> A file is considered fully processed if and only if all jobs have
> completed successfully.
>
> Currently I put all files into a single HDFS input directory, e.g.
> /user/logp/input
> Then I run a bunch of jobs against these files.
> After successful completion I need to move the processed files out of the
> HDFS directory /user/logp/input to somewhere else (e.g. AWS S3, Glacier, or
> something similar).
>
> How should I deal with such a problem?
> Currently I have two approaches:
>
> 1) Each job copies the files into its own separate HDFS input directory (e.g.
> /user/logp/job/Job1/input/{timestamp}) and then reads these files from
> there. When it has processed the files successfully, it removes them from there.
> The main job driver removes the files from the shared input directory once all
> jobs have their own copies of these files.
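
For approach (1), a minimal sketch of the stage-copy-then-clean-up flow with the
FileSystem API might look like the following; the class and method names are made
up, only the directory layout is taken from your description.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class PerJobStaging {

    // Copy every file from the shared input directory into a job-private
    // input directory, e.g. /user/logp/job/Job1/input/{timestamp}.
    public static Path stageInput(Configuration conf, String jobName, long timestamp)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path shared = new Path("/user/logp/input");
        Path jobInput = new Path("/user/logp/job/" + jobName + "/input/" + timestamp);
        fs.mkdirs(jobInput);
        for (FileStatus status : fs.listStatus(shared)) {
            if (status.isFile()) {
                // copy rather than move, so the other jobs still see the original file
                FileUtil.copy(fs, status.getPath(),
                              fs, new Path(jobInput, status.getPath().getName()),
                              false /* deleteSource */, conf);
            }
        }
        return jobInput; // pass this path to the job as its input directory
    }

    // Once the job has completed successfully, drop its private copy.
    public static void dropStagedInput(Configuration conf, Path jobInput) throws IOException {
        FileSystem.get(conf).delete(jobInput, true);
    }
}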
>
> 2) Each job has its own table in HBase and checks for processed files there.
> Once a job has completed successfully against a file, it writes the filename
> into its HBase table.
> The main job driver then checks the files in the input directory and removes
> those which have been processed by all jobs.
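
For approach (2), a rough sketch of the HBase bookkeeping with the classic
HTable/Put/Get client API might look like this; the table layout (one table per
job, row key = file name, a single marker column "f:done") and the method names
are assumptions for the example.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ProcessedFileTracker {

    private static final byte[] FAMILY = Bytes.toBytes("f");
    private static final byte[] QUALIFIER = Bytes.toBytes("done");

    // A job calls this after it has successfully processed the given file.
    public static void markProcessed(Configuration conf, String tableName, String fileName)
            throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(conf), tableName);
        try {
            Put put = new Put(Bytes.toBytes(fileName));
            put.add(FAMILY, QUALIFIER, Bytes.toBytes(System.currentTimeMillis()));
            table.put(put);
        } finally {
            table.close();
        }
    }

    // The driver asks whether a given job has already processed the file.
    public static boolean isProcessed(Configuration conf, String tableName, String fileName)
            throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(conf), tableName);
        try {
            return !table.get(new Get(Bytes.toBytes(fileName))).isEmpty();
        } finally {
            table.close();
        }
    }
}

The driver would then list /user/logp/input and delete a file only after
isProcessed(...) returns true for every job's table.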
>
> Perhaps I am missing some other approach which is considered best practice?
> Can you please tell me what you think about all this?
>
> Thank you in advance!!
>
>
> --
> Best regards,
> Ivan
>
>