Best practices for working with files in Hadoop MR jobs
Hi all,

I have the following question:
What are the best practices for working with files in Hadoop?

I need to process a lot of log files that arrive in Hadoop every minute,
and I have multiple jobs for each file.
Each file has a unique name which includes the name of the front-end node
and a timestamp.
A file is considered fully processed if and only if all jobs have completed
successfully.

Currently I put all files into a single HDFS input directory, e.g.
/user/logp/input, and then I run a bunch of jobs against those files.
After successful completion I need to move the processed files out of the
HDFS directory /user/logp/input to somewhere else (e.g. AWS S3, Glacier or
something similar).
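For illustration, one way to do that archiving step is to copy each processed
file to S3 and drop the HDFS copy in the same call. This is only a sketch: the
bucket name and paths are made up, it assumes an S3 connector (s3a, or s3n on
older Hadoop versions) is configured, and Glacier would normally be reached via
an S3 lifecycle rule rather than directly from Hadoop.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ArchiveProcessedFile {

    // Copy one processed file from the HDFS input directory to S3,
    // deleting the HDFS copy once the transfer succeeds.
    public static void archive(Configuration conf, String fileName) throws IOException {
        Path src = new Path("/user/logp/input/" + fileName);
        Path dst = new Path("s3a://my-log-archive/processed/" + fileName); // hypothetical bucket

        FileSystem hdfs = FileSystem.get(conf);
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-log-archive/"), conf);

        // deleteSource = true removes the file from /user/logp/input after the copy.
        FileUtil.copy(hdfs, src, s3, dst, true, conf);
    }
}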

How should I handle such a problem?
Currently I see two approaches:

1) Each job copies the files into its own separate HDFS input directory (e.g.
/user/logp/job/Job1/input/{timestamp}) and then reads the files from there.
When it has processed the files successfully, it removes them from that
directory. The main jobs driver removes the files from the shared input
directory once all jobs have their own copies of these files.
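A rough sketch of this first approach (the job names, directory layout and
timestamp handling are all illustrative): the driver fans the files out to the
per-job input directories and cleans up the shared directory afterwards.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyToJobInputs {

    private static final String[] JOBS = {"Job1", "Job2", "Job3"}; // hypothetical job names

    public static void distribute(Configuration conf, long timestamp) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path sharedInput = new Path("/user/logp/input");

        for (FileStatus file : fs.listStatus(sharedInput)) {
            boolean copiedToAll = true;
            for (String job : JOBS) {
                Path jobDir = new Path("/user/logp/job/" + job + "/input/" + timestamp);
                fs.mkdirs(jobDir);
                // Keep the source for the remaining jobs (deleteSource = false).
                copiedToAll &= FileUtil.copy(fs, file.getPath(), fs,
                        new Path(jobDir, file.getPath().getName()), false, conf);
            }
            // Once every job has its own copy, the shared copy can go away.
            if (copiedToAll) {
                fs.delete(file.getPath(), false);
            }
        }
    }
}

Each job would then delete its own copy from its per-job input directory after
it finishes successfully.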

2) Each job has its own table in HBase where it tracks processed files.
Once a job has completed successfully against a file, it writes the filename
into its HBase table. The main jobs driver then checks the files in the
input directory and removes those which have been processed by all jobs.
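A sketch of this second approach, using the older HTable client API (newer
HBase versions would use Connection/Table and Put.addColumn instead); the
per-job table names, column family and qualifier are made up, with the file
name as the row key:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ProcessedFilesTracker {

    private static final byte[] CF = Bytes.toBytes("f");
    private static final byte[] COL = Bytes.toBytes("done");

    // Called by a job after it has successfully processed a file.
    public static void markProcessed(Configuration conf, String jobName, String fileName)
            throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(conf), jobName + "_processed");
        try {
            Put put = new Put(Bytes.toBytes(fileName));
            put.add(CF, COL, Bytes.toBytes(System.currentTimeMillis()));
            table.put(put);
        } finally {
            table.close();
        }
    }

    // Called by the driver: has every job marked this file as processed?
    public static boolean processedByAll(Configuration conf, String[] jobNames, String fileName)
            throws IOException {
        for (String job : jobNames) {
            HTable table = new HTable(HBaseConfiguration.create(conf), job + "_processed");
            try {
                if (!table.exists(new Get(Bytes.toBytes(fileName)))) {
                    return false;
                }
            } finally {
                table.close();
            }
        }
        return true;
    }
}

The driver would call processedByAll() for each file in /user/logp/input and
delete (or archive) the file when it returns true.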

Perhaps I am missing some other approach which is considered best practice?
Can you please tell me what you think about all this?

Thank you in advance!!
--
Best regards,
Ivan