It's a common web log analysis situation: the raw weblogs are saved every hour on multiple servers, and we would like the parsed results to be saved as one file per hour. How can we do that?
In our MR job, the input is a directory with many files spanning many hours, say 4X files covering X hours. If there are, e.g., 10 reducers, then all of the results get partitioned into 10 files, each of which contains results from every hour. We would like the results saved in X files, each containing only one hour's results. Since the input files can change, I can't even hard-code the reducer number to be exactly X in the program.
If you only want one file, then you need to set the number of reducers to 1.
If the size of the data makes it impractical for the original MR job to use a single reducer, you can run a second job on the output of the first, with the default mapper and reducer (the identity ones), and set numReducers = 1 on that job.
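Something like this should do it (untested sketch; it assumes the first job wrote tab-separated text, the TextOutputFormat default, so KeyValueTextInputFormat can read it back as Text/Text):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeToOneFile {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge-to-one-file");
    job.setJarByClass(MergeToOneFile.class);
    // No setMapperClass/setReducerClass calls: the defaults are the
    // identity Mapper and Reducer, so records pass through unchanged.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(1);  // one reducer => one output file
    FileInputFormat.addInputPath(job, new Path(args[0]));   // first job's output dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

As a side effect the single reducer sorts everything by key, which works in your favor if the keys are timestamps.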
Or use the hdfs getmerge function to collate the results into one file.
On Mar 1, 2014 4:59 AM, "Fengyun RAO" <[EMAIL PROTECTED]> wrote:
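For reference, getmerge takes the job's output directory on HDFS and a local destination file (paths here are placeholders):

hadoop fs -getmerge <hdfs-output-dir> <local-merged-file>

Note that the merged file lands on the local filesystem, not on HDFS.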
Is there any particular reason you have to have exactly 1 file per hour? As you probably know already, each reducer will output 1 file, or, if you use MultipleOutputs as I suggested, a set of files. If you have to fit the number of reducers to the number of hours in the input, and generate the number of files accordingly, it will most likely be at the expense of cluster efficiency and performance. The worst case, of course, is if you have a bunch of data all within the same hour: then you have to settle for 1 reducer without any parallelization at all.
A workaround is to use MultipleOutputs to generate a set of files for each hour, with the hour as the base name, or, if you so choose, a sub-directory for each hour. For example, if you use mmddhh as the base name, you will have a set of files for an hour like:

030114-r-00000, 030114-r-00001, ...
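For example, a reducer along these lines (sketch only; the mmddhh prefix extraction is an assumption about your key format, not your actual code):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class HourlyReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Assumption: the key begins with an mmddhh timestamp prefix.
    String hour = key.toString().substring(0, 6);
    for (Text value : values) {
      // Produces files named <hour>-r-<part#>, e.g. 030114-r-00000.
      mos.write(key, value, hour);
      // For a sub-directory per hour instead, pass: hour + "/part"
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();
  }
}

You may also want LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) in the driver so the default part-r-* files aren't created empty alongside the named ones.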
We want one file per hour to make subsequent queries easier: it is very convenient to select several specified hours' results.
We also need the records sorted by timestamp for the processing that follows. With a set of files per hour, as you showed with MultipleOutputs, we would have to merge-sort them later. Maybe that needs another MR job?
2014-03-02 13:14 GMT+08:00 Simon Dong <[EMAIL PROTECTED]>:
Don't you think using Flume would be easier? Use the HDFS sink and set a property to roll the log file every hour. This way you use a single Flume agent to receive logs as they are generated, and you dump them directly to HDFS. If you want to drop unwanted logs, you can write a custom sink before dumping to HDFS.
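Something like this (agent, channel, and sink names plus the Avro source are just for illustration; the rollInterval of 3600 seconds is what gives you one file per hour):

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = hdfs1

# Illustrative source; use whatever matches how your servers ship logs.
agent1.sources.src1.type = avro
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 4141
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
# One directory per hour; the %H escape needs a timestamp, hence useLocalTimeStamp.
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/logs/%Y%m%d/%H
agent1.sinks.hdfs1.hdfs.useLocalTimeStamp = true
# Roll purely on time: a new file every 3600 s, never on size or event count.
agent1.sinks.hdfs1.hdfs.rollInterval = 3600
agent1.sinks.hdfs1.hdfs.rollSize = 0
agent1.sinks.hdfs1.hdfs.rollCount = 0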
I suppose this would be much easier. On 2 Mar 2014 12:34, "Fengyun RAO" <[EMAIL PROTECTED]> wrote:
Thanks, Shekhar. I'm unfamiliar with Flume, but I will look into it later. 2014-03-02 15:36 GMT+08:00 Shekhar Sharma <[EMAIL PROTECTED]>: