Re: Archive Task Logs (Stdout, Stderr, Syslogs) and Job Tracker logs of a Hadoop Cluster for later analysis
From your comments, it seems Flume will be the right tool for the task.
The SpoolingDirectorySource would be a great choice for the task you have
since the log data has already been generated.
However, the Spooling Directory Source requires that the files be immutable.
This means once a file is created or dropped in the spooling directory it
cannot be modified.
Consequently, you will not be able to simply use the current log directory,
where the log files are continuously being appended to.
I would recommend that you set aside a separate directory for spooling for
Flume and then set up some sort of cronjob or scheduled task that will
periodically drop the logs into the spooling directory after traversing the
symlinks and recursively processing the log directories.
The SpoolingDirectorySource currently does not recursively traverse
subdirectories.
It assumes that all the files you plan to consume are in the root folder
you are spooling.
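
As a rough sketch of such a scheduled task (not an official Flume tool), the Python script below walks a log directory, follows symlinks, and drops flattened copies of older files into the spooling directory. The function names, the `__` separator, and the one-hour age threshold are all assumptions for illustration. It treats a file as finished once it has not been modified for a while, and it assumes Flume is left at its default of renaming consumed files with a `.COMPLETED` suffix.

```python
import os
import shutil
import time


def flat_name(root, path):
    """Turn a path under root into one file name, e.g. a/b/c -> a__b__c."""
    rel = os.path.relpath(path, root)
    return rel.replace(os.sep, "__")


def spool_logs(source_root, staging_dir, spool_dir,
               min_age_seconds=3600, now=None):
    """Copy files older than min_age_seconds into spool_dir.

    Returns the list of file names newly placed in spool_dir.
    """
    now = time.time() if now is None else now
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(spool_dir, exist_ok=True)
    shipped = []
    # followlinks=True makes os.walk descend into symlinked directories.
    # Beware: a symlink cycle would make the walk loop forever.
    for dirpath, _dirnames, filenames in os.walk(source_root, followlinks=True):
        for filename in sorted(filenames):
            src = os.path.join(dirpath, filename)
            if not os.path.isfile(src):
                continue  # e.g. a broken symlink
            # Assumption: a recently modified file may still be appended to.
            if now - os.path.getmtime(src) < min_age_seconds:
                continue
            name = flat_name(source_root, src)
            dest = os.path.join(spool_dir, name)
            # Skip files already waiting in, or already consumed from, the spool.
            if os.path.exists(dest) or os.path.exists(dest + ".COMPLETED"):
                continue
            # Copy outside the spool first, then rename, so Flume never
            # sees a partially written file.
            tmp = os.path.join(staging_dir, name)
            shutil.copy2(src, tmp)
            os.rename(tmp, dest)
            shipped.append(name)
    return shipped
```

For the rename to be atomic, the staging directory has to live on the same filesystem as the spooling directory. A cron entry could then call something like `spool_logs("/var/log/hadoop-0.20-mapreduce/userlogs/", staging, spool)`, where `staging` and `spool` are directories of your choosing.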
Use FileChannel as the channel, since it persists events to disk and is
therefore more reliable than the in-memory channel.
Depending on the type of analysis you want to conduct, the
ElasticSearchSink might be a good choice for your sink.
Feel free to review the user guide for other options for Sinks.
You can also set up your own custom sink if you have other centralized
datastores in mind.
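
To make the pieces above concrete, here is a minimal sketch of an agent configuration that wires a spooling directory source to a file channel and an ElasticSearch sink. The agent name, component names, directory paths, and host are placeholders I made up, so please check the property names against the user guide for the Flume version you are running.

```properties
# Hypothetical agent called "agent" with one source, one channel, one sink.
agent.sources = spool
agent.channels = fc
agent.sinks = es

# Spooling directory source: reads files dropped into spoolDir (placeholder path).
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/spool/flume/hadoop-logs
agent.sources.spool.channels = fc

# File channel: events are persisted to disk (placeholder paths).
agent.channels.fc.type = file
agent.channels.fc.checkpointDir = /var/lib/flume/checkpoint
agent.channels.fc.dataDirs = /var/lib/flume/data

# ElasticSearch sink (placeholder host and port).
agent.sinks.es.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.es.hostNames = localhost:9300
agent.sinks.es.channel = fc
```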
Spend some time going through the user guide and developer guide so that
you can get a better understanding of the architecture and use cases.
On 8 April 2013 10:33, Christian Schneider <[EMAIL PROTECTED]> wrote:
> I need to collect log data from our Cluster.
> For this I think I need to copy the Contents of:
> * JobTracker: /var/log/hadoop-0.20-mapreduce/history/
> * TaskTracker: /var/log/hadoop-0.20-mapreduce/userlogs/
> It should also follow symlinks and copy recursively.
> Is flume the right tool to do this?
> E.g. with the "Spooling Directory Source"?
> Best Regards,