Flume, mail # user - Multiple Spooling Dir Sources
Multiple Spooling Dir Sources
Tim Driscoll 2013-07-22, 21:38
We are attempting to use the Spooling Directory Source to read data into
Flume.  Due to certain restrictions, we're stuck with placing around 2000
files in the directory to get processed, every 3-4 minutes.

The source cannot keep up with this load and gets progressively slower over
time.  I'm fairly certain this is because there are so many small files
rather than a few very large ones (but that is what we have to deal with).

From what I can tell, the source is single threaded, which is not
ideal for this situation.  I was thinking of a couple of options:

1. Create multiple Spooling Directory sources, pointing them to the same
directory, and changing the trackerDir.

2. Creating multiple Spooling Directory sources, pointing them to different
directories (if we can move the files to different dirs).

3. Use some other source.  But given that these files are the only inputs I
have to work with, I'm not sure there is another viable option.  Maybe the
Exec source with 'tail', though I don't think that would hold up either.
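For reference, option 2 would look roughly like the sketch below: one agent with two Spooling Directory sources feeding the same channel, each watching its own directory.  The agent/source/channel names and paths here are hypothetical placeholders, and option 1 would instead share one spoolDir while giving each source a distinct trackerDir.

```properties
# Hypothetical agent "a1" with two spooldir sources on separate directories
a1.sources = src1 src2
a1.channels = ch1

a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/inbox1
a1.sources.src1.channels = ch1

a1.sources.src2.type = spooldir
a1.sources.src2.spoolDir = /data/inbox2
a1.sources.src2.channels = ch1

# For option 1 (same spoolDir, per-source tracking state), something like:
#   a1.sources.src1.spoolDir   = /data/inbox
#   a1.sources.src1.trackerDir = /data/inbox/.tracker1
#   a1.sources.src2.spoolDir   = /data/inbox
#   a1.sources.src2.trackerDir = /data/inbox/.tracker2
# though whether two sources can safely share one spoolDir is exactly
# the question being asked here.
```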

Does anyone have any suggestions?  Is it even plausible to use multiple
spool sources on the same directory?  Is there a config I'm missing to
process more than one file at a time?

Any help would be appreciated.

-Tim