Folks,

I have been working with
com.datatorrent.lib.io.fs.AbstractFileInputOperator to accommodate a
slightly different use case, but I keep running into inefficiencies. Up
front: I do have this working, but the result feels horribly inefficient
and far from elegant. My use case needs to:
   - Scan multiple directories (not a single directory, as the operator
   expects)
   - Accept changes to the set of scanned directories on the fly
   - Accept multiple file types (detected by checking magic bytes/numbers)
   - Assume that files may be in any of the following conditions:
      - "Raw"
      - Compressed
      - Archived
      - Compressed and archived
   - Associate provenance (e.g. customer and sensor) with events extracted
   from these files
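To make the first two bullets concrete, the scanning behavior I have in mind looks roughly like the plain-Java sketch below. (MultiDirScanner is just a made-up name for illustration; it is not a Malhar class, and a real version would live inside a DirectoryScanner implementation.)

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch: scan a mutable set of directories, so paths can be
// added or removed between scan passes without restarting the operator.
public class MultiDirScanner {
    private final Set<Path> roots = Collections.synchronizedSet(new LinkedHashSet<>());

    public void addDirectory(String dir)    { roots.add(Paths.get(dir)); }
    public void removeDirectory(String dir) { roots.remove(Paths.get(dir)); }

    // One scan pass over every currently registered directory.
    public List<Path> scan() throws IOException {
        List<Path> found = new ArrayList<>();
        for (Path root : new ArrayList<>(roots)) {
            if (!Files.isDirectory(root)) continue;
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(root)) {
                for (Path p : ds) {
                    if (Files.isRegularFile(p)) found.add(p);
                }
            }
        }
        return found;
    }
}
```

The point is that the set of roots can change between calls to scan(), which is the part that does not map cleanly onto the single-directory assumption in the existing scanner.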

My existing solution provides my own implementation of
AbstractFileInputOperator.DirectoryScanner and emits arrays/lists of
events rather than Strings (lines from each file), since most of my input
file types are binary.

I am seeing several mismatches between my use case and
AbstractFileInputOperator, but I also see a great deal of existing work
within it that I would prefer not to redo (partitioning, fault tolerance,
etc.). Is there a more appropriate class/interface I should be looking at,
or is it reasonable to create a new interface for a directory scanner that
accounts for multiple directories and for compressed and archived files?
(Methods like openFile would then need to output a list of InputStreams,
at a minimum, to accommodate such files.) I just want to make sure I am
not overdoing things in a quest for more efficient and cleaner code.
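For reference, the kind of unwrapping I would want openFile to do is roughly the following (a plain-Java sketch using only java.util.zip; StreamUnwrapper is a made-up name, and a real version would also need tar support, error handling, and streaming rather than buffering whole entries):

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

// Hypothetical sketch: expand one physical file into the logical stream(s)
// it contains, using magic bytes to decide how to unwrap it.
public class StreamUnwrapper {
    public static List<InputStream> unwrap(InputStream raw) throws IOException {
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        in.mark(2);
        int b0 = in.read(), b1 = in.read();
        in.reset();
        if (b0 == 0x1F && b1 == 0x8B) {
            // gzip magic (1F 8B): decompress, then re-check in case an
            // archive is nested inside the compressed stream
            return unwrap(new BufferedInputStream(new GZIPInputStream(in)));
        }
        if (b0 == 'P' && b1 == 'K') {
            // zip magic ("PK"): one logical stream per archive entry
            List<InputStream> out = new ArrayList<>();
            ZipInputStream zip = new ZipInputStream(in);
            ZipEntry e;
            while ((e = zip.getNextEntry()) != null) {
                if (e.isDirectory()) continue;
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) > 0) buf.write(chunk, 0, n);
                out.add(new ByteArrayInputStream(buf.toByteArray()));
            }
            return out;
        }
        // "raw": hand the stream back untouched
        return Collections.singletonList(in);
    }
}
```

The recursion handles the "compressed and archived" case, and the raw fallback keeps plain files on the existing path; the mismatch with the current openFile contract is that one file can legitimately yield several streams.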

--

M. Aaron Bossert
(571) 242-4021
Punch Cyber Analytics Group