I'm trying to set up Flume to write my logs to HDFS. Along the way, Flume attaches
a number of headers (environment, hostname, etc.) that I would also like to
store with my log messages. Ideally, I'd like to be able to use Hive to
query all of this later. I must also admit to knowing next to nothing
about HDFS, which probably doesn't help. :-P
I'm confused about the HDFS sink configuration. Specifically, I'm trying
to understand what these two options do (and how they interact). The first
is the file type (hdfs.fileType), which takes one of three values:
DataStream - This appears to write just the event body, and loses all the headers.
CompressedStream - I assume this is just a compressed DataStream.
SequenceFile - I think this is what I want, since it seems to be a
key/value-based format, which I assume means it will include the headers.
The second is the write format (hdfs.writeFormat). This seems to only apply
to SequenceFile above, but lots of examples on the Internet seem to state
otherwise. I'm also unclear on the difference between the two values here.
Isn't "Text" just a specific type of "Writable" in HDFS?
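For reference, this is roughly the sink config I'm experimenting with (the agent, sink, and path names are made up; only the hdfs.* keys matter):

```properties
# Hypothetical agent/sink names -- substitute your own.
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
# The file type I think I want, per the reasoning above:
agent1.sinks.hdfs-sink.hdfs.fileType = SequenceFile
# Text vs. Writable -- this is the part I don't understand:
agent1.sinks.hdfs-sink.hdfs.writeFormat = Text
```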
Also, I'm unclear on why Flume's defaults seem to produce such small HDFS
files. Isn't HDFS designed for (and more efficient at) storing larger files
that are closer to the size of a full block? I was thinking it made more
sense to write all log data to a single file, and roll that file hourly
(or whatever interval the volume warrants). Thoughts here?
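In other words, I was picturing roll settings something like this (the hourly value is just a guess at what would suit my volume):

```properties
# Roll on time only: one file per hour, with size- and
# count-based rolling disabled (0 = off).
agent1.sinks.hdfs-sink.hdfs.rollInterval = 3600
agent1.sinks.hdfs-sink.hdfs.rollSize = 0
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
```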
Thanks a lot.