The DataStream fileType will let you write plain text files.
CompressedStream writes the same kind of stream, but compressed.
SequenceFile will create sequence files, as you have guessed, and you can use
either Text or Writable (bytes) for your data here.
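As a rough sketch of how these options look in an agent configuration: the
agent name a1, sink name k1, channel name c1 and the HDFS path below are
placeholders I made up, while the hdfs.* keys are the ones listed for the
HDFS sink in the Flume User Guide.

```properties
# Hypothetical agent "a1" with an HDFS sink "k1" reading from channel "c1"
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events

# Option 1: plain text files
a1.sinks.k1.hdfs.fileType = DataStream

# Option 2: compressed stream; a codec has to be named via hdfs.codeC
# a1.sinks.k1.hdfs.fileType = CompressedStream
# a1.sinks.k1.hdfs.codeC = gzip

# Option 3: sequence files; writeFormat selects Text or Writable records
# a1.sinks.k1.hdfs.fileType = SequenceFile
# a1.sinks.k1.hdfs.writeFormat = Text
```

Only one fileType should be active at a time; the commented-out lines are
alternatives, not additions.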
Flume is configurable out of the box with regard to the size of your
files. Yes, you are correct that it is better to create files that are at
least the size of a full block.
You can roll your files based on time, size, or number of events. Rolling
on an hourly basis makes perfect sense.
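For example, an hourly roll might look something like the fragment below.
The names a1 and k1 are placeholders for your own agent and sink. If I
recall the documented defaults correctly (30 seconds, 1024 bytes, 10
events), they are what produce the many small files, so the size and count
triggers are switched off here by setting them to 0.

```properties
# Roll the current file every hour (value is in seconds)
a1.sinks.k1.hdfs.rollInterval = 3600
# 0 disables rolling on file size (in bytes)
a1.sinks.k1.hdfs.rollSize = 0
# 0 disables rolling on number of events
a1.sinks.k1.hdfs.rollCount = 0
```

If your hourly volume is very large, you may prefer to keep a rollSize
somewhat above your block size rather than disabling it entirely.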
With all that said, we recommend writing to Avro container files, as that
format is well suited to use across the Hadoop ecosystem.
Avro has many benefits, which include support for compression, code
generation, versioning and schema evolution.
You can do this with Flume by specifying the avro_event type for the
serializer property in your HDFS sink.
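A minimal sketch of that setup, again with a1 and k1 as placeholder agent
and sink names. I am assuming you pair the serializer with the DataStream
fileType so that the Avro container is written directly rather than
wrapped in a sequence file; the compression line is optional.

```properties
# Write the raw serializer output instead of a sequence file
a1.sinks.k1.hdfs.fileType = DataStream
# Serialize each event (headers and body) into an Avro container file
a1.sinks.k1.serializer = avro_event
# Optional: compress the Avro data blocks
a1.sinks.k1.serializer.compressionCodec = snappy
```

As far as I know, the avro_event serializer stores the event headers
alongside the body, which should cover the headers you want to query later.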
Hope this helps.
On Wed, Oct 30, 2013 at 4:15 PM, Jeremy Karlson <[EMAIL PROTECTED]> wrote:
> Hi everyone.
> I'm trying to set up Flume to log into HDFS. Along the way, Flume
> attaches a number of headers (environment, hostname, etc) that I would also
> like to store with my log messages. Ideally, I'd like to be able to use
> Hive to query all of this later. I must also admit to knowing next to
> nothing about HDFS. That probably doesn't help. :-P
> I'm confused about the HDFS sink configuration. Specifically, I'm trying
> to understand what these two options do (and how they interact):
> File Type:
> DataStream - This appears to write the event body, and loses all headers.
> CompressedStream - I assume just a compressed data stream.
> SequenceFile - I think this is what I want, since it seems to be a
> key/value based thing, which I assume means it will include headers.
> Write Format: This seems to only apply for SequenceFile above, but lots of
> Internet examples seem to state otherwise. I'm also unclear on the
> difference here. Isn't "Text" just a specific type of "Writable" in HDFS?
> Also, I'm unclear on why Flume, by default, seems to be set up to make
> such small HDFS files. Isn't HDFS designed (and more efficient) when
> storing larger files that are closer to the size of a full block? I was
> thinking it made more sense to write all log data to a single file, and
> roll that file hourly (or whatever, depending on volume). Thoughts here?
> Thanks a lot.
> -- Jeremy