I'm interested in knowing how everyone is importing their data into
their production Hive clusters.
Let me explain a little more. At the moment, I have log files (which
are divided into 5 minute chunks, per event type (of which there are
around 10), per server (a few 10s) arriving on one of my Hadoop nodes.
These log files then get glued together by some custom code, into
entity+hour buckets. The code does a few non-utterly trivial things:
* It supports an (almost) atomic file append.
* It parses the timestamp out of each line in the log file to ensure
that it ends up in the correct hour bucket (because some log file
rotation ends up with some events from x:04:59.9 being written in the
* Once an entity+hour bucket hasn't changed for a while, it gets
pushed into Hive.
There's a bug in the code which is proving hard to track down amongst
our high volume logs (hundreds of millions to billions of events per
day), and we're going to shortly replace this architecture anyway with
something based around Kafka/Flume/Storm, but I need an interim
solution to the log aggregation problem.
I can't just load the raw log files into Hive because it will most
probably make the metastore grind to a halt (we easily processes tens
of thousands of log files per day), and having one map task per file
(I think) means the overheads of processing all of these files will be
non-trivial. Is something like Flume the right way to go?
If people are happy to share their data import strategies then that'd be great.