In a similar use case I worked on, record timestamps were not
guaranteed to arrive in any particular order. We used Pig to do
processing similar to what your custom code is doing, and once the
records were in the required timestamp order we pushed them into Hive.
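For illustration, a minimal sketch of that reorder-then-bucket step (in Python here rather than Pig, which did the equivalent at scale); the line format, timestamp format, and function name are all hypothetical:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log line format: "<ISO-timestamp> <payload>"
TS_FMT = "%Y-%m-%dT%H:%M:%S"

def order_and_bucket(lines):
    """Sort log lines by their leading timestamp, then group them into
    hour buckets, so an event rotated into the wrong file still lands
    in the hour its timestamp says it belongs to."""
    ordered = sorted(
        lines, key=lambda l: datetime.strptime(l.split(" ", 1)[0], TS_FMT)
    )
    buckets = defaultdict(list)
    for line in ordered:
        ts = datetime.strptime(line.split(" ", 1)[0], TS_FMT)
        # Bucket key: one directory/partition per hour, e.g. "2012-06-06-16"
        buckets[ts.strftime("%Y-%m-%d-%H")].append(line)
    return buckets

lines = [
    "2012-06-06T17:00:01 next-hour-event",
    "2012-06-06T16:04:59 late-rotated-event",
]
buckets = order_and_bucket(lines)
```

Each resulting hour bucket can then be pushed into the matching Hive partition once it is complete.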
On 06-Jun-2012 4:50 PM, "Philip Tromans" <[EMAIL PROTECTED]> wrote:
> Hi all,
> I'm interested in knowing how everyone is importing their data into
> their production Hive clusters.
> Let me explain a little more. At the moment, I have log files (divided
> into 5-minute chunks, per event type (of which there are around 10),
> per server (a few 10s)) arriving on one of my Hadoop nodes.
> These log files then get glued together by some custom code, into
> entity+hour buckets. The code does a few non-utterly trivial things:
> * It supports an (almost) atomic file append.
> * It parses the timestamp out of each line in the log file to ensure
> that it ends up in the correct hour bucket (because some log file
> rotation ends up with some events from x:04:59.9 being written in the
> wrong file).
> * Once an entity+hour bucket hasn't changed for a while, it gets
> pushed into Hive.
> There's a bug in the code which is proving hard to track down amongst
> our high volume logs (hundreds of millions to billions of events per
> day), and we're going to shortly replace this architecture anyway with
> something based around Kafka/Flume/Storm, but I need an interim
> solution to the log aggregation problem.
> I can't just load the raw log files into Hive because it will most
> probably make the metastore grind to a halt (we easily process tens
> of thousands of log files per day), and having one map task per file
> (I think) means the overheads of processing all of these files will be
> non-trivial. Is something like Flume the right way to go?
> If people are happy to share their data import strategies then that'd be