Essentially, we are instrumenting distributed applications. The format of the
instrumentation messages is defined in an Avro schema. The messages are
transported over a message queue (e.g. RabbitMQ) or, eventually, over Flume,
and dumped into HDFS, from where they are loaded into Hive for querying.
In HDFS we can certainly consolidate the data into a small number of files,
which amortizes the schema overhead there. But I want to know whether we can
reduce network bandwidth by generating valid messages on the client side
without the schema in the header.
Does that make sense?
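
To make the question concrete, here is a minimal sketch of what I mean by a message without the schema. It hand-encodes one record in Avro's binary encoding using only the Python standard library, so the payload carries field values and nothing else. The `Event` schema, its field names, and the helper functions are all hypothetical, made up for illustration; in practice the Avro library's own binary encoder would do this, and the reader would have to obtain the schema out of band.

```python
import json

# Hypothetical schema for one instrumentation event (illustrative names only).
SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "host", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "latency_ms", "type": "long"},
    ],
}


def encode_long(n):
    """Avro long: zigzag-encode, then emit base-128 groups, low group first."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"string": encode_string, "long": encode_long}


def encode_event(event):
    """Avro record: field encodings concatenated in schema order, no names."""
    return b"".join(
        ENCODERS[field["type"]](event[field["name"]])
        for field in SCHEMA["fields"]
    )


if __name__ == "__main__":
    payload = encode_event(
        {"host": "app-01", "timestamp": 1395098220000, "latency_ms": 42}
    )
    # Compare the bare payload with the schema text that a per-message
    # header would have to carry along.
    print(len(payload), "payload bytes")
    print(len(json.dumps(SCHEMA)), "schema bytes")
```

The trade-off in this sketch is that the bytes are not self-describing: a consumer can only decode them if it already knows which schema the producer used.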
On Mon, Mar 17, 2014 at 4:17 PM, Sean Busbey <busbey+[EMAIL PROTECTED]> wrote: