Essentially, we are instrumenting distributed applications. The format of
the instrumentation messages is defined in an Avro schema. The messages are
transported over a message queue (e.g. RabbitMQ) or, eventually, over Flume
and dumped into HDFS, from where they are loaded into Hive for querying.

In HDFS we can certainly collect the data into a small number of files.
But I want to know whether we can minimize network bandwidth by generating
valid messages on the client side without the schema in the header.
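
To make the question concrete, here is a rough sketch of what I have in
mind. It hand-rolls the Avro binary encoding for a made-up two-field record
(the "Event" schema, its field names, and the helper functions are all
hypothetical, not our real schema), so that only the field values go over
the wire and the reader gets the schema out of band. In practice I would
expect to use the Avro library's binary encoder with a datum writer rather
than code like this; the sketch is only meant to show the size difference.

```python
import json

# Hypothetical schema, for illustration only. It would be shared with the
# consumer out of band instead of being sent with every message.
SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "host", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}


def encode_long(n):
    """Avro long: zigzag-encode, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_event(host, timestamp):
    """A record is just its fields, encoded in schema order, with no header."""
    return encode_string(host) + encode_long(timestamp)


def decode_long(buf, pos):
    """Inverse of encode_long; returns (value, next_position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def decode_event(buf):
    """The reader must already know the field order and types (the schema)."""
    length, pos = decode_long(buf, 0)
    host = buf[pos:pos + length].decode("utf-8")
    timestamp, _ = decode_long(buf, pos + length)
    return host, timestamp


payload = encode_event("web01", 1395098220)
schema_bytes = json.dumps(SCHEMA).encode("utf-8")
print(len(payload), "payload bytes vs", len(schema_bytes), "schema bytes")
print(decode_event(payload))
```

The catch is that the bytes are not self-describing: whatever reads them off
the queue has to know exactly which schema the writer used.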

Does that make sense?

Shaq
On Mon, Mar 17, 2014 at 4:17 PM, Sean Busbey <busbey+[EMAIL PROTECTED]> wrote:
 