Splitting logs in hdfs by account
We have a use case that requires us to be able to:

  *   Delete all of a customer's data as it sits in HDFS at a moment's notice
  *   Re-run MapReduce over all of a particular account's data, going way back in time

This is how we're thinking of storing the logs in HDFS:

/hdfs-path-to-data/accnt-1/YYYY-MM-DD.log
/hdfs-path-to-data/accnt-2/YYYY-MM-DD.log
..
I imagine we would need to tune the HDFS block size depending on the size of the logs, and the goal would be to have one log file per account per day (so we don't have a zillion files burdening the namenode).
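
With that layout, both requirements above reduce to operating on a single directory per account. A minimal sketch of what I mean, using the FileSystem API (AccountAdmin and the paths are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AccountAdmin {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Requirement 1: dropping every log file for one account is a single
    // recursive delete of that account's directory
    fs.delete(new Path("/hdfs-path-to-data/accnt-1"), true);

    // Requirement 2: a re-processing job would just take the account's
    // directory (or a glob over it) as its input, e.g.
    //   FileInputFormat.addInputPath(job, new Path("/hdfs-path-to-data/accnt-2/*.log"));
  }
}
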
We currently have large bz2 files, with all account data mingled together, flowing into HDFS.  So I'm thinking the best approach would be a daily MR job that uses MultipleOutputs and creates block-compressed SequenceFiles split by account.  Can MultipleOutputs specify a different output directory for each output file, so that the output files don't have to be copied into the proper account directories after the job completes?
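
Reading the MultipleOutputs javadoc, it looks like the write(key, value, baseOutputPath) overload accepts a '/' in baseOutputPath, so output can land in a subdirectory of the job's output directory; I'd still like confirmation, though. Roughly the reducer I have in mind (SplitByAccountReducer and the split.day key are made-up names):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitByAccountReducer extends Reducer<Text, Text, Text, Text> {

  private MultipleOutputs<Text, Text> out;
  private String day;  // e.g. "2013-02-09", passed in by the driver

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<Text, Text>(context);
    day = context.getConfiguration().get("split.day");  // made-up config key
  }

  @Override
  protected void reduce(Text accountId, Iterable<Text> lines, Context context)
      throws IOException, InterruptedException {
    // A '/' in baseOutputPath puts the file under a subdirectory of the
    // job's output dir, e.g. <output>/accnt-1/2013-02-09-r-00000
    String base = "accnt-" + accountId + "/" + day;
    for (Text line : lines) {
      out.write(accountId, line, base);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    out.close();  // flushes the per-account files
  }
}

The driver would set job.setOutputFormatClass(SequenceFileOutputFormat.class) and SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK) to get the block-compressed SequenceFiles, and since each account key hashes to a single reducer, that should come out to one file per account per day (with an -r-NNNNN suffix rather than a .log extension).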

Is this approach sound?  I thought it would be wise to solicit some feedback here before starting down a path.

Thanks!

Sean
