We have a use case that requires us to be able to:
* Delete all of a customer's data as it sits in HDFS at a moment's notice
* Re-MapReduce all of a particular account's data, going way back in time
This is how we're thinking of storing the logs in HDFS:
I imagine we would need to tune the HDFS block size depending on the size of the logs. The goal would be to have one log file per account per day, so that we don't have a zillion small files burdening the NameNode.
We currently have large bz2 files, with all account data mingled together, flowing into HDFS. So I'm thinking the best approach would be a daily MR job that uses MultipleOutputs and creates block-compressed sequence files split by account. Can MultipleOutputs specify a different output directory for each output file, so that the output files don't have to be copied into the proper account directory after the job completes?
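To make the question concrete, here is a minimal sketch of the part I'd expect to write myself: building a per-account, per-day path relative to the job's output directory. The class name `AccountPaths`, the `<accountId>/<date>/part` layout, and the sample account id are all hypothetical placeholders, not a settled design. My understanding (please correct me if this is wrong) is that the new-API `MultipleOutputs.write(key, value, baseOutputPath)` accepts a base path containing `/` and appends the usual `-r-00000` style suffix, so a string like this could be passed straight in from the reducer.

```java
import java.time.LocalDate;

public class AccountPaths {

    // Builds a path relative to the job output directory.
    // Hypothetical layout: <accountId>/<yyyy-MM-dd>/part
    static String baseOutputPath(String accountId, LocalDate day) {
        // Reject ids that could escape the account's own directory.
        if (accountId == null || accountId.isEmpty()
                || accountId.contains("/") || accountId.contains("..")) {
            throw new IllegalArgumentException("unsafe account id: " + accountId);
        }
        // LocalDate.toString() yields the ISO form, e.g. 2012-03-01.
        return accountId + "/" + day + "/part";
    }

    public static void main(String[] args) {
        // In a reducer this string would be the third argument to
        // MultipleOutputs.write(key, value, baseOutputPath) (assumed usage).
        System.out.println(baseOutputPath("acct42", LocalDate.of(2012, 3, 1)));
        // prints: acct42/2012-03-01/part
    }
}
```

If that works as I hope, deleting a customer would be a recursive delete of one account directory, and a re-run over an account's history would just take that directory as its input path.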
Is this approach sound? I thought it would be wise to solicit some feedback here before heading down this path.