Any suggestions or pointers would be helpful. Are there any best practices?
On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> I wanted to check how people design their storage directories for data
> that is sent to the system continuously. For example, for a given
> function we get a data feed that is continuously written to a SequenceFile,
> then converted to a more structured format using MapReduce and stored in
> tab-separated files. For such a continuous feed, what is the best way to
> organize the directories and their names? Should it be based just on a
> timestamp, or is there something better that helps in organizing the data?
> Second part of the question: is it better to store output in sequence files
> so that we can take advantage of per-record compression? This seems to be
> required, since gzip/snappy compression of an entire file would launch only
> one map task.
> And the last question: when compressing a flat file, should it first be
> split into multiple files so that we get multiple mappers if we need to run
> another job on it? LZO is another alternative, but it requires additional
> configuration. Is it preferred?
> Any articles or suggestions would be very helpful.