Hadoop, mail # user - Design question


Re: Design question
Mohit Anchlia 2012-04-26, 14:43
Any suggestions or pointers would be helpful. Are there any best practices?

On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> I just wanted to check how people design their storage directories for
> data that is sent to the system continuously. For example, for a given
> function we get a data feed continuously written to a SequenceFile, which is
> then converted to a more structured format using MapReduce and stored in
> tab-separated files. For such a continuous feed, what's the best way to
> organize directories and their names? Should it be based just on timestamp,
> or on something better that helps in organizing the data?
>
> Second part of the question: is it better to store output in sequence files
> so that we can take advantage of per-record compression? This seems to be
> required, since gzip/snappy compression of an entire file would launch only
> one map task.
>
> And the last question: when compressing a flat file, should it first be
> split into multiple files so that we get multiple mappers if we need to run
> another job on it? LZO is another alternative, but it requires
> additional configuration. Is it preferred?
>
> Any articles or suggestions would be very helpful.
>
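
To make the timestamp-based option from the first question concrete, here is a minimal sketch of one possible layout, not an established best practice. It assumes a directory per feed, per processing stage, and per UTC hour. The names partition_dir, base, feed and stage, and the example values "/data", "clicks", "raw" and "tsv", are hypothetical and do not come from the thread.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_dir(base, feed, stage, ts):
    """Build base/feed/stage/YYYY/MM/DD/HH for a timezone-aware timestamp.

    The hour bucket is computed in UTC so that directory names do not
    depend on the local time zone of the machine writing the data.
    """
    utc = ts.astimezone(timezone.utc)
    return PurePosixPath(base) / feed / stage / utc.strftime("%Y/%m/%d/%H")


# Hypothetical feed "clicks": raw SequenceFile input and converted TSV output
# land in sibling trees that share the same hourly bucket.
arrived = datetime(2012, 4, 23, 15, 27, tzinfo=timezone.utc)
raw_dir = partition_dir("/data", "clicks", "raw", arrived)
tsv_dir = partition_dir("/data", "clicks", "tsv", arrived)
print(raw_dir)
print(tsv_dir)
```

With a layout like this, a follow-up job can take one hour or one day as input by pointing at a single directory prefix. Whether hourly buckets are the right granularity depends on feed volume, which the thread does not state.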