Hadoop >> mail # user >> Design question


Re: Design question
Any suggestions or pointers would be helpful. Are there any best practices?

On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> I just wanted to check how people design their storage directories for
> data that is sent to the system continuously. For example, for a given
> functionality we get a data feed continuously written to a sequence file,
> which is then converted to a more structured format using MapReduce and
> stored in tab-separated files. For such a continuous feed, what's the best
> way to organize directories and names? Should it be based just on the
> timestamp, or on something better that helps in organizing the data?
>
> Second part of the question: is it better to store output in sequence files
> so that we can take advantage of per-record compression? This seems to be
> required, since gzip/snappy compression of an entire file would launch only
> one map task.
>
> And the last question: when compressing a flat file, should it first be
> split into multiple files so that we get multiple mappers if we need to run
> another job on this file? LZO is another alternative, but it requires
> additional configuration. Is it preferred?
>
> Any articles or suggestions would be very helpful.
>
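
For the directory question, one possible layout (a sketch only, not an
established convention from this thread) is to partition by feed name,
processing stage, and UTC date/hour, so that a job can pick up one time
window by path. The root "/data", the feed name "clickstream", and the
stage names "raw" and "tsv" below are hypothetical, as is the helper
partition_path itself.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_path(root, feed, stage, ts, hourly=True):
    """Build a time-partitioned directory path for one feed and stage.

    All names are illustrative assumptions: `root` is a base directory,
    `feed` identifies the data source, `stage` separates raw input from
    converted output, and `ts` is a timezone-aware datetime.
    """
    if ts.tzinfo is None:
        # Refuse naive timestamps so partitions never depend on local time.
        raise ValueError("ts must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    parts = [root, feed, stage, f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}"]
    if hourly:
        parts.append(f"{ts:%H}")
    return str(PurePosixPath(*parts))


if __name__ == "__main__":
    when = datetime(2012, 4, 23, 15, 27, tzinfo=timezone.utc)
    # Raw sequence files and converted tab-separated output share the
    # same time partition under different stage directories.
    print(partition_path("/data", "clickstream", "raw", when))
    print(partition_path("/data", "clickstream", "tsv", when, hourly=False))
```

Keeping the stage above the date makes it easy to point a conversion job at
one hour of raw input and write to the matching partition of the output
stage. Whether hourly or daily partitions fit better depends on feed volume.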