MapReduce >> mail # user >> Re: Concatenate multiple sequence files into 1 big sequence file


Re: Concatenate multiple sequence files into 1 big sequence file
Here's a great tool for exactly what you're looking for:
https://github.com/edwardcapriolo/filecrush

On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:
> Hi Hadoop users,
>
> I have been trying to concatenate multiple sequence files into one.
> Since the total size of the sequence files is quite big (1TB), I won't use
> MapReduce because it would require 1TB on the reducer host to hold the
> temporary data.
>
> I ended up doing what has been suggested in this thread:
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%[EMAIL PROTECTED]%3E
>
> It works very well. I wonder if there is a faster way to append to a
> sequence file.
>
> Currently, the code looks like this (opening and closing of the sequence
> files, exception handling, etc. omitted):
>
> // each seq is a sequence file
> // writer is a sequence file writer
> for (val seq : seqs) {
>     reader = new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
>     // copy every record, deserializing and re-serializing each key and value
>     while (reader.next(readerKey, readerValue)) {
>         writer.append(readerKey, readerValue);
>     }
> }
>
> Is there a better way to do this? I think it is wasteful to deserialize
> and re-serialize the key and value in the while loop, because the program
> simply appends them to the sequence file. Also, I can't seem to read and
> write fast enough (about 6MB/sec).
>
> Any advice is appreciated,
>
>
> Jerry
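
On the deserialization point raised in the quoted message: the following is a minimal sketch, not taken from this thread, of copying records as raw bytes with SequenceFile.Reader.nextRaw and SequenceFile.Writer.appendRaw, so that keys and values are never turned into objects. The class name RawSequenceFileConcat and the method concat are hypothetical. It assumes the option-style Hadoop 2.x API already used in the quoted loop, and that every input file has the same key class, value class, and compression settings as the first one. File metadata is not carried over. Whether this improves on the reported 6MB/sec has not been measured here.

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;

// Hypothetical helper, for illustration only.
public class RawSequenceFileConcat {

    /**
     * Copies every record of the input files into one output file as raw bytes.
     * Returns the number of records copied.
     */
    public static long concat(Configuration conf, List<Path> inputs, Path output)
            throws IOException {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("need at least one input file");
        }
        SequenceFile.Writer writer = null;
        long records = 0;
        try {
            for (Path input : inputs) {
                try (SequenceFile.Reader reader =
                         new SequenceFile.Reader(conf, SequenceFile.Reader.file(input))) {
                    if (writer == null) {
                        // Assumption: all inputs match the first file's key/value
                        // classes and compression type and codec.
                        writer = SequenceFile.createWriter(conf,
                            SequenceFile.Writer.file(output),
                            SequenceFile.Writer.keyClass(reader.getKeyClass()),
                            SequenceFile.Writer.valueClass(reader.getValueClass()),
                            SequenceFile.Writer.compression(
                                reader.getCompressionType(),
                                reader.getCompressionCodec()));
                    }
                    DataOutputBuffer rawKey = new DataOutputBuffer();
                    SequenceFile.ValueBytes rawValue = reader.createValueBytes();
                    // nextRaw returns -1 at end of file
                    while (reader.nextRaw(rawKey, rawValue) != -1) {
                        writer.appendRaw(rawKey.getData(), 0, rawKey.getLength(), rawValue);
                        rawKey.reset(); // nextRaw appends to the key buffer
                        records++;
                    }
                }
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return records;
    }
}
```

Raw copying only works when the writer's compression settings match those of the data being read, which is why the sketch takes them from the first reader instead of from configuration.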