
MapReduce >> mail # user >> Re: Concatenate multiple sequence files into 1 big sequence file

Re: Concatenate multiple sequence files into 1 big sequence file
Here's a great tool for exactly what you're looking for.

On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:
> Hi Hadoop users,
> I have been trying to concatenate multiple sequence files into one.
> Since the total size of the sequence files is quite big (1 TB), I won't use
> MapReduce, because it would require 1 TB on the reducer host to hold the
> temporary data.
> I ended up doing what have been suggested in this thread:
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%[EMAIL PROTECTED]%3E
> It works very well. I wonder if there is a faster way to append to a
> sequence file.
> Currently, the code looks like this (omit opening and closing sequence
> files, exception handling etc):
> // each seq is a sequence file
> // writer is a sequence file writer
> for (FileStatus seq : seqs) {
>     reader = new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
>     while (reader.next(readerKey, readerValue)) {
>         writer.append(readerKey, readerValue);
>     }
> }
> Is there a better way to do this? Note that I think it is wasteful to
> deserialize and serialize the key and value in the while loop, because the
> program simply appends to the sequence file. Also, I don't seem to be able
> to read and write fast enough (about 6 MB/sec).
> Any advice is appreciated,
> Jerry
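The reply never names the tool, so as a hedged aside: Hadoop's own SequenceFile API exposes raw record access (SequenceFile.Reader.nextRawKey/nextRawValue and SequenceFile.Writer.appendRaw) that moves key/value bytes without the deserialize/serialize round trip Jerry is worried about. The sketch below illustrates the same raw-copy idea using only the JDK, with a made-up length-prefixed record format standing in for the real sequence-file block layout; class and method names here are invented for illustration, not part of Hadoop.

```java
import java.io.*;

// Toy illustration of raw record copying, the same idea behind Hadoop's
// nextRawKey/nextRawValue + appendRaw: records are moved as opaque byte
// runs and are never parsed into key/value objects.
public class RawConcat {

    // Write one length-prefixed record (our toy on-disk format).
    static void writeRecord(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Copy every record from `in` to `out` as raw bytes; returns the count.
    static int copyRaw(DataInputStream in, DataOutputStream out) throws IOException {
        int copied = 0;
        while (true) {
            int len;
            try {
                len = in.readInt();   // record length prefix
            } catch (EOFException e) {
                break;                // clean end of this input
            }
            byte[] buf = new byte[len];
            in.readFully(buf);        // raw payload, never deserialized
            out.writeInt(len);
            out.write(buf);
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Build two in-memory "files" of records.
        ByteArrayOutputStream f1 = new ByteArrayOutputStream();
        ByteArrayOutputStream f2 = new ByteArrayOutputStream();
        writeRecord(new DataOutputStream(f1), "alpha".getBytes("UTF-8"));
        writeRecord(new DataOutputStream(f2), "beta".getBytes("UTF-8"));
        writeRecord(new DataOutputStream(f2), "gamma".getBytes("UTF-8"));

        // Concatenate both into one output without decoding any record.
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(merged);
        int n = 0;
        n += copyRaw(new DataInputStream(new ByteArrayInputStream(f1.toByteArray())), out);
        n += copyRaw(new DataInputStream(new ByteArrayInputStream(f2.toByteArray())), out);
        System.out.println(n + " records merged"); // prints "3 records merged"
    }
}
```

With the real API, the per-record serde cost disappears, but every byte still flows through one client; if the inputs share compression and key/value types, checking whether a single reducer-less copy like this is I/O-bound (network or disk) is the next thing to profile, since 6 MB/sec suggests the bottleneck is not serialization alone.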