Re: Concatenate multiple sequence files into 1 big sequence file
Here's a great tool for exactly what you're looking for:
https://github.com/edwardcapriolo/filecrush
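
If you want to avoid MapReduce and keep your single-writer approach, you can
also skip the deserialize/serialize round trip you mention: SequenceFile
exposes a raw record API (Reader.nextRawKey/nextRawValue and Writer.appendRaw)
that copies the serialized bytes straight through. A minimal sketch, assuming
your writer was created with the same key/value classes and compression type
as the inputs (otherwise the raw bytes won't line up), and reusing the seqs,
conf, and writer from your snippet:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;

// Copies records as raw bytes, so keys and values are never deserialized.
// Assumes reader and writer share key/value classes and compression type.
DataOutputBuffer rawKey = new DataOutputBuffer();
for (FileStatus seq : seqs) {
    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(seq.getPath()));
    SequenceFile.ValueBytes rawValue = reader.createValueBytes();
    while (reader.nextRawKey(rawKey) != -1) {   // -1 signals end of file
        reader.nextRawValue(rawValue);
        writer.appendRaw(rawKey.getData(), 0, rawKey.getLength(), rawValue);
        rawKey.reset();                         // reuse the key buffer
    }
    reader.close();
}

That removes the per-record serialization cost, but the 6 MB/sec you're seeing
is probably bound by pushing every byte through one client anyway, which is
why a distributed tool like filecrush tends to win at the 1TB scale.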

On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:
> Hi Hadoop users,
>
> I have been trying to concatenate multiple sequence files into one.
> Since the total size of the sequence files is quite big (1TB), I won't use
> MapReduce, because it would require 1TB on the reducer host to hold the
> temporary data.
>
> I ended up doing what has been suggested in this thread:
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%[EMAIL PROTECTED]%3E
>
> It works very well. I wonder if there is a faster way to append to a
> sequence file.
>
> Currently, the code looks like this (omitting the opening and closing of the
> sequence files, exception handling, etc.):
>
> // each seq is a FileStatus for one input sequence file;
> // writer is a SequenceFile.Writer for the combined output
> for (FileStatus seq : seqs) {
>     reader = new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
>     while (reader.next(readerKey, readerValue)) {
>         writer.append(readerKey, readerValue);
>     }
> }
>
> Is there a better way to do this? Note that I think it is wasteful to
> deserialize and serialize the key and value in the while loop, because the
> program simply appends to the sequence file. Also, I don't seem to be able to
> read and write fast enough (about 6 MB/sec).
>
> Any advice is appreciated,
>
>
> Jerry