

Re: Concatenate multiple sequence files into 1 big sequence file
Hi guys,

Thank you for all the advice here. I really appreciate it.

I read through the filecrush code and found that it does exactly what I'm
currently doing. The concatenation logic resides in CrushReducer.java, in
these lines:

while (reader.next(key, value)) {
    sink.write(key, value);
    reporter.incrCounter(ReducerCounter.RECORDS_CRUSHED, 1);
}

I wonder if there is a faster way to do this, preferably one that only
streams a set of sequence files into the final sequence file.
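
For reference, SequenceFile also exposes a raw-record API
(Reader.nextRawKey/nextRawValue and Writer.appendRaw) that streams each
record as opaque bytes, skipping the per-record deserialize/serialize
step. A minimal, untested sketch, assuming every input uses the same
key/value classes and compression type/codec as the output writer:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;

public class RawSeqConcat {
    // Copies every record of `input` onto an already-open writer as raw bytes.
    public static void copyRaw(Configuration conf, Path input,
                               SequenceFile.Writer writer) throws IOException {
        SequenceFile.Reader reader =
            new SequenceFile.Reader(conf, SequenceFile.Reader.file(input));
        try {
            DataOutputBuffer rawKey = new DataOutputBuffer();
            SequenceFile.ValueBytes rawValue = reader.createValueBytes();
            // nextRawKey returns the key length, or -1 at end of file
            while (reader.nextRawKey(rawKey) != -1) {
                reader.nextRawValue(rawValue);
                writer.appendRaw(rawKey.getData(), 0, rawKey.getLength(), rawValue);
                rawKey.reset(); // the buffer accumulates across calls otherwise
            }
        } finally {
            reader.close();
        }
    }
}

Even with raw copies, a single reader/writer pair is still bound by one
sequential HDFS read and one write stream, so serialization may not be
the only bottleneck.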

Best Regards,
Jerry
On Tue, Sep 10, 2013 at 11:20 AM, Adam Muise <[EMAIL PROTECTED]> wrote:

> Jerry,
>
> It might not help with this particular file, but you might consider the
> approach used at BlackBerry when dealing with your data. They block-compress
> data into small Avro files and then concatenate those into large Avro files
> without decompressing. Check out the Boom file format here:
>
> https://github.com/blackberry/hadoop-logdriver
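>
> As a rough illustration of that concatenate-without-decompressing idea,
> Avro's DataFileWriter can append the blocks of another file verbatim via
> appendAllFrom (with recompress=false the input's schema and codec must
> match the writer's). The class and method names below are illustrative,
> not taken from hadoop-logdriver:
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import org.apache.avro.file.DataFileStream;
> import org.apache.avro.file.DataFileWriter;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericRecord;
>
> public class AvroAppend {
>     // Appends every data block of `part` onto an already-open writer.
>     static void appendFile(DataFileWriter<GenericRecord> writer, File part)
>             throws IOException {
>         DataFileStream<GenericRecord> in = new DataFileStream<GenericRecord>(
>             new FileInputStream(part), new GenericDatumReader<GenericRecord>());
>         try {
>             writer.appendAllFrom(in, false); // false = keep blocks compressed
>         } finally {
>             in.close();
>         }
>     }
> }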
>
> For now, use filecrush:
> https://github.com/edwardcapriolo/filecrush
>
> Cheers,
>
>
>
>
> On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:
>
>> Hi Hadoop users,
>>
>> I have been trying to concatenate multiple sequence files into one.
>> Since the total size of the sequence files is quite large (1 TB), I won't
>> use MapReduce: a single-reducer job would need 1 TB on the reducer host to
>> hold the temporary data.
>>
>> I ended up doing what was suggested in this thread:
>> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%[EMAIL PROTECTED]%3E
>>
>> It works very well. I wonder if there is a faster way to append to a
>> sequence file.
>>
>> Currently, the code looks like this (omitting opening and closing the
>> sequence files, exception handling, etc.):
>>
>> // each seq is an input sequence file (typed FileStatus here, since it
>> // exposes getPath()); writer is the output sequence file writer
>> for (FileStatus seq : seqs) {
>>     reader = new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
>>     while (reader.next(readerKey, readerValue)) {
>>         writer.append(readerKey, readerValue);
>>     }
>> }
>>
>> Is there a better way to do this? Note that I think it is wasteful to
>> deserialize and serialize the key and value in the while loop, because the
>> program simply appends to the sequence file. Also, I don't seem to be able
>> to read and write fast enough (about 6 MB/sec).
>>
>> Any advice is appreciated,
>>
>>
>> Jerry
>>
>
>
>
> --
> *Adam Muise*
> Solution Engineer
> *Hortonworks*
> [EMAIL PROTECTED]
> 416-417-4037
>
> Hortonworks - Develops, Distributes and Supports Enterprise Apache Hadoop. <http://hortonworks.com/>
>
> Hortonworks Virtual Sandbox <http://hortonworks.com/sandbox>
>
> Hadoop: Disruptive Possibilities by Jeff Needham <http://hortonworks.com/resources/?did=72&cat=1>
>