-
question about processing large zip
Andrew McNair 2012-03-20, 00:26
Hi,
I have a large (~300 gig) zip of images that I need to process. My current workflow is to copy the zip to HDFS, use a custom input format to read the zip entries, do the processing in a map, and then generate a processing report in the reduce. I'm struggling to tune params right now with my cluster to make everything run smoothly, but I'm also worried that I'm missing a better way of processing.
Does anybody have suggestions for how to make the processing of a zip more parallel? The only other idea I had was uploading the zip as a sequence file, but that proved incredibly slow (~30 hours on my 3 node cluster to upload).
Thanks in advance.
-Andrew
-
Re: question about processing large zip
Robert Evans 2012-03-21, 15:37
How are you splitting the zip right now? Do you have multiple mappers, where each mapper starts at the beginning of the zip and reads forward to the point it cares about, or do you have just one mapper? If you are doing it the first way, you may want to increase your replication factor. Alternatively, you could use multiple zip files, one per mapper that you want to launch.
--Bobby Evans
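The "multiple zip files, one per mapper" suggestion can be sketched as a pre-processing step run before the upload to HDFS. This is a minimal illustration using only Python's standard `zipfile` module, not something from the thread: the function name `split_zip`, the output naming scheme, and the greedy "smallest part first" balancing are all assumptions. Entries are stored uncompressed on the assumption that image files are already compressed.

```python
import shutil
import zipfile


def split_zip(src_path, out_prefix, num_parts):
    """Distribute the entries of one zip across num_parts smaller zips.

    Hypothetical helper: returns the list of part file paths it wrote.
    """
    part_paths = [f"{out_prefix}-{i:03d}.zip" for i in range(num_parts)]
    # ZIP_STORED: assume entries (images) would not shrink further.
    parts = [
        zipfile.ZipFile(p, "w", zipfile.ZIP_STORED, allowZip64=True)
        for p in part_paths
    ]
    sizes = [0] * num_parts  # uncompressed bytes written to each part
    try:
        with zipfile.ZipFile(src_path) as src:
            for info in src.infolist():
                if info.is_dir():
                    continue
                # Greedy balancing: send the entry to the smallest part so far.
                i = sizes.index(min(sizes))
                # Stream the entry so a large image is never held in memory.
                with src.open(info) as fin, \
                        parts[i].open(info.filename, "w", force_zip64=True) as fout:
                    shutil.copyfileobj(fin, fout)
                sizes[i] += info.file_size
    finally:
        for part in parts:
            part.close()
    return part_paths
```

Choosing `num_parts` close to the number of map slots in the cluster would give each mapper one whole file to read from start to finish, so no mapper has to skip over entries that belong to another.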
-
Re: question about processing large zip
Joshua Smith 2012-03-26, 01:19
As I understand it, zip isn't a splittable format. You might consider using bzip2 or another splittable compression format.
Alternatively, you could have one job that does the decompression chained to another that does the processing to get the parallelization.
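The "decompress first, then process" idea can be illustrated on a single machine. The sketch below is only a local analogy in plain Python, not Hadoop job chaining: the names `extract_stage`, `process_stage`, and `process_file` are hypothetical, and `process_file` is a placeholder for the real per-image work. It assumes entry names are ordinary relative paths, so the extracted location matches the name stored in the zip.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor


def extract_stage(zip_path, out_dir):
    """Stage 1 (sequential): unpack every entry to its own file."""
    with zipfile.ZipFile(zip_path) as zf:
        names = [i.filename for i in zf.infolist() if not i.is_dir()]
        zf.extractall(out_dir)
    # Assumes plain relative entry names (no leading "/" or "..").
    return [os.path.join(out_dir, n) for n in names]


def process_file(path):
    """Placeholder for the per-image work; here it only reports the size."""
    return path, os.path.getsize(path)


def process_stage(paths, workers=4):
    """Stage 2 (parallel): each extracted file is an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_file, paths))
```

Calling `process_stage(extract_stage("images.zip", "unpacked"))` would run the two stages back to back. The point of the split is that only the first stage is bound to the archive's sequential layout; once the entries are separate files, the second stage can use as many workers as are available.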