Re: CombineHiveInputFormat and Merge files not working for compressed text files
Igor Tatarinov 2011-11-30, 08:07
I might be wrong, but I think EMR inserts a reduce stage when writing data
into S3. At least in my case, I am able to produce a single output file with:
SET mapred.reduce.tasks = 1;
INSERT OVERWRITE TABLE price_history_s3
without using a combined input format. The number of mappers _is_ determined
by the number of input files, but I don't think you can use a combined input
format with gzip files, since gzip is not splittable.
Perhaps you could run a separate query for each partition?
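For the record, a per-partition compaction along those lines can be sketched roughly as below. The table names, partition column, and column list are hypothetical (not from this thread); the idea is just to force a reduce stage with a single reducer so each partition is rewritten as one file:

```sql
-- Force a single reducer so the rewritten partition comes out as one file.
SET mapred.reduce.tasks = 1;

-- price_history_merged, price_history, dt, and the column list are all
-- hypothetical names used for illustration.
INSERT OVERWRITE TABLE price_history_merged PARTITION (dt = '2011-11-29')
SELECT col1, col2, col3
FROM price_history
WHERE dt = '2011-11-29'
DISTRIBUTE BY dt;   -- DISTRIBUTE BY introduces a shuffle, which adds the
                    -- reduce stage; with one reducer, all rows land in one file
```

Running one such query per partition (e.g. from a driver script) sidesteps the combined-input-format limitation for gzip input, at the cost of one job per partition.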
On Tue, Nov 29, 2011 at 11:18 PM, Mohit Gupta <[EMAIL PROTECTED]
> Hi All,
> I am using Hive 0.7 on Amazon EMR. I need to merge a large number of small
> files into a few larger files (basically merging a number of partitions of
> a table into one). The obvious query (insert into a new partition, select *
> from all partitions) generates a large number of small files in the new
> partition: a map-only job whose number of output files equals the number of
> mappers.
> Note: the table being processed here is stored in compressed format on S3.
> set hive.exec.compress.output=true;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> set io.seqfile.compression.type=BLOCK;
> I found a couple of solutions on the net, but sadly neither works for me:
> 1. Merging small files
> I set the following parameters:
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=true;
> set hive.merge.size.per.task=256000000;
> set hive.merge.smallfiles.avgsize=1000000000;
> Ideally, a merge job should have run after the map-only job to combine the
> small output files into a small number of files, but I could see no such
> job.
> 2. Using CombineHiveInputFormat
> Parameters Set:
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size.per.node=1000000000;
> set mapred.min.split.size.per.rack=1000000000;
> set mapred.max.split.size=1000000000;
> Ideally, the number of mappers created should have been considerably
> smaller than the number of input files, thereby producing a small number
> of output files (one per mapper). But I found the same number of mappers
> as input files.
> Approx size of each small file: 125 KB
> Number of small files: >6k
> I found a couple of links saying that this merging did not work for
> compressed files but that it has since been fixed.
> Any ideas on how I can fix this?
> Thanks in advance.
> Best Regards,
> Mohit Gupta
> Software Engineer at Vdopia Inc.