Re: CombineHiveInputFormat and Merge files not working for compressed text files
I might be wrong, but I think EMR inserts a reduce job when writing data
into S3. At least in my case, I am able to create a single output file by

SET mapred.reduce.tasks = 1;
INSERT OVERWRITE TABLE price_history_s3
...

without using a combined input format. The number of mappers _is_ determined
by the number of input files, but I think you can't use a combined input
format with Gzip files.

Perhaps you could run a separate query for each partition?
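
For example, one SET plus one INSERT per partition, along these lines (the
partition column "dt" and the source table "price_history" are placeholder
names, not from your setup):

SET mapred.reduce.tasks = 1;
-- one statement per partition; the single reducer writes a single file
INSERT OVERWRITE TABLE price_history_s3 PARTITION (dt = '2011-11-01')
SELECT col1, col2
FROM price_history
WHERE dt = '2011-11-01';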

igor
decide.com
On Tue, Nov 29, 2011 at 11:18 PM, Mohit Gupta <[EMAIL PROTECTED]> wrote:

> Hi All,
> I am using Hive 0.7 on Amazon EMR. I need to merge a large number of small
> files into a few larger files (basically merging a number of partitions of
> a table into one). On running the obvious query, i.e. an INSERT into a new
> partition selecting from all partitions, a large number of small files are
> generated in the new partition (a map-only job with as many output files
> as mappers).
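> For concreteness, the query is shaped roughly like this (the table, column,
> and partition names are made up for illustration):
>
> INSERT OVERWRITE TABLE events_merged PARTITION (day = 'all')
> SELECT col1, col2
> FROM events;  -- scans all partitions of the source table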
>
> Note: The table being processed here is stored in compressed format on s3.
> set hive.exec.compress.output = true;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> set io.seqfile.compression.type = BLOCK;
>
> I found a couple of solutions on the net, but sadly neither of them works for me:
> 1. Merging small files
> I set the following parameters:
> set hive.merge.mapfiles=true;
> set hive.merge.size.per.task=256000000;
> set hive.merge.smallfiles.avgsize=100000000;
> set hive.merge.mapredfiles=true;
> set hive.merge.smallfiles.avgsize=1000000000;
> set hive.merge.size.smallfiles.avgsize=1000000000;
>
> Ideally, there should have been a reduce job after the map-only job to
> merge the small output files into a small number of files. But I could see
> no reduce job.
>
> 2. Using CombineHiveInputFormat
> Parameters Set:
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size.per.node=1000000000;
> set mapred.min.split.size.per.rack=1000000000;
> set mapred.max.split.size=1000000000;
>
> Ideally, the number of mappers created here should have been considerably
> smaller than the number of input files, producing a small number of output
> files (one per mapper). But I found the same number of mappers as input
> files.
>
> ------
> Specifics:
> Approximate size of each small file: 125 KB
> Number of small files: > 6,000
>
> I found a couple of links saying that this merging did not work for
> compressed files but has since been fixed.
> Any ideas on how I can fix this?
>
> Thanks in advance.
>
> --
> Best Regards,
>
> Mohit Gupta
> Software Engineer at Vdopia Inc.
>
>
>