Hive user mailing list: Hive produces very small files despite hive.merge...=true settings


Re: Hive produces very small files despite hive.merge...=true settings
Leo:
You may find this helpful:
http://indoos.wordpress.com/2010/06/24/hive-remote-debugging/
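
For the simpler route of just turning up Hive's own log verbosity (rather than attaching a remote debugger), the CLI accepts a logger override on the command line; a minimal sketch, assuming a standard Hive installation:

  # start the Hive CLI with DEBUG-level logging sent to the console
  hive -hiveconf hive.root.logger=DEBUG,console

The extra output may help show why a conditional stage such as the merge job was filtered out at runtime.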

On Thu, Nov 18, 2010 at 2:57 PM, Leo Alekseyev <[EMAIL PROTECTED]> wrote:

> Hi Ning,
> For the dataset I'm experimenting with, the total size of the output
> is 2 MB, and the files are at most a few KB in size.  My
> hive.input.format was set to the default HiveInputFormat; however, when I
> set it to CombineHiveInputFormat, it only made the first stage of the
> job use fewer mappers.  The merge job was *still* filtered out at
> runtime.  I also tried "set hive.mergejob.maponly=false"; that didn't
> have any effect.
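
For reference, switching the input format as described above is done per session in the Hive CLI (the fully-qualified class name is from the standard Hive distribution):

  set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;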
>
> I am a bit at a loss what to do here.  Is there a way to see what's
> going on exactly using e.g. debug log levels?..  Btw, I'm also using
> dynamic partitions; could that somehow be interfering with the merge
> job?..
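
A minimal sketch of the kind of dynamic-partition insert being discussed, with the merge flags from this thread enabled; the table and column names are placeholders rather than anything from the original job:

  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  -- sample a small slice of a large table into dynamically created partitions
  INSERT OVERWRITE TABLE sample_table PARTITION (dt)
  SELECT col1, col2, dt
  FROM big_table
  WHERE rand() < 0.001;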
>
> I'm running a relatively fresh Hive from trunk (built maybe a month ago).
>
> --Leo
>
> On Thu, Nov 18, 2010 at 1:12 PM, Ning Zhang <[EMAIL PROTECTED]> wrote:
> > The settings look good. The parameter hive.merge.size.smallfiles.avgsize
> > is used to determine at run time whether a merge should be triggered: if the
> > average size of the files in the partition is SMALLER than the parameter and
> > there is more than one file, the merge should be scheduled. Can you check
> > whether you have any big files as well in your resulting partition? If a
> > very large file is raising the average, you can set the parameter large enough.
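
To check for that, one can list the sizes of the files in the resulting partition and, if a single large file is skewing the average, raise the threshold; a sketch, with a placeholder warehouse path:

  -- from a shell, list per-file sizes in the partition directory:
  --   hadoop fs -du /user/hive/warehouse/my_table/dt=2010-11-18
  -- in the Hive session, raise the average-size threshold if needed:
  set hive.merge.size.smallfiles.avgsize=256000000;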
> >
> > Another possibility is that your Hadoop installation does not support
> > CombineHiveInputFormat, which is used for the new merge job. Someone
> > previously reported that the merge was not successful because of this. If
> > that's the case, you can turn off CombineHiveInputFormat and use the old
> > HiveInputFormat (though slower) by setting hive.mergejob.maponly=false.
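
That fallback is a single session setting; as described above, the merge stage then runs as a regular map-reduce job using the old HiveInputFormat instead of a map-only job over CombineHiveInputFormat:

  set hive.mergejob.maponly=false;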
> >
> > Ning
> > On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:
> >
> >> I have jobs that sample (or generate) a small amount of data from a
> >> large table.  At the end, I get about 3,000 or more files of roughly
> >> 1 KB each.  This becomes a nuisance.  How can I make Hive do another
> >> pass to merge the output?  I have the following settings:
> >>
> >> hive.merge.mapfiles=true
> >> hive.merge.mapredfiles=true
> >> hive.merge.size.per.task=256000000
> >> hive.merge.size.smallfiles.avgsize=16000000
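
Expressed as session commands, with the sizes spelled out (values as quoted above):

  -- merge small files produced by map-only jobs
  set hive.merge.mapfiles=true;
  -- merge small files produced by map-reduce jobs
  set hive.merge.mapredfiles=true;
  -- target size of each merged file (~256 MB)
  set hive.merge.size.per.task=256000000;
  -- trigger a merge when the average output file size is below ~16 MB
  set hive.merge.size.smallfiles.avgsize=16000000;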
> >>
> >> After setting hive.merge* to true, Hive started indicating "Total
> >> MapReduce jobs = 2".  However, after generating the
> >> lots-of-small-files table, Hive says:
> >> Ended Job = job_201011021934_1344
> >> Ended Job = 781771542, job is filtered out (removed at runtime).
> >>
> >> Is there a way to force the merge, or am I missing something?
> >> --Leo
> >
> >
>