Re: Partition performance
1) Each partition object is a row in the metastore (usually MySQL). Querying
large tables with many partitions has a longer startup time because the Hive
query planner has to fetch and process all of this metadata. This is not a
distributed process. It is usually fast, within a few seconds, but for tables
with a very large number of partitions it can be slow.

2) Hadoop's small files problem (worth googling). Small files add much more
overhead to a given MapReduce job: generally, the more files/partitions, the
more map/reduce tasks. More tasks mean more overhead, and more overhead means
less throughput.
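
A rough sketch of the arithmetic behind both points, in Python. It is a toy
model, not how Hive computes splits: it assumes one metastore row per
partition, one file per partition, a 128 MB split size, and an input format
that never combines small files into a shared split. The 100 GB total and the
helper names (map_tasks, layout) are made up for illustration.

```python
SPLIT_SIZE = 128 * 1024 * 1024  # assumed split size (128 MB), not from this thread


def map_tasks(file_sizes, split_size=SPLIT_SIZE):
    """Estimate map tasks for a list of file sizes in bytes.

    Every non-empty file gets at least one split; larger files are cut
    into ceil(size / split_size) splits. Splits never span files here.
    """
    return sum(-(-size // split_size) for size in file_sizes if size > 0)


def layout(total_bytes, partitions, files_per_partition=1):
    """Spread total_bytes evenly over partitions * files_per_partition files."""
    n_files = partitions * files_per_partition
    return [total_bytes // n_files] * n_files


TOTAL = 100 * 1024 ** 3  # hypothetical 100 GB table

# Baseline: the same data in a single large file.
print("unpartitioned:", map_tasks([TOTAL]), "map tasks")

# The partition counts mentioned in this thread.
for partitions in (1_000, 10_000, 1_200_000):
    files = layout(TOTAL, partitions)
    print(partitions, "partitions:",
          partitions, "metastore rows to fetch,",
          map_tasks(files), "map tasks")
```

Under these assumptions the data volume never changes, but both the number of
metastore rows the planner has to read and the number of map tasks grow
linearly with the partition count.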

::SHAMELESS PLUG:: We discuss this in detail in the book Programming Hive, in
the schema design section.

On Wed, Jul 3, 2013 at 8:19 AM, David Morel <[EMAIL PROTECTED]> wrote:

> On 2 Jul 2013, at 16:51, Owen O'Malley wrote:
> > On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <
> > [EMAIL PROTECTED]> wrote:
> >
> >> Hi Owen,
> >>
> >> I’m curious about this advice about partitioning. Is there some
> >> fundamental reason why Hive is slow when the number of partitions
> >> is 10,000 rather than 1,000?
> >>
> >
> > The precise numbers don't matter. I wanted to give people a ballpark range
> > that they should be looking at. Most tables at 1000 partitions won't cause
> > big slow downs, but the cost scales with the number of partitions. By the
> > time you are at 10,000 the cost is noticeable. I have one customer who has
> > a table with 1.2 million partitions. That causes a lot of slow downs.
>
> That is still not really answering the question, which is: why is it slower
> to run a query on a heavily partitioned table than it is on the same number
> of files in a less heavily partitioned table?
> David