Hive >> mail # user >> Partition performance


Re: Partition performance
1) Each partition object is a row in the metastore, usually MySQL. Querying
large tables with many partitions has a longer startup time because the Hive
query planner has to fetch and process all of this meta-information. This is
not a distributed process. It is usually fast, within a few seconds, but for a
very large number of partitions it can be slow.

2) Hadoop's small files problem (google that). Small files end up being
much more overhead for a given MapReduce job: generally, the more
files/partitions, the more map/reduce tasks. More tasks mean more
overhead, and more overhead means less throughput.
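
To put rough numbers on point 2, here is a toy cost model in Python. It is only a sketch under assumed figures: the function name, the 2-second task startup cost, the 50 MB/s read rate, the 128 MiB split size and the 100 task slots are all made up for illustration and are not measurements from Hive or Hadoop. It also ignores the planner cost from point 1 and any input format that combines small files.

```python
import math


def estimate_job_seconds(total_bytes, num_files, slots,
                         task_startup_s=2.0,           # assumed fixed cost per task
                         bytes_per_s=50e6,             # assumed read rate per task
                         split_bytes=128 * 1024**2):   # assumed split size
    """Toy model: every file produces at least one map task."""
    avg_file_bytes = total_bytes / num_files
    # A file larger than one split is cut into several tasks;
    # a small file still costs one whole task.
    tasks_per_file = max(1, math.ceil(avg_file_bytes / split_bytes))
    num_tasks = num_files * tasks_per_file
    per_task_s = task_startup_s + (total_bytes / num_tasks) / bytes_per_s
    # Tasks run in waves because only `slots` of them run at once.
    waves = math.ceil(num_tasks / slots)
    return num_tasks, waves * per_task_s


# Same 100 GiB of data, laid out as 1,000 files versus 100,000 files.
few = estimate_job_seconds(100 * 1024**3, 1_000, slots=100)
many = estimate_job_seconds(100 * 1024**3, 100_000, slots=100)
print("1,000 files:   tasks=%d, est. seconds=%.1f" % few)
print("100,000 files: tasks=%d, est. seconds=%.1f" % many)
```

In this model the data volume is identical in both cases; the only thing that changes is how many times the fixed per-task cost is paid, which is the effect described above.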

::SHAMELESS PLUG:: We discuss this in detail in the book Programming Hive, in
the schema design section.

On Wed, Jul 3, 2013 at 8:19 AM, David Morel <[EMAIL PROTECTED]> wrote:

> On 2 Jul 2013, at 16:51, Owen O'Malley wrote:
>
> > On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <
> > [EMAIL PROTECTED]> wrote:
> >
> >> Hi Owen,
> >>
> >> I’m curious about this advice about partitioning. Is there some
> >> fundamental reason why Hive is slow when the number of partitions
> >> is 10,000 rather than 1,000?
> >>
> >
> > The precise numbers don't matter. I wanted to give people a ballpark range
> > that they should be looking at. Most tables at 1000 partitions won't cause
> > big slow downs, but the cost scales with the number of partitions. By the
> > time you are at 10,000 the cost is noticeable. I have one customer who has
> > a table with 1.2 million partitions. That causes a lot of slow downs.
>
> That is still not really answering the question, which is: why is it slower
> to run a query on a heavily partitioned table than it is on the same number
> of files in a less heavily partitioned table?
>
> David
>