Hive, mail # user - What are all the factors that go into the number of mappers - ORC


Re: What are all the factors that go into the number of mappers - ORC
John Omernik 2014-02-04, 01:16
No, the size is closer to 10GB; the difference between the tables is only
around 2000 bytes. I will try to get exact numbers for you soon. I am
traveling right now, but I'll get you better data to work with shortly.

Thanks!

On Mon, Feb 3, 2014 at 12:22 AM, Prasanth Jayachandran <
[EMAIL PROTECTED]> wrote:

> Hi John
>
> The number of mappers is equal to the number of splits generated. The
> following factors go into split generation:
> 1) HDFS block size
> 2) Max split size
>
> A split is cut when
> 1) the cumulative size of all adjacent stripes is greater than the HDFS
> block size, or
> 2) the cumulative size of all adjacent stripes is greater than the max
> split size
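
The rule above can be sketched in a few lines of Python. This is only a simplified model of the description in this message, not the actual ORC input format code: the function name `count_splits` and the sample sizes are hypothetical, and the sketch ignores details such as block boundaries and any minimum split size.

```python
MB = 1024 ** 2

def count_splits(stripe_sizes, block_size, max_split_size):
    """Count splits by accumulating adjacent stripes until a limit is exceeded.

    Simplified model: a split is cut as soon as the running total is greater
    than the HDFS block size or greater than the max split size.
    """
    limit = min(block_size, max_split_size)  # exceeding either limit cuts a split
    splits = 0
    current = 0
    for size in stripe_sizes:
        current += size
        if current > limit:
            splits += 1
            current = 0
    if current > 0:
        splits += 1  # leftover stripes form the final split
    return splits

# Hypothetical example: a tiny 2000-byte file versus forty 256 MB stripes.
tiny = count_splits([2000], block_size=512 * MB, max_split_size=256 * MB)
large = count_splits([256 * MB] * 40, block_size=512 * MB, max_split_size=256 * MB)
print(tiny, large)
```

Under this model a file of 2000 bytes always yields a single split, while a roughly 10GB table yields a number of splits that depends on both limits, so a different block size or max split size between two clusters would change the mapper count.
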
>
> The HDFS block size for ORC files will be min(1.5GB, 2*stripe_size) in the
> current version of Hive (and probably Hive 0.12 too). In older versions,
> HDFS block size = min(2GB, 2*stripe_size).
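
As a worked example of the two formulas, here is a small Python sketch. The helper name `orc_block_size` and the stripe sizes are illustrative assumptions; only the `min(cap, 2*stripe_size)` form and the 1.5GB and 2GB caps come from the message.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def orc_block_size(stripe_size, cap=int(1.5 * GB)):
    """Block size as described above: min(cap, 2 * stripe_size).

    cap defaults to 1.5GB (the newer behaviour); pass 2 * GB to model
    the older versions mentioned in the message.
    """
    return min(cap, 2 * stripe_size)

# Illustrative stripe sizes, not values taken from the thread.
print(orc_block_size(256 * MB) // MB)            # newer cap, 256 MB stripes
print(orc_block_size(1 * GB) // MB)              # newer cap applies
print(orc_block_size(1 * GB, cap=2 * GB) // MB)  # older cap
```

The two versions only differ once the stripe size exceeds 768 MB, because below that `2*stripe_size` is smaller than either cap.
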
>
> The other important thing to note is that ORC splits are generated only
> when HiveInputFormat is used. By default, Hive uses CombineHiveInputFormat,
> which uses a different strategy to generate splits. In
> CombineHiveInputFormat, many small files are combined together to form a
> large logical split.
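
To illustrate the idea of combining small files into one logical split, here is a rough Python sketch. It is a greedy packing model assumed for illustration only, with a hypothetical function name `combine_files`; the real combine input format also takes node and rack locality into account, which this sketch ignores.

```python
def combine_files(file_sizes, max_split_size):
    """Greedily pack files into logical splits of at most max_split_size.

    A file larger than max_split_size still gets a split of its own.
    """
    splits = []
    current = []
    current_size = 0
    for size in file_sizes:
        # Start a new split if adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

# Hypothetical file sizes in MB with a 60 MB limit.
print(combine_files([10, 20, 30, 40], 60))
```

With this strategy the number of mappers depends on how files pack into the size limit rather than on stripe boundaries, which is why the configured input format matters when comparing mapper counts.
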
>
> In any case, for the size you mentioned (2000 bytes) there should be
> only one mapper. Can you provide the values for the following configs so
> that we can understand it better?
>
> 1) hive.input.format
> 2) hive.min.split.size
> 3) hive.max.split.size
> 4) total size on disk for the table
>
> Thanks
> Prasanth Jayachandran
>
> On Feb 2, 2014, at 5:25 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>
> > I have two small dev clusters, and I loaded the same
> dataset into both of them. The data sizes on disk are within 2000 bytes of
> each other. Both are ORC; one is Hive 11 and one is Hive 12. One is
> allocating about 8 more mappers to the exact same query. I am just curious
> what settings would change that. I checked through all my settings, but
> can't see what would cause the discrepancy. Is this an ORC v11 vs v12 thing?
> >
> > I'd be curious to hear the thoughts of the group.
>
>