Hive >> mail # user >> Real-life experience of forcing smaller input splits?


Earlier messages in this thread (collapsed):
David Morel 2013-01-25, 06:16
Nitin Pawar 2013-01-25, 06:47
Mathieu Despriee 2013-01-25, 07:44
Edward Capriolo 2013-01-25, 07:46
Bertrand Dechoux 2013-01-25, 09:37
Re: Real-life experience of forcing smaller input splits?
On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:

> It seems to me the question has not been answered:
> "is it possible, yes or no, to force a smaller split size
> than a block on the mappers"
>
> Not that I know of (though you could implement something to do it), but why
> would you want to?
> By default, if the split size is set under the size of a block, it will be a
> block.
> One of the reasons is data locality. The second is that a block is written
> to a single hard drive (leaving replicas aside), so if n mappers were
> reading n parts of the same block, they would share that drive's
> bandwidth... So it is not a clear win.
>
> You can change the block size of the file you want to read, but using a
> smaller block size is really an anti-pattern. Most people increase the
> block size.
> (Note: a file's block size is fixed when the file is written, and it can
> differ between two files.)
>
> Are you trying to handle data that are too small?
> If Hive supports multi-threading for mappers, that might be a solution, but
> I don't know the configuration for that.

The files are RCFiles with a block size of 128MB IIRC, but the file
compression achieves a ratio of nearly 1 to 100. When going through the
mapper, there is simply not enough memory available to it. Since the
compression scheme is BLOCK, I expected it would be possible to instruct
Hive to process only a limited number of fragments instead of everything
that's in the file in one go.

David
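For readers hitting the same memory pressure, a hedged sketch of the usual knobs: Hadoop's FileInputFormat computes each split as max(minSize, min(maxSize, blockSize)), so capping mapred.max.split.size below the block size can yield smaller-than-block splits for splittable formats such as RCFile. The table name and the 16 MB value are illustrative assumptions, not settings confirmed in this thread:

```sql
-- Sketch, assuming a Hive version whose input format honors
-- mapred.max.split.size via the FileInputFormat split computation.
SET mapred.max.split.size=16777216;   -- 16 MB, illustrative value
SET mapred.min.split.size=1;
SELECT count(*) FROM my_table;        -- hypothetical query/table
```

Whether this helps depends on the input format in use (CombineHiveInputFormat merges small files and may need the same cap), so treat it as a starting point for experimentation rather than a confirmed fix.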
Other messages in this thread (collapsed):
Edward Capriolo 2013-01-25, 07:44
David Morel 2013-01-25, 12:28
Dean Wampler 2013-01-25, 13:39