David Morel
2013-01-25, 06:16
Nitin Pawar
2013-01-25, 06:47
Mathieu Despriee
2013-01-25, 07:44
Edward Capriolo
2013-01-25, 07:44
Edward Capriolo
2013-01-25, 07:46
Bertrand Dechoux
2013-01-25, 09:37
David Morel
2013-01-25, 09:53
David Morel
2013-01-25, 12:28
Dean Wampler
2013-01-25, 13:39
-
Real-life experience of forcing smaller input splits? David Morel 2013-01-25, 06:16
Hello,

I have seen many posts on various sites and MLs, but didn't find a firm answer anywhere: is it possible, yes or no, to force a smaller split size than a block on the mappers, from the client side? I'm not after pointers to the docs (unless you're very very sure :-) but after real-life experience along the lines of 'yes, it works this way, I've done it like this...'

All the parameters that I could find (especially specifying a max input split size) seem to have no effect, and the files that I have are so heavily compressed that they completely saturate the mappers' memory when processed.

A solution I could imagine for this specific issue is reducing the block size, but for now I simply went with disabling in-file compression for those files. And changing the block size on a per-file basis is something I'd like to avoid if at all possible.

All the Hive settings that we tried only got me as far as raising the number of mappers from 5 to 6 (yay!) where I would have needed at least ten times more.

Thanks!

D.Morel
-
Re: Real-life experience of forcing smaller input splits? Nitin Pawar 2013-01-25, 06:47
set mapred.min.split.size=1024000;
set mapred.max.split.size=4096000;
set hive.merge.mapfiles=false;

I had set the above values, and setting the max split size to a lower value did increase my number of maps. My block size was 128 MB. The only difference is that my files on HDFS were not heavily compressed, and I was using RCFileFormat.

If you have heavily compressed files, I would suggest checking what the size will be after decompression and allocating more memory to the maps.

Nitin Pawar
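For readers wondering why lowering the max split size can raise the mapper count, here is a minimal Python sketch of the split arithmetic. It assumes the rule used by Hadoop's new-API FileInputFormat, split size = max(minSize, min(maxSize, blockSize)), together with its 10% slop on the last split. The function names are made up for illustration; CombineHiveInputFormat, the old mapred API, or a non-splittable codec may behave differently.

```python
# Sketch of FileInputFormat-style split arithmetic (illustrative names,
# not Hadoop's actual API). Assumes a splittable file format.
SPLIT_SLOP = 1.1  # the last split may run up to 10% over the split size


def compute_split_size(block_size, min_size, max_size):
    """Split size = max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_length, split_size):
    """Count the splits carved out of one file of file_length bytes."""
    splits = 0
    remaining = file_length
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the tail, at most 1.1 * split_size
    return splits


BLOCK = 128 * 1024 * 1024  # 128 MB block, as in the message above
size = compute_split_size(BLOCK, min_size=1024000, max_size=4096000)
print(size, count_splits(BLOCK, size))
```

Under these assumptions, one 128 MB file yields 33 splits of about 4 MB each, whereas leaving the max split size at its very large default yields a single block-sized split.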
-
Re: Real-life experience of forcing smaller input splits? Mathieu Despriee 2013-01-25, 07:44
Hi David,

What file format and compression type are you using?

Mathieu
-
Re: Real-life experience of forcing smaller input splits? Edward Capriolo 2013-01-25, 07:44
In most cases you want bigger splits, because having lots of small tasks plays havoc with the job tracker. I have found that jobs with thousands of short-lived map tasks tend to monopolize the slots.

In other versions of Hive the default was not CombineHiveInputFormat. I think in most cases you want to make sure that it is your default.
-
Re: Real-life experience of forcing smaller input splits? Edward Capriolo 2013-01-25, 07:46
Not all files are splittable. Sequence files are. Raw gzip files are not.
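The difference between a raw gzip file and a block-compressed container can be illustrated with a small Python sketch. It uses only the standard gzip module on made-up sample data; the independently compressed chunks are a rough stand-in for what block-compressed SequenceFiles or RCFiles do, not a reproduction of either format.

```python
import gzip
import zlib

# Made-up sample data; any reasonably large payload would do.
records = [f"row-{i},value-{i * 7}\n".encode() for i in range(20000)]
payload = b"".join(records)


def readable_from(blob, offset):
    """True if gzip can decode blob starting at byte `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


# One gzip stream over the whole file: only offset 0 is a valid entry point.
whole = gzip.compress(payload)

# Block-style compression: each chunk of records is compressed on its own,
# so a reader can start at any chunk boundary.
CHUNK = 5000
blocks = [gzip.compress(b"".join(records[i:i + CHUNK]))
          for i in range(0, len(records), CHUNK)]

print(readable_from(whole, 0))                # start of the stream
print(readable_from(whole, len(whole) // 2))  # somewhere mid-stream
print(gzip.decompress(blocks[2])[:20])        # third block, decoded alone
```

A mapper handed the second half of the single gzip stream has nowhere to start decoding, which is why such a file ends up in one split regardless of the split-size settings.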
-
Re: Real-life experience of forcing smaller input splits? Bertrand Dechoux 2013-01-25, 09:37
It seems to me the question has not been answered:
"is it possible yes or no to force a smaller split size than a block on the mappers"

Not that I know of (but you could implement something to do it), but why would you do it? By default, if the split is set under the size of a block, it will be a block. One reason is data locality. The second is that a block is written to a single hard drive (leaving replicas aside), so if n mappers were reading n parts of the same block, they would share that drive's bandwidth... So it is not a clear win.

You can change the block size of the file you want to read, but using a smaller block size is really an anti-pattern. Most people increase the block size. (Note: the block size of a file is fixed when the file is written, and it can differ between two files.)

Are you trying to handle data which are too small? If Hive supports multi-threading for mappers, it might be a solution. But I don't know the configuration for that.

Regards

Bertrand

PS: the question is quite general and not really Hive related
-
Re: Real-life experience of forcing smaller input splits? David Morel 2013-01-25, 09:53
On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:
> Are you trying to handle data which are too small?
> If hive supports multi-threading for mapper it might be an solution. But I
> don't the configuration for that.

The files are RCFiles with a block size of 128 MB IIRC, but the file compression achieves a ratio of nearly 1 to 100. When going through the mapper, there is simply not enough memory available to it. Since the compression scheme is BLOCK, I expected it would be possible to instruct Hive to process only a limited number of fragments instead of everything that's in the file in one go.

David
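To put rough numbers on this, the Python sketch below works through the arithmetic for a 128 MB block at a 100:1 ratio. The 1 GB per-mapper budget is a hypothetical figure chosen only for illustration, and the helper names are made up. A reader that streams data does not hold a whole split in memory at once, so treat this as a sizing estimate for the volume of data per mapper, not a heap requirement.

```python
import math

MB = 1024 * 1024


def uncompressed_per_split(split_bytes, ratio):
    """Approximate bytes a mapper decodes from one compressed split."""
    return split_bytes * ratio


def split_size_for_budget(budget_bytes, ratio):
    """Largest compressed split that decodes to at most budget_bytes."""
    return budget_bytes // ratio


def mappers_needed(file_bytes, split_bytes):
    """Number of splits needed to cover file_bytes."""
    return math.ceil(file_bytes / split_bytes)


RATIO = 100         # "nearly 1 to 100", from the message above
BLOCK = 128 * MB    # block size quoted in the message
BUDGET = 1024 * MB  # hypothetical budget of uncompressed data per mapper

per_block = uncompressed_per_split(BLOCK, RATIO)
target = split_size_for_budget(BUDGET, RATIO)
print(per_block // MB, "MB uncompressed per block-sized split")
print(target, "bytes max split size,",
      mappers_needed(BLOCK, target), "mappers per block")
```

With these figures a block-sized split expands to 12,800 MB, and keeping each mapper near the assumed budget would need a max split size of roughly 10 MB, or 13 mappers per block, in line with the "ten times more" mappers asked for at the start of the thread.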
-
Re: Real-life experience of forcing smaller input splits? David Morel 2013-01-25, 12:28
On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:
> You can change the block size of the file you want to read but using
> smaller block size is really an anti-pattern. Most people increase the
> block size.

That will be my approach for now, or disabling compression altogether for these files. The only problem I have is that compression is so efficient that any operation in the mapper (so on the uncompressed data) just makes the mapper throw an OOM exception, no matter how much memory I give it.

What partly works, though, is setting a low mapred.max.split.size. In a directory containing 34 files, I get 33 mappers (???). When setting hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs blocksize default), it doesn't seem to have any effect and I get only 20 mappers.

> PS : the question is quite general and not really hive related

I realized that after re-reading the whole thread :-)

Thanks for all the answers, everyone!

David
-
Re: Real-life experience of forcing smaller input splits? Dean Wampler 2013-01-25, 13:39
> That will be my approach for now, or disabling compression altogether for
> these files. The only problem I have is that compression is so efficient
> that any operation in the mapper (so on the uncompressed data) just makes
> the mapper throw an OOM exception, no matter how much memory I give it.

You can still use compression if you use a splittable format, like bzip2 with block compression. Gzip isn't splittable.

If you're running out of memory, you could also increase the heap size for the client VMs. See the "Real-World Cluster Configurations" section of this page:
http://hadoop.apache.org/docs/r1.0.3/cluster_setup.html

By the way, you could also experiment with turning on intermediate compression, i.e. compression of the data sent between the mapper and reducer tasks, compression of the output, etc., as discussed here:
https://cwiki.apache.org/Hive/adminmanual-configuration.html
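As a rough illustration of the knobs mentioned above, the Python sketch below renders a few Hadoop 1.x and Hive properties as Hive `set` statements. It assumes the heap advice refers to the per-task JVM option mapred.child.java.opts, which the message does not name explicitly. The helper function is made up, and the values are placeholders to tune, not recommendations.

```python
def render_hive_settings(settings):
    """Render a dict of properties as Hive `set` statements, one per line."""
    return "\n".join(f"set {key}={value};" for key, value in settings.items())


# Placeholder values; tune them for the cluster at hand.
settings = {
    "mapred.child.java.opts": "-Xmx2048m",      # heap for each task JVM (assumed target)
    "mapred.compress.map.output": "true",       # compress map output sent to reducers
    "hive.exec.compress.intermediate": "true",  # compress data between Hive stages
    "hive.exec.compress.output": "true",        # compress the final query output
}

print(render_hive_settings(settings))
```

The printed lines can be pasted into a Hive session or script ahead of the query they should affect.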