-
Unsplittable files on HDFS
Niels Basjes 2011-04-27, 07:55
Hi,
In some scenarios the input for your MapReduce job consists of gzipped files (Apache log files are a common example). Some of those files are several hundred megabytes, so HDFS splits them into several blocks.
When looking at a real 116MiB file on HDFS I see this (4 nodes, replication = 2):

Total number of blocks: 2
25063947863662497: 10.10.138.62:50010 10.10.138.61:50010
1014249434553595747: 10.10.138.64:50010 10.10.138.63:50010

As you can see the file has been distributed over all 4 nodes.
When actually reading those files they are unsplittable due to the nature of the Gzip codec. So a job will (in the above example) ALWAYS need to pull "the other half" of the file over the network. If the file is bigger and the cluster is bigger, the percentage of the file that goes over the network will probably increase.
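To put a rough number on this, here is a small sketch of the arithmetic. It assumes a 64 MiB block size (not stated above, though it is consistent with the two blocks reported for the 116MiB file) and that the single task reading the file runs on a node holding exactly one full block of it. The function name is made up for illustration.

```python
import math

MIB = 1024 * 1024

def remote_fraction(file_size, block_size):
    """Fraction of an unsplittable file that is read over the network.

    Assumption: the one task that reads the file runs on a node that
    holds exactly one full block locally; all other blocks are remote.
    """
    blocks = math.ceil(file_size / block_size)
    if blocks <= 1:
        return 0.0  # the file fits in a single block, so it can be fully local
    local_bytes = min(block_size, file_size)
    return (file_size - local_bytes) / file_size

# The 116 MiB file from the example above, with the assumed 64 MiB blocks.
print(f"{remote_fraction(116 * MIB, 64 * MIB):.0%}")
# A hypothetical 1600 MiB file with the same block size.
print(f"{remote_fraction(1600 * MIB, 64 * MIB):.0%}")
```

Under these assumptions the 116 MiB file sends a bit under half of its bytes over the network, and the share grows quickly as the file spans more blocks.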
If I could tell HDFS that a ".gz" file should always be "100% local" to the node that does the processing, this would dramatically reduce the network IO during the job, especially if you want to run several jobs against the same input.
So my question is: Is there a way to force/tell HDFS to make sure that a datanode that has blocks of this file must always have ALL blocks of this file?
-- Best regards,
Niels Basjes
-
Re: Unsplittable files on HDFS
Harsh J 2011-04-27, 09:06
Hey Niels,
The block size is a per-file property. Would putting/creating these gzip files on the DFS with a very high block size (so that such a file is not split across several blocks) be a valid solution to your problem here?
-- Harsh J
-
Re: Unsplittable files on HDFS
Niels Basjes 2011-04-27, 09:48
Hi,
I did the following with a 1.6GB file:

hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes

and I got:

Total number of blocks: 1
4189183682512190568: 10.10.138.61:50010 10.10.138.62:50010

Yes, that does the trick. Thank you.
Niels
-- Kind regards,
Niels Basjes
-
Re: Unsplittable files on HDFS
Steve Loughran 2011-04-27, 11:09
On 27/04/11 10:48, Niels Basjes wrote:
> Hi,
>
> I did the following with a 1.6GB file
> hadoop fs -Ddfs.block.size=2147483648 -put
> /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes
> and I got
>
> Total number of blocks: 1
> 4189183682512190568: 10.10.138.61:50010 10.10.138.62:50010
>
> Yes, that does the trick. Thank you.
>
> Niels
>
> 2011/4/27 Harsh J <[EMAIL PROTECTED]>:
>> Hey Niels,
>>
>> The block size is a per-file property. Would putting/creating these
>> gzip files on the DFS with a very high block size (such that it
>> doesn't split across for such files) be a valid solution to your
>> problem here?
>>

Don't set a block size >2GB: not all the bits of the code that use signed 32-bit integers have been eliminated yet.
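One way to follow this advice is to derive the block size from the file size and refuse anything that does not fit in a signed 32-bit integer. The sketch below is only an illustration: it assumes the block size has to be a multiple of the checksum chunk size (taken here as 512 bytes), and the function name and the example file size are hypothetical. Note that the 2147483648 used earlier in this thread is 2^31, which is already one more than the largest signed 32-bit value (2147483647).

```python
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
CHUNK = 512             # assumed checksum chunk size in bytes

def single_block_size(file_size, chunk=CHUNK):
    """Smallest block size that holds the whole file in one block.

    The result is rounded up to a multiple of `chunk`. Raises ValueError
    when that size would not fit in a signed 32-bit integer.
    """
    rounded = -(-file_size // chunk) * chunk   # round up to a multiple of chunk
    max_safe = (INT32_MAX // chunk) * chunk    # largest multiple below 2^31
    if rounded > max_safe:
        raise ValueError(f"file needs {rounded} bytes, above the safe limit {max_safe}")
    return max(rounded, chunk)

# Hypothetical file size in bytes; the real size of the 1.6GB file is not given.
size = single_block_size(1_600_000_000)
print(f"hadoop fs -Ddfs.block.size={size} -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes")
```

Files that are too large for a single safe block would still span several blocks, so this only helps up to roughly 2GB per file.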