Re: Unsplittable files on HDFS
Niels Basjes 2011-04-27, 09:48
I did the following with a 1.6GB file:
hadoop fs -Ddfs.block.size=2147483648 -put
and I got:
Total number of blocks: 1
4189183682512190568: 10.10.138.61:50010 10.10.138.62:50010
Yes, that does the trick. Thank you.
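
For anyone reading along, here is a rough sketch of why the larger block size helps. It only does the arithmetic; it does not talk to HDFS. The 64 MiB default block size is an assumption (it matches the 116MiB file ending up in 2 blocks), the 1_600_000_000 byte size is only a stand-in for "1.6GB", and the function names are made up for illustration.

```python
MIB = 1024 * 1024

def blocks_needed(file_size_bytes, block_size_bytes):
    """Number of blocks a file occupies for a given per-file block size."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division, avoids floating point rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

def nodes_with_whole_file(block_locations):
    """Datanodes that hold a replica of every block of the file."""
    node_sets = [set(nodes) for nodes in block_locations.values()]
    if not node_sets:
        return set()
    return set.intersection(*node_sets)

# Block listings copied from this thread (block id -> datanodes).
before = {
    "25063947863662497": ["10.10.138.62:50010", "10.10.138.61:50010"],
    "1014249434553595747": ["10.10.138.64:50010", "10.10.138.63:50010"],
}
after = {
    "4189183682512190568": ["10.10.138.61:50010", "10.10.138.62:50010"],
}

# Assumed 64 MiB default block size for the 116MiB file.
print(blocks_needed(116 * MIB, 64 * MIB))            # 2
# Illustrative size for the "1.6GB" file, with dfs.block.size=2147483648.
print(blocks_needed(1_600_000_000, 2147483648))      # 1
print(sorted(nodes_with_whole_file(before)))
print(sorted(nodes_with_whole_file(after)))
```

With the two-block layout no datanode holds the complete file, so a map task reading the gzip file must fetch part of it remotely. With a single block, every replica holder has the whole file, so a task scheduled on one of those nodes can read it entirely from local disk.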
2011/4/27 Harsh J <[EMAIL PROTECTED]>:
> Hey Niels,
> The block size is a per-file property. Would putting/creating these
> gzip files on the DFS with a very high block size (so that such files
> are not split across multiple blocks) be a valid solution to your
> problem here?
> On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes <[EMAIL PROTECTED]> wrote:
>> In some scenarios you have gzipped files as input for your MapReduce
>> job (Apache logfiles are a common example).
>> Now some of those files are several hundred megabytes and as such will
>> be split by HDFS into several blocks.
>> When looking at a real 116MiB file on HDFS I see this (4 nodes, replication = 2):
>> Total number of blocks: 2
>> 25063947863662497: 10.10.138.62:50010 10.10.138.61:50010
>> 1014249434553595747: 10.10.138.64:50010 10.10.138.63:50010
>> As you can see the file has been distributed over all 4 nodes.
>> When actually reading those files they are unsplittable due to the
>> nature of the Gzip codec.
>> So a job will (in the above example) ALWAYS need to pull "the other
>> half" of the file over the network. If the file is bigger and the
>> cluster is bigger, then the percentage of the file that goes over the
>> network will probably increase.
>> Now if I could tell HDFS that a ".gz" file should always be "100% local"
>> for the node that will be doing the processing, this would reduce the
>> network IO during the job dramatically,
>> especially if you want to run several jobs against the same input.
>> So my question is: Is there a way to force/tell HDFS to make sure that
>> a datanode that has blocks of this file must always have ALL blocks of
>> this file?
>> Best regards,
>> Niels Basjes
> Harsh J
Kind regards,