Hadoop >> mail # user >> Unsplittable files on HDFS

Re: Unsplittable files on HDFS

I did the following with a 1.6GB file:
   hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes
and I got

Total number of blocks: 1

Yes, that does the trick. Thank you.
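[Editorial note, not from the thread: HDFS requires the block size to be a multiple of the checksum chunk (io.bytes.per.checksum, 512 bytes by default). A quick sketch for picking the smallest valid per-file block size that keeps a whole file in one block, assuming that default:]

```python
def single_block_size(file_bytes, checksum_chunk=512):
    """Smallest valid dfs.block.size (a multiple of the checksum
    chunk, assumed here to be the 512-byte default) that still
    keeps the whole file in a single HDFS block."""
    chunks = -(-file_bytes // checksum_chunk)  # ceiling division
    return chunks * checksum_chunk

# A 1.6 GB gzip stays in one block with any block size >= this:
print(single_block_size(1_600_000_000))  # 1600000000
```

[In practice a round power of two such as 2147483648 (the 2 GiB used above) is simpler than computing an exact minimum.]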


2011/4/27 Harsh J <[EMAIL PROTECTED]>:
> Hey Niels,
> The block size is a per-file property. Would putting/creating these
> gzip files on the DFS with a very high block size (such that it
> doesn't split across for such files) be a valid solution to your
> problem here?
> On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes <[EMAIL PROTECTED]> wrote:
>> Hi,
>> In some scenarios you have gzipped files as input for your MapReduce
>> job (Apache log files are a common example).
>> Some of those files are several hundred megabytes and as such will
>> be split by HDFS into several blocks.
>> When looking at a real 116MiB file on HDFS I see this (4 nodes, replication = 2):
>> Total number of blocks: 2
>> 25063947863662497
>> 1014249434553595747
>> As you can see, the file has been distributed over all 4 nodes.
>> When actually reading those files they are unsplittable due to the
>> nature of the Gzip codec.
>> So a job will (in the above example) ALWAYS need to pull "the other
>> half" of the file over the network. If the file is bigger and the
>> cluster is bigger, the percentage of the file that goes over the
>> network will probably increase.
>> Now if I could tell HDFS that a ".gz" file should always be "100% local"
>> to the node that will be doing the processing, this would reduce the
>> network IO during the job dramatically.
>> Especially if you want to run several jobs against the same input.
>> So my question is: Is there a way to force/tell HDFS to make sure that
>> a datanode that has blocks of this file must always have ALL blocks of
>> this file?
>> --
>> Best regards,
>> Niels Basjes
> --
> Harsh J

Kind regards,

Niels Basjes
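
[Editorial aside, a back-of-the-envelope model of my own and not from the thread: the intuition that the remote-read percentage grows with file and cluster size can be sketched as follows. Assume the scheduler places the single mapper where the first block is local, and that each remaining block has a replica on that node with probability roughly replication/nodes:]

```python
def expected_remote_fraction(blocks, nodes, replication):
    """Rough estimate of the fraction of an unsplittable file that a
    single mapper must read over the network: the first block is local
    (the task was scheduled there), and each remaining block is local
    with probability ~ replication/nodes."""
    if blocks <= 1:
        return 0.0
    p_local = min(replication / nodes, 1.0)
    return (blocks - 1) / blocks * (1.0 - p_local)

# The 2-block example from the thread (4 nodes, replication 2):
print(expected_remote_fraction(2, 4, 2))    # 0.25
# A bigger file on a bigger cluster pushes this toward 1 - replication/nodes:
print(expected_remote_fraction(20, 40, 3))
```

[With the whole file in a single block, as in the -Ddfs.block.size trick above, this fraction drops to zero.]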