Hadoop, mail # user - Unsplittable files on HDFS


Re: Unsplittable files on HDFS
Niels Basjes 2011-04-27, 09:48
Hi,

I did the following with a 1.6GB file:
   hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes
and I got

Total number of blocks: 1
4189183682512190568: 10.10.138.61:50010 10.10.138.62:50010

Yes, that does the trick. Thank you.

Niels
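For anyone reading this in the archive: the block report pasted above can be reproduced with fsck. This is a sketch, not part of the original thread; it requires a running HDFS cluster of that era, and the path is the one used in this thread.

```shell
# Print the per-block report for the uploaded file, including which
# datanodes hold each block. A single-block result confirms the gzip
# file can be read entirely locally on one datanode.
# (Requires a running cluster; path taken from this thread.)
hadoop fsck /user/nbasjes/access-2010-11-29.log.gz -files -blocks -locations
```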

2011/4/27 Harsh J <[EMAIL PROTECTED]>:
> Hey Niels,
>
> The block size is a per-file property. Would putting/creating these
> gzip files on the DFS with a very high block size (so that such files
> don't get split across blocks) be a valid solution to your problem
> here?
>
> On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> In some scenarios you have gzipped files as input for your MapReduce
>> job (Apache logfiles are a common example).
>> Some of those files are several hundred megabytes and as such will
>> be split by HDFS into several blocks.
>>
>> When looking at a real 116MiB file on HDFS I see this (4 nodes, replication = 2)
>>
>> Total number of blocks: 2
>> 25063947863662497:           10.10.138.62:50010         10.10.138.61:50010
>> 1014249434553595747:   10.10.138.64:50010               10.10.138.63:50010
>>
>> As you can see, the file has been distributed over all 4 nodes.
>>
>> When actually reading those files, they are unsplittable due to the
>> nature of the Gzip codec.
>> So a job will (in the above example) ALWAYS need to pull "the other
>> half" of the file over the network. If the file is bigger and the
>> cluster is bigger, the percentage of the file that goes over the
>> network will probably increase.
>>
>> Now if I could tell HDFS that a ".gz" file should always be "100% local"
>> to the node that will be doing the processing, this would reduce the
>> network IO during the job dramatically.
>> Especially if you want to run several jobs against the same input.
>>
>> So my question is: is there a way to force/tell HDFS to make sure that
>> a datanode that has blocks of this file always has ALL blocks of
>> this file?
>>
>> --
>> Best regards,
>>
>> Niels Basjes
>>
>
>
>
> --
> Harsh J
>
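Harsh's suggestion above can be sketched as a small shell helper. This is a sketch, not from the thread: the helper name `block_size_for` is made up, and it assumes the default `io.bytes.per.checksum` of 512, since `dfs.block.size` must be a multiple of the checksum chunk size.

```shell
# block_size_for SIZE: round SIZE (bytes) up to the next multiple of 512
# so it is a legal dfs.block.size (the block size must be a multiple of
# the checksum chunk, io.bytes.per.checksum, which defaults to 512).
block_size_for() {
  local size=$1
  echo $(( (size + 511) / 512 * 512 ))
}

# Hypothetical usage against the file from this thread:
#   size=$(stat -c %s /home/nbasjes/access-2010-11-29.log.gz)
#   hadoop fs -Ddfs.block.size=$(block_size_for "$size") -put \
#       /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes
```

Rounding up rather than passing the raw file size keeps the whole file in one block while satisfying the multiple-of-512 constraint.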

--
With kind regards,

Niels Basjes