Re: Block vs FileSplit vs record vs line
Mohammad Tariq 2013-03-14, 11:42
Just to add to what Manish sir has said, HDFS blocks and MR FileSplits are
two different things. A block is the physical unit in which HDFS stores a
file. A FileSplit is just a logical division of your data, and each split
goes to one map task for processing. How the splits are created depends on
the InputFormat you use. What you do not get is a mapper for each record or
each line. For example, if you process a huge CSV file with (say) 1 million
rows, you won't get 1 million mappers, as that would add a lot of overhead.
The rows are grouped into a much smaller number of splits, and map() is
called once for each record in a split. The framework tries to do
everything as efficiently as possible.
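
To make the split vs record distinction concrete, here is a small sketch in
plain Python (not Hadoop code). It assumes the usual FileInputFormat rule,
split size = max(minSize, min(maxSize, blockSize)), and a line-oriented
reader that skips a partial first line and reads past the end of its split
to finish the last line. The function names are made up for illustration,
and things like the split slop factor, compression and custom delimiters
are left out.

```python
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Same shape as the FileInputFormat rule:
    # max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))


def make_splits(file_len, split_size):
    """Return (start, length) byte ranges that cover the file.

    Simplified: the real framework lets the last split run slightly
    over the split size instead of creating a tiny trailing split.
    """
    return [(start, min(split_size, file_len - start))
            for start in range(0, file_len, split_size)]


def read_records(data, start, length):
    """Yield the lines (records) owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # The first line found from here belongs to the previous split,
        # which reads past its own end to finish it. Skip it.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    # A line that starts at or before `end` belongs to this split,
    # even if its bytes continue into the next split.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        yield data[pos:stop].decode()
        pos = stop + 1


if __name__ == "__main__":
    data = b"a,1\nbb,2\nccc,3\ndddd,4\n"
    # Pretend the block size is 8 bytes so the tiny file gets several splits.
    for start, length in make_splits(len(data), compute_split_size(8)):
        # One map task per split; map() would run once per record.
        print((start, length), list(read_records(data, start, length)))
```

On this 22-byte sample you get three splits and four records. Each record
is read exactly once even though "bb,2" crosses a split boundary, and the
last split ends up with no records at all. So the number of mappers follows
the number of splits, not the number of lines.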
On Thu, Mar 14, 2013 at 4:59 PM, Manish Bhoge <[EMAIL PROTECTED]> wrote:
> Each file is divided into splits as per the map input format, and each
> split is handled by one map task. You rightly stated 1 split = 1 block =
> 1 map, which is the default case. A record is whatever the RecordReader
> code defines it to be, and one record can span more than one block.
> Hope this clears it up.
> From: Sai Sai <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
> Subject: Re: Block vs FileSplit vs record vs line
> Sent: Thu, Mar 14, 2013 8:45:53 AM
> Just wondering if this is the right way to understand this:
> A large file is split into multiple blocks and each block is split into
> multiple file splits and each file split has multiple records and each
> record has multiple lines. Each line is processed by 1 instance of mapper.
> Any help is appreciated.