-Re: File block size use
Anna Lahoud 2012-10-02, 17:45
Chris - You are absolutely correct in what I am trying to accomplish -
decrease the number of files going to the maps. Admittedly, I haven't run
through all the suggestions yet today. I hope to do that by days' end.
Thank you and I will give an update later on what worked.
On Tue, Oct 2, 2012 at 12:12 AM, Chris Nauroth <[EMAIL PROTECTED]>wrote:
> Hello Anna,
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size. Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, reading a
> larger number of input records.
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation. In my case, I was typically
> working with TextInputFormat (not sequence files). I used IdentityMapper
> and a custom reducer that passed through all values but with key set to
> NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data. For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results. (This may or
> may not be true for your data set though.)
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final output.
> Also, the distribution of input records to reduce tasks is not truly
> random, and therefore the reduce output files may be uneven in size. This
> could be solved by writing NullWritable keys out of the map task instead of
> the reduce task and writing a custom implementation of Partitioner to
> distribute them randomly.
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size). I'm not aware of
> any built-in or external utilities that do this for you though.
> Hope this helps,
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <[EMAIL PROTECTED]> wrote:
>> I would like to be able to resize a set of inputs, already in
>> SequenceFile format, to be larger.
>> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
>> get what I expected. The outputs were exactly the same as the inputs.
>> I also tried running a job with an IdentityMapper and IdentityReducer.
>> Although that approaches a better solution, it still requires that I know
>> in advance how many reducers I need to get better file sizes.
>> I was looking at the SequenceFile.Writer constructors and noticed that
>> there are block size parameters that can be used. Using a writer
>> constructed with a 512MB block size, there is nothing that splits the
>> output and I simply get a single file the size of my inputs.
>> What is the current standard for combining sequence files to create
>> larger files for map-reduce jobs? I have seen code that tracks what it
>> writes into the file, but that seems like the long version. I am hoping
>> there is a shorter path.
>> Thank you.