Hadoop, mail # user - [Input split] File manipulation


Re: [Input split] File manipulation
Jeff Zhang 2010-08-18, 01:37
The default split size is 64 MB, which is the block size, and you can
change it via configuration.
What is the file type of your input? If it is gzip, it cannot be split,
and you will always get only one map task.
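
For example, a minimal sketch using the newer org.apache.hadoop.mapreduce
API (assuming a 0.20-era Hadoop; the job name and paths are placeholders,
and property and method names vary between versions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitSizeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split-size-example");  // placeholder name

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Cap each split at 32 MB instead of the 64 MB block size.
        // Splits are still cut on block and record boundaries, so the
        // resulting sizes are approximate. The underlying property is
        // mapred.max.split.size in this era (renamed to
        // mapreduce.input.fileinputformat.split.maxsize later).
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

        // ... set mapper, reducer, output path, etc., then:
        // job.waitForCompletion(true);
      }
    }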
On Wed, Aug 18, 2010 at 12:03 AM, Erik Test <[EMAIL PROTECTED]> wrote:
> I'm expecting to come across millions of data points.
>
> Thanks for the response by the way. I thought that Hadoop set the number of
> splits, regardless of file size, to just 1 by default.
> Erik
>
>
> On 17 August 2010 11:44, Jeff Zhang <[EMAIL PROTECTED]> wrote:
>
>> What size is your input? If the input is large enough, you do not need
>> to worry about the splitting: only the last split has a different size;
>> all the other splits have the same size.
>>
>>
>>
>> On Tue, Aug 17, 2010 at 7:50 AM, Erik Test <[EMAIL PROTECTED]> wrote:
>> > Hello,
>> >
>> > I'm trying to determine how to split a file evenly so each map task
>> > has a similar workload. The input I will have is a list of coordinates
>> > like this:
>> >
>> > 2,8
>> > 3,9
>> > 4,10
>> > 5,7
>> > 6,2
>> > 7,3
>> > 8,1
>> > 9,0
>> > 10,4
>> >
>> > Since there are 9 inputs in this example, I would like to split the
>> > records so that there would be 3 map tasks.
>> >
>> > I've been looking into different text input format classes but I'm
>> > still not sure how to split the input file the way I would like to.
>> >
>> > Does anyone have advice or suggestions on how I can go about
>> > manipulating the input splits by specifying the number of lines in an
>> > input split?
>> >
>> > Erik
>> >
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
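
For splitting by a fixed number of lines per map task, as asked above,
NLineInputFormat puts N input lines into each split. A minimal sketch with
the old org.apache.hadoop.mapred API (the driver class and job name are
hypothetical, and the linespermap property name differs in later Hadoop
versions):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class CoordinateSplitJob {  // hypothetical driver class
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CoordinateSplitJob.class);
        conf.setJobName("coordinate-split-example");  // placeholder name

        // Each split gets exactly 3 lines, so the 9 coordinate lines above
        // become 3 splits and therefore 3 map tasks.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 3);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // The mapper sees (LongWritable offset, Text line) pairs like "2,8".
        // conf.setMapperClass(...); conf.setReducerClass(...);

        JobClient.runJob(conf);
      }
    }

Note that NLineInputFormat creates one split per N lines, so for millions
of data points you would want a much larger value than 3 to avoid launching
an enormous number of map tasks.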

--
Best Regards

Jeff Zhang