-
[Input split] File manipulation
Erik Test 2010-08-17, 14:50
Hello,
I'm trying to determine how to split a file evenly so each map task has a similar work load. The input I will have is a list of coordinates like this:
2,8
3,9
4,10
5,7
6,2
7,3
8,1
9,0
10,4
Since there are 9 inputs in this example, I would like to split the records so that there would be 3 map tasks.
I've been looking into different text input format classes but I'm still not sure how to split the input file the way I would like to.
Does anyone have advice or suggestions on how I can control the input splits by specifying the number of lines in each split?
Erik
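
A minimal sketch of the grouping asked about above: cutting the input into splits of a fixed number of lines, so that 9 records at 3 lines per split give 3 map tasks. It is plain Java and does not use Hadoop; the class and method names (`LineSplitSketch`, `splitByLines`) are made up for illustration. Hadoop does ship an `NLineInputFormat` class intended for line-count-based splits, which is probably the place to look, but how it is configured depends on the Hadoop version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-alone illustration (no Hadoop): group records into splits of N lines. */
class LineSplitSketch {

    /** Returns consecutive groups of at most linesPerSplit records. */
    static List<List<String>> splitByLines(List<String> records, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < records.size(); start += linesPerSplit) {
            // The last group may be shorter when the count is not a multiple of N.
            int end = Math.min(start + linesPerSplit, records.size());
            splits.add(new ArrayList<>(records.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "2,8", "3,9", "4,10", "5,7", "6,2", "7,3", "8,1", "9,0", "10,4");
        List<List<String>> splits = splitByLines(records, 3);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split " + i + ": " + splits.get(i));
        }
    }
}
```

With the nine coordinates from the question and three lines per split, each of the three groups holds exactly three records, which is the even workload being asked for.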
-
Re: [Input split] File manipulation
Jeff Zhang 2010-08-17, 15:44
How large is your input? If it is large enough, you do not need to worry about splitting: every split has the same size except the last one, which may be smaller.
-- Best Regards
Jeff Zhang
-
Re: [Input split] File manipulation
Erik Test 2010-08-17, 16:03
I'm expecting to come across millions of data points.
Thanks for the response, by the way. I thought that Hadoop set the number of splits, regardless of file size, to just 1 by default.
Erik
-
Re: [Input split] File manipulation
Jeff Zhang 2010-08-18, 01:37
The default split size is 64 MB, which is the block size, and you can change it through configuration. What file type is your input file? If it is gz, it cannot be split, and you will always get only one mapper task.
-- Best Regards
Jeff Zhang
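
To make the arithmetic behind the 64 MB default concrete, here is a rough sketch of how a split size and split count could be derived. It assumes the rule is "block size, clamped between a configured minimum and maximum", which is my understanding of how Hadoop's `FileInputFormat` picks the size; the class name `SplitSizeSketch`, the method names, and the example file sizes are invented for illustration, and the real implementation has details (such as tolerance for a small trailing remainder) that are ignored here.

```java
/** Stand-alone illustration (no Hadoop) of split size and split count arithmetic. */
class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    /** Assumed rule: the block size, clamped to the range [minSize, maxSize]. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits for a splittable file: ceiling of fileLength / splitSize. */
    static long countSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;

        // Default limits: the split size equals the block size.
        long defaultSplit = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        System.out.println(countSplits(200 * MB, defaultSplit)); // 4 (64 + 64 + 64 + 8 MB)

        // A tiny test file fits in one split, so only one map task runs.
        System.out.println(countSplits(50, defaultSplit)); // 1

        // Lowering the maximum split size to 1 MB yields many more splits.
        long smallSplit = computeSplitSize(blockSize, 1, 1 * MB);
        System.out.println(countSplits(200 * MB, smallSplit)); // 200
    }
}
```

This also shows why a small test file appears to produce a single split: anything under one block fits in one split unless the maximum split size is lowered.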
-
Re: [Input split] File manipulation
Erik Test 2010-08-18, 05:53
I expect it to always be a text file. Now that you mention the default split size, I think I should lower it so I can at least test what I'm trying to do with distance calculations on a smaller scale to see how it works.
Erik