-
Quick question | maha | 2011-02-18, 19:14
Hi all,
I want to check whether the following statement is right:

If I use TextInputFormat to process a text file with 2000 lines (each ending with \n) using 20 mappers, then each map will receive a sequence of COMPLETE LINES.

In other words, the input is split by lines, not byte-wise.

Is that right?

Thank you,
Maha
-
Re: Quick question | Ted Dunning | 2011-02-18, 19:25
The input is effectively split by lines, but under the covers the actual splits are by byte.

Each mapper scans from its specified start offset to the beginning of the next line after that point. At the end, it over-reads to the end of the line that is at or after the end of its specified region. This can make the last split a bit smaller than the others and the first a bit larger.

Practically speaking, however, your 2000-line file is extremely unlikely to be split at all because it is so small.
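
To make the boundary handling described above concrete, here is a minimal plain-Java sketch. It does not use Hadoop: the class `SplitLineDemo`, the method `readSplit`, and the sample data are made up for illustration, and the exact rule Hadoop's record reader applies may differ in detail between versions. The assumption here is that a line belongs to the split in which its first byte falls.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitLineDemo {

    /**
     * Returns the complete lines owned by the byte range [start, end).
     * Hypothetical helper, not a Hadoop API.
     */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Back up one byte and skip past the next newline. If the split
            // begins exactly at a line start, that line is kept; otherwise
            // the partial line is left to the previous split.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        // Read every line that STARTS before 'end', over-reading past 'end'
        // if needed to finish the last one.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // The byte boundary at offset 13 falls in the middle of "charlie".
        System.out.println(readSplit(data, 0, 13));  // [alpha, bravo, charlie]
        System.out.println(readSplit(data, 13, 26)); // [delta]
    }
}
```

Every line is read exactly once even though the byte boundary cuts a line in half: the first reader over-reads to finish "charlie", and the second reader skips the fragment it starts in.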
-
RE: Quick question | Jim Falgout | 2011-02-18, 19:55
That's right. The TextInputFormat handles situations where records cross split boundaries. What your mapper will see is "whole" records.
-
Re: Quick question | maha | 2011-02-18, 22:07
Thanks Ted and Jim :)
Maha
-
Re: Quick question | maha | 2011-02-20, 19:47
Hi again Jim and Ted,
I understood that each mapper would get a block of lines. But even though I had only 2 mappers for a 16-line input file with TextInputFormat, the map function was invoked once for each of those 16 lines!

I wanted a block of lines per map, something like map1 gets 8 lines and map2 gets 8 lines.

So, first question: is there a difference between Mappers and maps?

Second: does that mean I need to write my own InputFormat to make the InputSplit span multiple lines?

Thank you,
Maha
-
Re: Quick question | maha | 2011-02-20, 19:59
Actually, the following solved my problem, but I'm a little suspicious of the side effects of doing this instead of writing my own InputSplit of 5 lines.

    conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class); // each map task gets N lines
    conf.setInt("mapred.line.input.format.linespermap", 5); // N = 5 lines per map task

If you have any thoughts on whether this solution is worse than writing my own InputSplit of about 5 lines, let me know.

Thanks everyone!
Maha
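
A rough way to picture what the two settings above do to a 16-line file is the plain-Java simulation below. It is not NLineInputFormat itself: `NLineGroupingDemo` and `group` are hypothetical names, and the sketch simply assumes lines are grouped consecutively, N per split, with a shorter final split.

```java
import java.util.ArrayList;
import java.util.List;

class NLineGroupingDemo {

    /** Groups lines into consecutive chunks of at most linesPerSplit lines. */
    static List<List<String>> group(List<String> lines, int linesPerSplit) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            int to = Math.min(i + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, to)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 16; i++) {
            lines.add("line" + i);
        }
        List<List<String>> splits = group(lines, 5);

        // One map task per split, but still one map() call per line.
        int mapCalls = 0;
        for (List<String> split : splits) {
            mapCalls += split.size();
        }
        System.out.println(splits.size() + " map tasks, " + mapCalls + " map() calls");
        // prints "4 map tasks, 16 map() calls"
    }
}
```

Under these assumptions the setting changes how many lines each map task receives (5, 5, 5, and 1 here), not how often the map function runs.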
-
Re: Quick question | maha | 2011-02-20, 20:15
Yet the map function was still invoked 16 times, once per line, even with the splits produced by NLineInputFormat. I want the map function to be called once for the whole InputSplit of 5 lines, not once for each of the 16 lines.

Any ideas other than building my own InputFormat?

Thank you,
Maha
-
Re: Quick question | Ted Dunning | 2011-02-21, 06:22
On Sunday, February 20, 2011, maha <[EMAIL PROTECTED]> wrote:
> So first question: is there a difference between Mappers and maps ?

This is the most important thing that you have said. The map function is called once per unit of input, but the mapper object persists across many units of input.

You have a little bit of control over how many mapper objects there are, how many machines they are created on, and how many pieces your input is broken into. That control is limited, however, unless you build your own input format. The standard input formats are optimized for very large inputs and may not give you the flexibility that you want for your experiments.

That is unfortunate for the purpose of learning about Hadoop, but Hadoop is designed mostly for dealing with very large data and isn't usually designed to be easy to understand. Where easy coincides with powerful, easy is good, but powerful isn't always easy.
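
The distinction between the map function and the mapper object suggests one way to handle a whole block of lines at once: buffer inside the per-line call and do the real work when the object is closed. The sketch below is a plain-Java simulation of that lifecycle, not Hadoop code; `MapperLifecycleDemo`, `BlockMapper`, and `run` are hypothetical names. In Hadoop's old `org.apache.hadoop.mapred` API the corresponding hooks would be `map(...)` and `close()`, and wiring this into a real job is left as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MapperLifecycleDemo {

    /** Hypothetical stand-in for a mapper object; not a Hadoop interface. */
    static class BlockMapper {
        private final List<String> buffer = new ArrayList<>();
        private final List<String> output;

        BlockMapper(List<String> output) {
            this.output = output;
        }

        /** Called once per input line: only remember the line. */
        void map(String line) {
            buffer.add(line);
        }

        /** Called once after the last line of the split: process the block. */
        void close() {
            output.add(buffer.size() + " lines: " + String.join(",", buffer));
        }
    }

    /** Simulates the framework: one mapper object per split, one map() per line. */
    static List<String> run(List<List<String>> splits) {
        List<String> output = new ArrayList<>();
        for (List<String> split : splits) {
            BlockMapper mapper = new BlockMapper(output);
            for (String line : split) {
                mapper.map(line);
            }
            mapper.close();
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<String>> splits = new ArrayList<>();
        splits.add(Arrays.asList("a", "b", "c"));
        splits.add(Arrays.asList("d", "e"));
        for (String record : run(splits)) {
            System.out.println(record);
        }
        // prints "3 lines: a,b,c" then "2 lines: d,e"
    }
}
```

The map function still runs once per line, but each mapper object emits a single result for its whole split.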
-
RE: Quick question | Jim Falgout | 2011-02-21, 14:41
Your scenario matches the capability of NLineInputFormat exactly, so that looks to be the best solution. If you wrote your own input format, it would basically have to do what NLineInputFormat is already doing for you.
-
Re: Quick question | maha | 2011-02-21, 15:53
Thanks for your answers Ted and Jim :)
Maha
-
Re: Quick question | maha | 2011-02-21, 16:49
How, then, can I produce an output file per mapper, not per map task?

Thank you,
Maha