Usually a large file in HDFS is split into blocks that are stored on different DataNodes, and a map task is assigned to process each block. What happens if a piece of structured data (e.g. a word) is split across two blocks? How do MapReduce and HDFS deal with this?
Thanks! Donal
-
Re: structured data split
Denny Ye 2011-11-11, 09:50
Hi, structured data such as a word or a line is always liable to be split across different blocks. A MapReduce task reads HDFS data in units of *lines*: it reads the whole line, from the end of the previous block into the start of the next one, to obtain the complete line record. So you do not need to worry about incomplete structured data. HDFS itself does nothing for this mechanism.
-Regards Denny Ye
On Fri, Nov 11, 2011 at 3:43 PM, 臧冬松 <[EMAIL PROTECTED]> wrote:
> Usually large file in HDFS is split into bulks and store in different
> DataNodes.
> A map task is assigned to deal with that bulk, I wonder what if the
> Structured data(i.e a word) was split into two bulks?
> How MapReduce and HDFS deal with this?
>
> Thanks!
> Donal
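The reading convention Denny describes can be sketched in a few lines. This is a simplified model in plain Python, not Hadoop's actual record reader: the names `split_ranges` and `read_split` are made up for illustration, and the whole "file" is held in memory as a byte string. The assumed rule is that a reader whose split does not start at offset 0 discards everything up to the first newline (the previous reader finishes that line), and every reader keeps reading past the end of its split until the line it has started is complete.

```python
def split_ranges(file_size: int, block_size: int) -> list[tuple[int, int]]:
    """Cut [0, file_size) into (offset, length) ranges on byte counts alone."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the complete lines owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: skip to just past the first newline.
        # The reader of the previous split is responsible for that line.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    # A line that begins at or before `end` belongs to this split, even if
    # most of its bytes live in the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


data = b"alpha\nbravo charlie\ndelta\necho\n"
for start, length in split_ranges(len(data), 8):
    print(start, read_split(data, start, length))
```

With 8-byte blocks the line "bravo charlie" straddles three blocks, yet it is returned exactly once, by the reader of the first split. The reader of the second split returns nothing because no line begins inside it.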
-
Re: structured data split
臧冬松 2011-11-11, 10:11
Thanks Denny! So that means each map task will have to read from another DataNode in order to read the last line of the previous block?
Cheers, Donal
-
Re: structured data split
Bejoy KS 2011-11-11, 11:01
Hi Donal,
You can configure your map tasks to process your input the way you like. If you have a 100 MB file, it would be divided into two blocks and stored in HDFS (if your dfs.block.size is the default 64 MB). How you process it with MapReduce is your choice:
- With the default TextInputFormat, the two blocks would be processed by two different mappers (under default split settings). If the blocks are on two different DataNodes, then in the best case one mapper would be spawned on each of those nodes, i.e. they are data-local map tasks.
- If you want one mapper to process the whole file, change your input format to WholeFileInputFormat. A mapper task would then be triggered on one of the nodes where the blocks are located (best case). If the two blocks are not on the same node, one of them would be transferred to the map task's location for processing.
Hope it helps!...
Thank You Bejoy.K.S
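The block arithmetic in the example above can be checked with a short sketch. It only models the byte-range bookkeeping; `block_ranges` is a hypothetical helper, not a Hadoop API, and it assumes one block per fixed-size range with only the last block allowed to be short.

```python
import math

MB = 1024 * 1024


def block_ranges(file_size: int, block_size: int) -> list[tuple[int, int]]:
    """Return the (offset, length) of each block of a file of `file_size` bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


blocks = block_ranges(100 * MB, 64 * MB)
print(len(blocks))                                   # 2
print([(off // MB, size // MB) for off, size in blocks])  # [(0, 64), (64, 36)]
print(math.ceil(100 / 64))                           # 2, the same count
```

The second block holds the remaining 36 MB. The cut at the 64 MB mark is chosen by byte count only, so whatever record happens to sit at that offset is divided between the two blocks.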
-
Re: structured data split
臧冬松 2011-11-11, 12:46
Thanks Bejoy! It's better to process the data blocks locally and separately. I just want to know how to deal with a structure (e.g. a word or a line) that is split across two blocks.
Cheers, Donal
-
Re: structured data split
bejoy.hadoop@... 2011-11-11, 13:25
Donal,
In Hadoop that hardly ever happens. When you store data in HDFS, normal files are split into blocks at line endings. You won't have half of a line in one block and the rest in the next one, so you don't need to worry about that. The case you mentioned is one of dependent data splits. Hadoop's massively parallel processing can be fully utilized only with independent data splits. When the splits are dependent at the file level, as I pointed out, you can go for WholeFileInputFormat.
Please reply if you are still confused. Also, if you have a specific scenario, please share it so that we can help you better understand how MapReduce would process it.
Hope it clarifies...
Regards Bejoy K S
-
Re: structured data split
Harsh J 2011-11-11, 13:54
Bejoy,
This is incorrect. As Denny explained earlier, blocks are split along byte sizes alone. The writer does not concern itself with newlines and such. When reading, the record readers align themselves to read until the end of a line, fetching data from the next block if they have to.
This is explained neatly under http://wiki.apache.org/Hadoop/MapReduceArch, para 2 of Map.
Regarding structured data such as XML, one can write a custom InputFormat that returns appropriate split points after scanning through the entire file pre-submit (say, by looking at tags).
However, if you want XML, there is already an XMLInputFormat available in Mahout. For reading N lines at a time, use NLineInputFormat.
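The idea of scanning a file before job submission and returning split points that fall on record boundaries can be sketched as follows. This is an illustration in plain Python rather than a Hadoop InputFormat: `tag_aligned_splits` is a made-up name, the file is an in-memory byte string, and the scan is a naive byte search that would be fooled by an opening tag inside a comment or CDATA section.

```python
def tag_aligned_splits(data: bytes, target: int,
                       open_tag: bytes) -> list[tuple[int, int]]:
    """Cut `data` into (offset, length) splits of roughly `target` bytes.

    Every split after the first begins exactly at an occurrence of
    `open_tag`, so no record is divided between two splits.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    cuts = [0]
    pos = target
    while pos < len(data):
        # Move the desired cut forward to the next record start.
        hit = data.find(open_tag, pos)
        if hit == -1:
            break
        cuts.append(hit)
        pos = hit + target
    cuts.append(len(data))
    return [(a, b - a) for a, b in zip(cuts, cuts[1:])]


data = b"<e>1</e><e>22</e><e>333</e><e>4</e>"
splits = tag_aligned_splits(data, 10, b"<e>")
print(splits)  # [(0, 17), (17, 10), (27, 8)]
for start, length in splits:
    print(data[start:start + length])
```

The trade-off is the one named in the message above: the whole file has to be read once up front to find the boundaries, in exchange for splits that each mapper can process without looking at its neighbours.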
-
Re: structured data split
Bejoy KS 2011-11-11, 14:38
Thanks, Harsh, for correcting me with that wonderful piece of information. It cleared up a wrong assumption of mine about HDFS storage fundamentals today.
Sorry, Donal, for confusing you over the same.
Harsh, it looks like the link is broken; it'd be great if you could post the URL once more.
Thanks a lot
Regards Bejoy.K.S
-
Re: structured data split
Harsh J 2011-11-11, 16:06
Sorry Bejoy, I'd typed that URL out from memory. Fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce
Harsh J
-
Re: structured data split
Bejoy KS 2011-11-11, 16:27
Thanks Harsh!
-
Re: structured data split
臧冬松 2011-11-11, 14:12
Hi Bejoy,
I don't understand why it's impossible to have half of a line in one block, since the file is split into fixed-size blocks.
My scenario is that I have lots of files from a High Energy Physics experiment. These files are in binary format, about 2 GB each, but basically they are composed of lots of "Events", and each Event is independent of the others. The physicists use a C++ program called ROOT to analyze these files and write the output to a result file (using open(), read(), write()). I'm considering how to store the files in HDFS and use MapReduce to analyze them. Tell me if that's not clear.
Cheers, Donal
-
Re: structured data split
Will Maier 2011-11-11, 14:26
Hi Donal-

May I ask which experiment you're working on? We run an HDFS cluster at one of the analysis centers for the CMS detector at the LHC. I'm not aware of anyone using Hadoop's MR for analysis, though about 10 PB of LHC data is now stored in HDFS. For your/our use case, I think that you would have to implement a domain-specific InputFormat yielding Events. ROOT files would be stored as-is in HDFS.

In CMS, we mostly run traditional HEP simulation and analysis workflows using plain batch jobs managed by common schedulers like Condor or PBS. These of course lack some of the features of the MR schedulers (like location awareness), but have some advantages. For example, we run Condor schedulers that transparently manage workflows of tens of thousands of jobs on dozens of heterogeneous clusters across North America.

Feel free to contact me off-list if you have more HEP-specific questions about HDFS.

Thanks!

--
Will Maier - UW High Energy Physics
cel: 608.438.6162
tel: 608.263.9692
web: http://www.hep.wisc.edu/~wcmaier/
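What "an InputFormat yielding Events" amounts to is a reader that hands the mapper one whole event at a time, whatever the underlying block layout. The sketch below shows that reading loop in plain Python. The record layout is purely hypothetical (a 4-byte big-endian length followed by the payload) and is not the ROOT file format; `write_event` and `read_events` are illustrative names, not part of any library.

```python
import io
import struct
from typing import BinaryIO, Iterator

# Hypothetical layout: each event is a 4-byte big-endian length plus payload.
HEADER = struct.Struct(">I")


def write_event(out: BinaryIO, payload: bytes) -> None:
    """Append one length-prefixed event to a binary stream."""
    out.write(HEADER.pack(len(payload)))
    out.write(payload)


def read_events(src: BinaryIO) -> Iterator[bytes]:
    """Yield complete event payloads until the stream is exhausted."""
    while True:
        head = src.read(HEADER.size)
        if not head:
            return  # clean end of stream
        if len(head) < HEADER.size:
            raise ValueError("truncated event header")
        (size,) = HEADER.unpack(head)
        payload = src.read(size)
        if len(payload) < size:
            raise ValueError("truncated event payload")
        yield payload


buf = io.BytesIO()
for payload in (b"muon", b"", b"jet-pair"):
    write_event(buf, payload)
buf.seek(0)
for event in read_events(buf):
    print(event)
```

A reader like this only works from a known record boundary, such as the start of the file. Starting in the middle of a file needs either precomputed boundaries or a marker the reader can search for.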
-
Re: structured data split
Charles Earl 2011-11-11, 14:42
Hi,
Please also feel free to contact me. I'm working with the STAR project at Brookhaven Lab, and we are trying to build an MR workflow for analysis of particle data. I've done some preliminary experiments running ROOT and other nuclear physics analysis software in MR and have been looking at various file layouts.
Charles
-
Re: structured data split
Bejoy KS 2011-11-11, 15:10
Hi Donal,
I don't have much exposure to the domain you are referring to, but from a plain MapReduce developer's point of view, this is how I would approach processing such a data format with MapReduce:
- If the data is flowing in continuously, I'd use Flume to collect the binary data, write it into sequence files, and load those into HDFS.
- If it is a large body of existing data, I'd use a sequence file writer to write the binary data as sequence files into HDFS, where HDFS would take care of the splits.
- I'd use SequenceFileInputFormat for my MapReduce job.
- If my application code is in a compatible language other than Java, I'd use the Streaming API to trigger my MapReduce job.

If there are any specific constraints on reading your data, then, as Will mentioned, you may need to write a custom InputFormat to process it.

Hope it helps!
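Bejoy's last point, using the Streaming API when the application code is not in Java, might look roughly like the mapper sketched below. Hadoop Streaming feeds a mapper its input records as lines on standard input and collects tab-separated key/value lines from standard output. Everything else here is an assumption made for illustration: the input line format (an event id, a tab, then the base64-encoded event payload), the names map_lines and run, and the placeholder "analysis" that only reports the payload size.

```python
import base64
import io


def map_lines(lines):
    """Turn 'event_id<TAB>base64(payload)' lines into 'event_id<TAB>size' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        event_id, encoded = line.split("\t", 1)
        payload = base64.b64decode(encoded)
        # Placeholder analysis: report the payload size in bytes.
        yield f"{event_id}\t{len(payload)}"


def run(stdin, stdout):
    """Mapper entry point; a real script would call run(sys.stdin, sys.stdout)."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")


# Local dry run with two fake events instead of real standard input.
sample_lines = [
    "evt-1\t" + base64.b64encode(b"\x00" * 12).decode("ascii") + "\n",
    "evt-2\t" + base64.b64encode(b"\x01\x02\x03").decode("ascii") + "\n",
]
result = io.StringIO()
run(sample_lines, result)
```

Keeping the per-line logic in a plain generator makes it possible to test the mapper locally, without a cluster, before handing it to a streaming job.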
-
Re: structured data split
臧冬松 2011-11-11, 15:57
Thanks Bejoy, that helps a lot!
-
Re: structured data split
臧冬松 2011-11-14, 08:32
Hi Charles,
Can you describe your MR workflow? Do you use MR for reconstruction, analysis, or simulation jobs? What's the layout of the input and output files: ROOT? NTuple? How do you split the input and merge the results?
Thanks!
Donal