|
hari708
2011-11-22, 01:20
Uma Maheswara Rao G
2011-11-22, 03:03
Uma Maheswara Rao G
2011-11-22, 03:08
Michael Segel
2011-11-22, 03:58
Inder Pall
2011-11-22, 04:01
Bejoy Ks
2011-11-22, 07:33
Steve Loughran
2011-11-22, 11:19
Joey Echeverria
2011-11-22, 11:20
Mridul Muralidharan
2011-11-22, 12:48
|
-
Regarding loading a big XML file to HDFS
hari708 2011-11-22, 01:20
Hi,
I have a big file consisting of XML data. The XML is not represented as a single line in the file. If we copy this file into a Hadoop directory using the ./hadoop dfs -put command, how does the distribution happen?
Basically, in my MapReduce program I am expecting a complete XML document as my input. I have a custom reader (CustomReader, for XML) in my MapReduce job configuration. My main confusion is this: if the NameNode distributes the data to DataNodes, there is a chance that part of the XML goes to one DataNode and the other half goes to another DataNode. If that is the case, will my custom XMLReader in the MapReduce job be able to combine it (as MapReduce reads data locally only)?
Please help me with this.
--
View this message in context: http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
-
RE: Regarding loading a big XML file to HDFS
Uma Maheswara Rao G 2011-11-22, 03:03
>From: hari708 [[EMAIL PROTECTED]]
>Sent: Tuesday, November 22, 2011 6:50 AM
>To: [EMAIL PROTECTED]
>Subject: Regarding loading a big XML file to HDFS

>Hi,
>I have a big file consisting of XML data. The XML is not represented as a
>single line in the file. If we copy this file into a Hadoop directory using
>the ./hadoop dfs -put command, how does the distribution happen?

HDFS will divide the file into blocks based on the block size configured for the file.

>Basically, in my MapReduce program I am expecting a complete XML document as
>my input. I have a custom reader (CustomReader, for XML) in my MapReduce job
>configuration. My main confusion is this: if the NameNode distributes the data
>to DataNodes, there is a chance that part of the XML goes to one DataNode and
>the other half goes to another DataNode. If that is the case, will my custom
>XMLReader in the MapReduce job be able to combine it (as MapReduce reads data
>locally only)?
>Please help me with this.

If you cannot do anything in parallel here, make your input split size cover the complete file size. Also configure the block size to cover the complete file size. In this case, only one mapper and reducer will be spawned for the file. But then you won't get any parallel-processing advantage.
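To make the size-based division concrete, here is a minimal sketch in plain Python. It does not call any HDFS API; the function name `block_layout` and the 96 MB / 64 MB figures are assumptions chosen for illustration. It only shows that block boundaries fall at fixed byte offsets, regardless of where an XML element starts or ends.

```python
MB = 1024 * 1024


def block_layout(file_size, block_size):
    """Return (offset, length) pairs for a file cut into blocks purely by size.

    Hypothetical helper for illustration; it models the arithmetic only.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# Assumed example sizes: a 96 MB file stored with a 64 MB block size.
for index, (offset, length) in enumerate(block_layout(96 * MB, 64 * MB)):
    print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

Under these assumed sizes the file occupies two blocks, and whatever bytes happen to sit around the 64 MB offset are cut there, mid-element or not.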
-
RE: Regarding loading a big XML file to HDFS
Uma Maheswara Rao G 2011-11-22, 03:08
Also, I am curious how you are writing the MapReduce application here. Map and reduce work with key-value pairs.
-
RE: Regarding loading a big XML file to HDFS
Michael Segel 2011-11-22, 03:58
Just wanted to address this:

> >Basically, in my MapReduce program I am expecting a complete XML document as
> >my input. I have a custom reader (CustomReader, for XML) in my MapReduce job
> >configuration. My main confusion is this: if the NameNode distributes the data
> >to DataNodes, there is a chance that part of the XML goes to one DataNode and
> >the other half goes to another DataNode. If that is the case, will my custom
> >XMLReader in the MapReduce job be able to combine it (as MapReduce reads data
> >locally only)?
> >Please help me with this.
>
> If you cannot do anything in parallel here, make your input split size cover the complete file size.
> Also configure the block size to cover the complete file size. In this case, only one mapper and reducer will be spawned for the file. But then you won't get any parallel-processing advantage.

You can do this in parallel.
You need to write a custom input format class. (Which is what you're already doing...)

Let's see if I can explain this correctly.
You have an XML record split across block A and block B.

Your MapReduce job will instantiate a task per block (more precisely, per input split, which by default corresponds to a block).
So in the mapper processing block A, you read and process the XML records. When you get to the last record, only part of which is in block A, mapper A continues on into block B, reads the rest of that record, and then stops.
In the mapper for block B, the reader skips data without processing it until it sees the start of a record. So you end up getting all of your XML records processed (no duplication), and it is done in parallel.

Does that make sense?

-Mike
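The reading rule described above can be sketched as a small simulation in plain Python. This is not Hadoop's record reader API: the `<record>` tag names and the helpers `read_split` and `make_splits` are hypothetical, and the sketch assumes records are flat (not nested) and that the tag text never appears inside CDATA sections or comments. Each reader owns exactly the records whose start tag begins inside its split, and it reads past the end of the split to finish the last one.

```python
START_TAG = "<record>"
END_TAG = "</record>"


def read_split(data, split_start, split_end):
    """Return the records whose start tag begins in [split_start, split_end).

    A record that begins in this split but ends in the next one is still read
    in full here; the next split's reader skips ahead to its first start tag.
    """
    records = []
    pos = data.find(START_TAG, split_start)
    while pos != -1 and pos < split_end:
        end = data.find(END_TAG, pos)
        if end == -1:
            break  # truncated trailing record: nothing complete to emit
        end += len(END_TAG)
        records.append(data[pos:end])
        pos = data.find(START_TAG, end)
    return records


def make_splits(total_length, split_size):
    """Cut [0, total_length) into consecutive fixed-size (start, end) ranges."""
    return [(start, min(start + split_size, total_length))
            for start in range(0, total_length, split_size)]


data = "".join(f"<record>{i}</record>\n" for i in range(5))
for start, end in make_splits(len(data), 30):
    print((start, end), read_split(data, start, end))
```

Because a start tag begins at exactly one offset, it falls in exactly one split, so each record is emitted by exactly one reader regardless of where the split boundaries land.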
-
Re: Regarding loading a big XML file to HDFS
Inder Pall 2011-11-22, 04:01
What about the records at skipped boundaries?
Instead, is there a way to define a custom splitter in Hadoop that understands record boundaries?

- Inder

On Tue, Nov 22, 2011 at 9:28 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
> In mapper for block B, the reader will skip and not process data until it
> sees the start of a record. So you end up getting all of your XML records
> processed (no duplication) and done in parallel.
-
Re: Regarding loading a big XML file to HDFS
Bejoy Ks 2011-11-22, 07:33
Hi All

I'm sharing my understanding here. Please correct me if I'm wrong (Uma and Michael).

The explanation by Michael is, I believe, how MapReduce programs commonly work. Take the case of an ordinary text file of size 96 MB. If my HDFS block size is 64 MB, this file is split across 2 blocks: block A (64 MB) and block B (32 MB). This splitting and storing in HDFS happens based on size alone, never on end-of-line characters. That means the last line of block A may not be completely in block A: part of it is in block A and the rest in block B. The file is stored in HDFS this way.

When we process the HDFS-stored file using MapReduce (say, using the default TextInputFormat), two mappers are spawned by the JobTracker (JT): mapper A and mapper B. Mapper A reads block A, and when it reaches the last line it does not find the line delimiter, so it keeps reading up to the first line delimiter in block B. Mapper B starts processing block B only after the first line delimiter. The mappers know from the offset whether the block they are reading is the first block or an intermediate block of a file: if the offset is 0, it is the first block of the file. Please add on if parameters other than the offset, such as some meta information, are considered as well. So we don't need a custom input format or record reader to get this default behavior of reading up to the end of a line/record.

Such processing hardly makes sense for complex XML, since XML is based entirely on parent-child relationships (it works well for simple XML with just one level of hierarchy). For example, consider the mock XML below:

<Vehicle>
  <Car>
    <BMW>
      <Sedan>
        <3-Series>
          <min-torque></min-torque>
--------------------------------------------------------------------
          <max-torque></max-torque>
        </3-Series>
      </Sedan>
      <SUV>
      </SUV>
    </BMW>
  </Car>
  <Truck>
  </Truck>
  <Bus>
  </Bus>
</Vehicle>

Even if we split it in between (even if the split happens at a line boundary), it would be hard to process, as the opening tags come in one block within one mapper's boundary and the closing tags come in another block within another mapper's boundary. So if we are mining some data from them, it hardly makes sense. We would need to incorporate logic here, in terms of regexes or similar, to identify the closing tags in the second block.

Maybe one query remains: why use MapReduce for XML if we can't exploit parallel processing?
- We can process multiple small XML files in parallel, one in each mapper, without splitting, to mine and extract some information for processing. But we lose a good deal of data locality here. There is a sample user-defined input format in Hadoop: The Definitive Guide called WholeFileInputFormat that serves this purpose.
- For larger XML files we have to consider processing the splits themselves in parallel. Hadoop provides a class for this, StreamXmlRecordReader, which can be used outside of streaming as well. For details, see http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html

Hope it helps!

Regards
Bejoy.K.S

On Tue, Nov 22, 2011 at 9:31 AM, Inder Pall <[EMAIL PROTECTED]> wrote:
> What about the records at skipped boundaries?
> Instead, is there a way to define a custom splitter in Hadoop that
> understands record boundaries?
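The line-oriented behaviour described in this message can be sketched as a simulation in plain Python. It is a simplification written for illustration, not Hadoop's actual line reader: the function `read_lines_for_split` is hypothetical, and the ownership convention (a reader whose split does not start at offset 0 discards everything up to and including the first line delimiter, and every reader may read one line that starts at or before its split end) is an assumption of this sketch.

```python
def read_lines_for_split(data, start, end):
    """Return the lines owned by the split [start, end) of the string `data`.

    Assumed convention: a line belongs to the split containing its first
    character, except that a line starting exactly at a non-zero split start
    is left to the previous reader.
    """
    pos = start
    if start != 0:
        # Not the first split: skip the line in progress, which the previous
        # reader finishes by reading past its own split end.
        newline = data.find("\n", start)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    while pos <= end and pos < len(data):
        newline = data.find("\n", pos)
        if newline == -1:
            lines.append(data[pos:])  # final line without a trailing delimiter
            pos = len(data)
        else:
            lines.append(data[pos:newline])
            pos = newline + 1
    return lines


text = "alpha\nbravo\ncharlie\ndelta\n"
split_size = 8  # assumed size, deliberately cutting through the middle of lines
for start in range(0, len(text), split_size):
    end = min(start + split_size, len(text))
    print((start, end), read_lines_for_split(text, start, end))
```

Concatenating the per-split results in order reproduces every line exactly once, which is why no custom reader is needed for plain line-delimited input.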
-
Re: Regarding loading a big XML file to HDFS
Steve Loughran 2011-11-22, 11:19
On 22/11/11 07:33, Bejoy Ks wrote:

> Such a processing would hardly make sense while processing
> complex xmls as xmls are based fully on parent child relation ship. (it
> would work well for simple XMLs just having one level of hirearchy).

That is, provided nobody is doing XML namespace declarations:

<m1:vehicle xmlns:m1="uri:model1" xmlns="uri:model2">
  <car>
  ...
  </car>
</m1:vehicle>

In such a world the vehicle element name is the tuple ("uri:model1", "vehicle") but that of the nested element is ("uri:model2", "car").

The way XML namespace handling is done implies that the entire parent tree needs to be parsed before you can be confident of the namespace to which an XML element and its attributes belong.

> Say for example consider the mock XML like below
>
> <Vehicle>
>   <Car>
>     <BMW>
>       <Sedan>
>         <3-Series>
>           <min-torque></min-torque>
> --------------------------------------------------------------------
>           <max-torque></max-torque>
>         </3-Series>
>       </Sedan>
>       <SUV>
>       </SUV>
>     </BMW>
>   </Car>
>   <Truck>
>   </Truck>
>   <Bus>
>   </Bus>
> </Vehicle>
>
> Even if we split it in between(even if split happens at a line boundary)
> it would be hard to process as the opening tags come in one block under one
> mapper's boundary and the closing tags come in another block under another
> mapper's boundary. So if we are mining some data from them it hardly makes
> sense.

Most record scans pull in a bit of trailing data from the next block; it's generally not very much and not worth worrying about. Collect some data on average record length and assume that as your usual over-read.

> We need to incorporate the logic in here interns of regex or so to
> identify the closing tags from second block,

Regexps invariably contain assumptions about the encoding of content within the XML document, break if the document is encoded in UTF-16 or something else, and are still namespace-brittle.

> May be one query remains, why use map reduce for XML if we can't exploit
> parallel processing?

Why use XML for your persistent format if you can only parse it through a (stateful) recursive process, so limiting you to the bandwidth of your parser accessing a single file?

> - We can process multiple small xml files in parallel one in each mapper
> without splitting to mine and extract some information for processing. But
> we lose a good extent of data locality here.

no, you aggregate lots of small XML records into a HAR
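The namespace point can be checked with a short sketch using Python's standard `xml.etree.ElementTree`, which reports each element name as `{namespace-uri}local-name`. The document string is a well-formed variant of the vehicle/car example with the prefix `m1` bound to `uri:model1`, and the helper `qualified_names` is hypothetical. It illustrates that a fragment cut out of the middle of a file loses the declarations made on its ancestors.

```python
import xml.etree.ElementTree as ET

DOC = '<m1:vehicle xmlns:m1="uri:model1" xmlns="uri:model2"><car/></m1:vehicle>'


def qualified_names(xml_text):
    """Return the namespace-qualified tag of every element, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


# Parsed as a whole, each element resolves against the declarations on the root.
print(qualified_names(DOC))

# The same <car/> element parsed in isolation has no namespace at all.
print(qualified_names("<car/>"))

# A prefixed element cut away from its declaring ancestor cannot be parsed.
try:
    qualified_names("<m1:wheel/>")
except ET.ParseError as error:
    print("fragment rejected:", error)
```

A reader that starts at an arbitrary split offset sees only fragments like the last two, so it cannot recover the qualified names without the enclosing context.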
-
Re: Regarding loading a big XML file to HDFS
Joey Echeverria 2011-11-22, 11:20
If your file is bigger than the block size (typically 64 MB or 128 MB), it will be split into more than one block. The blocks may or may not be stored on different DataNodes. If you're using a default InputFormat, the input will be split across more than one task. Since you said you need the whole file in order to process it, you should either use a custom InputFormat that doesn't split, or use something like WholeFileInputFormat, which returns the whole file as a single record.

-Joey

On Nov 21, 2011, at 20:20, hari708 <[EMAIL PROTECTED]> wrote:
> Basically in My mapreduce program i am expecting a complete XML as my
> input.
-
Re: Regarding loading a big XML file to HDFS
Mridul Muralidharan 2011-11-22, 12:48
You cannot determine the start of an XML document within a collection of XML documents (in the DFS file) if you start at some arbitrary point within the collection (unless some data-specific hints are used).

Regards,
Mridul

On Tuesday 22 November 2011 09:28 AM, Michael Segel wrote:
> In mapper for block B, the reader will skip and not process data until it
> sees the start of a record. So you end up getting all of your XML records
> processed (no duplication) and done in parallel.
|
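A small sketch of why scanning for a start tag from an arbitrary offset is unreliable without a data-specific hint. It is plain Python; the `<doc>` tag, the sample data, and both helper functions are hypothetical. The hint assumed here is that every document occupies exactly one line, which is a property of this made-up data and not something XML guarantees.

```python
import xml.etree.ElementTree as ET

DATA = (
    "<doc><title>first</title></doc>\n"
    "<doc><note><![CDATA[looks like <doc> but is only text]]></note></doc>\n"
)


def naive_document_starts(data, start_tag="<doc>"):
    """Offsets of every literal occurrence of start_tag, wherever it appears."""
    offsets = []
    pos = data.find(start_tag)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(start_tag, pos + 1)
    return offsets


def hinted_document_starts(data):
    """Offsets of document starts, using the assumed one-document-per-line hint."""
    offsets = []
    pos = 0
    for line in data.splitlines(keepends=True):
        ET.fromstring(line)  # raises ParseError if the line is not a whole document
        offsets.append(pos)
        pos += len(line)
    return offsets


print("naive scan: ", naive_document_starts(DATA))
print("with a hint:", hinted_document_starts(DATA))
```

The naive scan reports one extra "start" inside the CDATA section, so a reader that begins just before that offset would start parsing in the middle of a document.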