We tried using the Hadoop streaming XML format a while ago and it didn't quite go as expected. I don't remember exactly why, but it gave some weird results: dropping some records, getting to 98% complete and then stalling, etc.
The Mahout project also has an XmlInputFormat [1] that we ended up using. I also posted something on my blog about it all [2], and a little about my understanding (so far) of input formats and record readers etc.
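As far as I understand it, the core idea of that kind of record reader is to scan the stream for a start tag and then buffer everything up to the matching end tag, emitting each buffered chunk as one record. Below is a minimal standalone sketch of that scanning idea. It is not the Mahout code: the class and method names (XmlRecordScanner, readUntilMatch, extractRecords) are made up for illustration, and it reads from a plain InputStream rather than a Hadoop input split, so split boundaries are not handled at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or Mahout.
public class XmlRecordScanner {

    // Reads bytes until `match` has been seen in full. When `out` is
    // non-null, every byte read is also copied into it. Returns false if
    // the stream ends first. This is a simple matcher, not a full KMP
    // search, which is enough for ordinary tags like "<doc>".
    public static boolean readUntilMatch(InputStream in, byte[] match,
                                         ByteArrayOutputStream out)
            throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) {
                return false;
            }
            if (out != null) {
                out.write(b);
            }
            if (b == match[i]) {
                i++;
                if (i >= match.length) {
                    return true;
                }
            } else {
                // Restart, allowing this byte to begin a new match.
                i = (b == match[0]) ? 1 : 0;
            }
        }
    }

    // Returns every chunk of the stream that starts with startTag and
    // ends with endTag, tags included.
    public static List<String> extractRecords(InputStream in,
                                              String startTag,
                                              String endTag)
            throws IOException {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        // Skip to the next start tag, then buffer up to the end tag.
        while (readUntilMatch(in, start, null)) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            buf.write(start, 0, start.length);
            if (!readUntilMatch(in, end, buf)) {
                break; // truncated record at end of stream: drop it
            }
            records.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<docs>\n<doc>\n  <id>1</id>\n</doc>\n"
                   + "<doc>\n  <id>2</id>\n</doc>\n</docs>";
        InputStream in =
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String record : extractRecords(in, "<doc>", "</doc>")) {
            System.out.println(record.replace("\n", " "));
        }
    }
}
```

In a real job the same loop would sit inside a RecordReader and would also have to stop at the end of its split, which this sketch leaves out. The nice property is that each record can span as many lines as it likes, so nothing has to be packed onto a single line first.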
Hope that helps,
Paul

[1] http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
[2] http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On 13 Jul 2010, at 12:26, Shuja Rehman wrote:
> Hi Khaled,
> XML files can be processed using Hadoop streaming. Check out the following
> link:
>
> http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>
> Regards
> Shuja
>
> On Tue, Jul 13, 2010 at 2:24 PM, edward choi <[EMAIL PROTECTED]> wrote:
>
>> Khaled,
>>
>> By default, Hadoop MapReduce reads input files line by line.
>> XML documents usually span more than one line.
>> So you will have to pack each XML document into a single line,
>> or write your own input format, for which you will want to consult a
>> guide book.
>>
>> 2010/7/13 Khaled BEN BAHRI <[EMAIL PROTECTED]>
>>
>>> Hello to all
>>>
>>> I'm new to MapReduce and I'm developing a MapReduce
>>> function that takes XML documents as input.
>>>
>>> How can I prepare the input files and specify them to the map function?
>>>
>>> Thanks for help
>>>
>>> Best regards
>>> Khaled
>>>
>>>
>>
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> _________________________________
> MS CS - School of Science and Engineering
> Lahore University of Management Sciences (LUMS)
> Sector U, DHA, Lahore, 54792, Pakistan
> Cell: +92 3214207445