If your XML documents are really just lists of elements/objects, and
what you want to run your analytics on are subsets of those elements
(even across XML documents), then it makes sense to take a document
store approach similar to what the Wikipedia example has done. This
allows you to index specific portions of elements, create graphs and
apply visibility labels to specific attributes in a given object tree.
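
To make that concrete, here is a rough sketch of how one parsed element could be broken into Accumulo-style entries (row, column family, column qualifier, visibility, value). It is plain Python that only models the key layout and does not touch the Accumulo client API. The `Entry` type, the "field" column family, the `name` attribute used as the row id, and the "private" label are all assumptions made up for illustration. For multi-GB inputs you would stream records (for example from the MapReduce input format) rather than parse a whole string as done here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, NamedTuple


class Entry(NamedTuple):
    """Hypothetical stand-in for one Accumulo key/value pair."""
    row: str
    family: str
    qualifier: str
    visibility: str
    value: str


def element_to_entries(elem: ET.Element, row_id: str,
                       visibility_by_field: Dict[str, str]) -> Iterator[Entry]:
    """Emit one entry per non-empty child field of a record element."""
    for child in elem:
        text = (child.text or "").strip()
        if not text:
            continue
        # Per-field label; an empty string means no visibility restriction.
        visibility = visibility_by_field.get(child.tag, "")
        yield Entry(row_id, "field", child.tag, visibility, text)


def document_to_entries(xml_text: str, record_tag: str, id_attr: str,
                        visibility_by_field: Dict[str, str]) -> Iterator[Entry]:
    """Split a document on record_tag and decompose each record."""
    root = ET.fromstring(xml_text)  # fine for a sketch, not for 50 GB inputs
    for elem in root.iter(record_tag):
        row_id = elem.get(id_attr)
        if row_id is None:
            continue  # skip records that have no usable row id
        yield from element_to_entries(elem, row_id, visibility_by_field)


if __name__ == "__main__":
    sample = (
        '<records>'
        '<foo name="a1"><title>First</title><owner>alice</owner></foo>'
        '<foo name="b2"><title>Second</title><owner>bob</owner></foo>'
        '</records>'
    )
    for entry in document_to_entries(sample, "foo", "name", {"owner": "private"}):
        print(entry)
```

With a layout like this, a scan can fetch just the fields it needs for a given row instead of pulling out and re-parsing the whole XML chunk, and each field can carry its own label.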
On Wed, Jun 6, 2012 at 10:06 PM, David Medinets
<[EMAIL PROTECTED]> wrote:
> I can't think of any advantage to storing XML inside Accumulo. I am
> interested to learn some details about your view. Storing the
> extracted information and the location of the HDFS file that sourced
> the information does make sense to me. In fact, it might be useful to
> store file positions in Accumulo so it's easy to get back to specific
> spots in the XML file. For example, if you had an XML file with many
> records in it and there was no reason to immediately decompose each
> On Wed, Jun 6, 2012 at 9:57 PM, William Slacum <[EMAIL PROTECTED]> wrote:
>> There are advantages to using Accumulo to store the contents of your
>> XML documents, depending on their structure and what you want to end
>> up taking out of them. Are you trying to emulate the document store
>> pattern that the Wikipedia example uses?
>> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J <[EMAIL PROTECTED]> wrote:
>>> Hi, I am working with large chunks of XML, anywhere from 1 – 50 GB each. I am running several different MapReduce jobs on the XML to pull out various pieces of data, do analytics, etc. I am using an XML input type based on the WikipediaInputFormat from the examples. What I have been doing is:
>>> 1) loading the entire XML into HDFS as a single document
>>> 2) parsing the XML on some tag <foo> and storing each one of these instances as the content of a new row in Accumulo, using the name of the instance as the row id.
>>> I then run other MR jobs that scan this table, pull out and parse the XML and do whatever I need to do with the data.
>>> My question is, is there any advantage to storing the XML in Accumulo versus just leaving it in HDFS and parsing it from there, either as a large block of XML or as individual chunks, perhaps using Hadoop Archive to handle the small-file problem? The actual XML will not be queried in and of itself but is part of other analysis processes.
>>> Ralph Perko
>>> Pacific Northwest National Laboratory