Accumulo >> mail # user >> XML Storage - Accumulo or HDFS

Perko, Ralph J 2012-06-06, 20:20
William Slacum 2012-06-07, 01:57
David Medinets 2012-06-07, 02:06
Re: XML Storage - Accumulo or HDFS
If your XML documents are really just lists of elements/objects, and
what you want to run your analytics on are subsets of those elements
(even across XML documents), then it makes sense to take a document
store approach similar to what the Wikipedia example has done. This
allows you to index specific portions of elements, create graphs and
apply visibility labels to specific attributes in a given object tree.
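The document-store layout William describes can be sketched without any Accumulo dependency. This is a hypothetical illustration only (the class and field names are not from the Wikipedia example): one row per document, the element path as the column family, the attribute as the column qualifier, and a visibility expression attached per attribute so labels apply at the finest grain.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a per-attribute, visibility-labeled document-store entry,
// loosely modeled on the Wikipedia example's key layout. All names here are
// hypothetical; in real Accumulo code each DocEntry would become one
// Mutation.put(family, qualifier, visibility, value) call.
public class DocEntry {
    public final String rowId;       // document id -> Accumulo row
    public final String elementPath; // e.g. "foo/bar" -> column family
    public final String attribute;   // attribute name -> column qualifier
    public final String visibility;  // e.g. "PUBLIC|ANALYST"
    public final String value;

    public DocEntry(String rowId, String elementPath, String attribute,
                    String visibility, String value) {
        this.rowId = rowId;
        this.elementPath = elementPath;
        this.attribute = attribute;
        this.visibility = visibility;
        this.value = value;
    }

    // Flatten one parsed object tree into per-attribute entries, so each
    // attribute carries its own visibility label independently.
    public static List<DocEntry> label(String docId, String path,
                                       String[][] attrs, String visibility) {
        List<DocEntry> out = new ArrayList<>();
        for (String[] kv : attrs) {
            out.add(new DocEntry(docId, path, kv[0], visibility, kv[1]));
        }
        return out;
    }
}
```

Keeping one key-value pair per attribute (rather than one blob per document) is what makes element-level indexing and per-attribute visibility possible.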

On Wed, Jun 6, 2012 at 10:06 PM, David Medinets wrote:
> I can't think of any advantage to storing XML inside Accumulo. I am
> interested to learn some details about your view. Storing the
> extracted information and the location of the HDFS file that sourced
> the information does make sense to me. In fact, it might be useful to
> store file positions in Accumulo so it's easy to get back to specific
> spots in the XML file. For example, if you had an XML file with many
> records in it and there was no reason to immediately decompose each
> record.
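David's file-position idea can be sketched as two steps: one pass that records the (offset, length) of each record so those spans can be stored in Accumulo alongside the HDFS path, and a later seek that jumps straight back to one record without re-parsing the file. The sketch below is hypothetical and uses a local file via `RandomAccessFile`; on HDFS the analogous call would be `FSDataInputStream.seek()`. The tag names and class name are assumptions for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: index byte positions of records in a large XML file, then seek
// back to a single record using a stored (offset, length) pair.
public class OffsetIndex {

    // One pass over the file: record (offset, length) for each
    // startTag..endTag span. Assumes ASCII tags, so character offsets
    // equal byte offsets; a real job would track byte positions directly.
    public static List<long[]> index(Path file, String startTag, String endTag)
            throws IOException {
        String text = Files.readString(file);
        List<long[]> spans = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) break;
            int end = text.indexOf(endTag, start) + endTag.length();
            spans.add(new long[] { start, end - start });
            from = end;
        }
        return spans;
    }

    // Later: jump straight to one record using a stored (offset, length).
    public static String fetch(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

Storing just `(hdfsPath, offset, length)` per record in Accumulo keeps the table small while still making any individual record in a 50 GB file addressable.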
> On Wed, Jun 6, 2012 at 9:57 PM, William Slacum <[EMAIL PROTECTED]> wrote:
>> There are advantages to using Accumulo to store the contents of your
>> XML documents, depending on their structure and what you want to end
>> up taking out of them. Are you trying to emulate the document store
>> pattern that the Wikipedia example uses?
>> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J <[EMAIL PROTECTED]> wrote:
>>> Hi,  I am working with large chunks of XML, anywhere from 1 – 50 GB each.  I am running several different MapReduce jobs on the XML to pull out various pieces of data, do analytics, etc.  I am using an XML input type based on the WikipediaInputFormat from the examples.  What I have been doing is 1) loading the entire XML into HDFS as a single document 2) parsing the XML on some tag <foo> and storing each one of these instances as the content of a new row in Accumulo, using the name of the instance as the row id.  I then run other MR jobs that scan this table, pull out and parse the XML and do whatever I need to do with the data.
>>> My question is, is there any advantage to storing the XML in Accumulo versus just leaving it in HDFS and parsing it from there?  Either as a large block of XML or as individual chunks, perhaps using Hadoop Archive to handle the small-file problem?  The actual XML will not be queried in and of itself but is part of other analysis processes.
>>> Thanks,
>>> Ralph
>>> __________________________________________________
>>> Ralph Perko
>>> Pacific Northwest National Laboratory
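Step 2 of the workflow Ralph describes (split the XML on some tag, use an instance name as the row id, store each chunk as a row value) can be sketched in a few lines. This is a hypothetical illustration, not the WikipediaInputFormat code: the element is called `<record>` here and an assumed `name` attribute stands in for the instance name used as the row id.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: split a large XML blob on a repeated element and collect one
// chunk per Accumulo row, keyed by a hypothetical name attribute. A real
// MapReduce job would do this per input split rather than in memory.
public class XmlChunker {
    private static final Pattern RECORD =
        Pattern.compile("<record\\s+name=\"([^\"]+)\">.*?</record>", Pattern.DOTALL);

    // Returns rowId -> raw XML chunk, each ready to be written as the
    // value of one new row.
    public static Map<String, String> chunk(String xml) {
        Map<String, String> rows = new LinkedHashMap<>();
        Matcher m = RECORD.matcher(xml);
        while (m.find()) {
            rows.put(m.group(1), m.group(0));
        }
        return rows;
    }
}
```

Whether rows like these belong in Accumulo at all, rather than as chunks in HDFS, is exactly the trade-off the replies above debate.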
Josh Elser 2012-06-07, 02:50
David Medinets 2012-06-07, 11:29
Josh Elser 2012-06-07, 12:06
Perko, Ralph J 2012-06-07, 14:48
Josh Elser 2012-06-08, 01:57
Eric Newton 2012-06-08, 02:27
Josh Elser 2012-06-08, 02:30
Perko, Ralph J 2012-06-08, 14:47