|
Perko, Ralph J
2012-06-06, 20:20
William Slacum
2012-06-07, 01:57
David Medinets
2012-06-07, 02:06
William Slacum
2012-06-07, 02:19
Josh Elser
2012-06-07, 02:50
David Medinets
2012-06-07, 11:29
Josh Elser
2012-06-07, 12:06
Perko, Ralph J
2012-06-07, 14:48
Josh Elser
2012-06-08, 01:57
Eric Newton
2012-06-08, 02:27
Josh Elser
2012-06-08, 02:30
Perko, Ralph J
2012-06-08, 14:47
|
-
XML Storage - Accumulo or HDFS
Perko, Ralph J 2012-06-06, 20:20

Hi,

I am working with large chunks of XML, anywhere from 1 – 50 GB each. I am running several different MapReduce jobs on the XML to pull out various pieces of data, do analytics, etc. I am using an XML input type based on the WikipediaInputFormat from the examples. What I have been doing is:

1) loading the entire XML into HDFS as a single document
2) parsing the XML on some tag <foo> and storing each of these instances as the content of a new row in Accumulo, using the name of the instance as the row id

I then run other MR jobs that scan this table, pull out and parse the XML, and do whatever I need to do with the data.

My question is: is there any advantage to storing the XML in Accumulo versus just leaving it in HDFS and parsing it from there, either as a large block of XML or as individual chunks, perhaps using Hadoop Archive to handle the small-file problem? The actual XML will not be queried in and of itself but is part of other analysis processes.

Thanks,
Ralph
__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory
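For readers following along, step 2 above (splitting on a tag and keying each instance by its name) could be sketched roughly as follows. This is only an illustration in plain Python, not the WikipediaInputFormat or any Accumulo API; the function name iter_records, the <name> child used as the row id, and the sample document are assumptions made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag, id_child):
    """Yield (row_id, xml_text) for each <tag> element while streaming the input.

    `tag` and `id_child` are hypothetical names chosen by the caller; the
    thread only says the split happens "on some tag <foo>".
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            row_id = elem.findtext(id_child, default="").strip()
            elem.tail = None  # drop whitespace that follows the element
            xml_text = ET.tostring(elem, encoding="unicode")
            yield row_id, xml_text
            elem.clear()  # free the element's children and text after use


# Tiny stand-in for a multi-gigabyte file (assumed structure).
SAMPLE = (
    b"<docs>"
    b"<foo><name>a</name><v>1</v></foo>"
    b"<foo><name>b</name><v>2</v></foo>"
    b"</docs>"
)

records = list(iter_records(io.BytesIO(SAMPLE), "foo", "name"))
for row_id, xml_text in records:
    print(row_id, xml_text)
```

Each yielded pair corresponds to one row (row id plus the XML chunk as the value); writing it to a table or to HDFS is left out on purpose.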
-
Re: XML Storage - Accumulo or HDFS
William Slacum 2012-06-07, 01:57

There are advantages to using Accumulo to store the contents of your XML documents, depending on their structure and what you want to end up taking out of them. Are you trying to emulate the document store pattern that the Wikipedia example uses?
-
Re: XML Storage - Accumulo or HDFS
David Medinets 2012-06-07, 02:06

I can't think of any advantage to storing XML inside Accumulo. I am interested to learn some details about your view. Storing the extracted information and the location of the HDFS file that sourced the information does make sense to me. In fact, it might be useful to store file positions in Accumulo so it's easy to get back to specific spots in the XML file, for example if you had an XML file with many records in it and there was no reason to immediately decompose each record.
-
Re: XML Storage - Accumulo or HDFS
William Slacum 2012-06-07, 02:19

If your XML documents are really just lists of elements/objects, and what you want to run your analytics on are subsets of those elements (even across XML documents), then it makes sense to take a document store approach similar to what the Wikipedia example has done. This allows you to index specific portions of elements, create graphs and apply visibility labels to specific attributes in a given object tree.
-
Re: XML Storage - Accumulo or HDFS
Josh Elser 2012-06-07, 02:50

+1, Bill. Assuming you aren't doing anything crazy in your XML files, the wikipedia example should get you pretty far. That being said, the structure used in the wikipedia example doesn't handle large lists of elements. Short explanation: an attribute of a document is stored as one key-value pair, so if you have a lot of large lists, you inflate the key, which does bad things. With that in mind, there are small changes you can make to the table structure to store those lists more efficiently and still maintain the semantic representation (Bill's graph comment).

David, ignoring any issues of data locality of the blocks in your large XML files, storing byte offsets into a hierarchical data structure (XML) seems like a sub-optimal solution to me. Aside from losing the hierarchy knowledge, if you have a skewed distribution of elements in the XML document, you can't get good locality in your query/analytic. What was your idea behind storing the offsets?

- Josh
-
Re: XML Storage - Accumulo or HDFS
David Medinets 2012-06-07, 11:29

On Wed, Jun 6, 2012 at 10:50 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> Aside from losing the hierarchy knowledge, if you have a skewed
> distribution of elements in the XML document, you can't get good
> locality in your query/analytic. What was your idea behind storing
> the offsets?

<RECORDS>
  <RECORD>
    <KEY_FIELD/>
    <TAG/>
  </RECORD>
  <RECORD>
    <KEY_FIELD/>
    <TAG/>
  </RECORD>
</RECORDS>

My XML looks like that. I don't know how the information in the XML will be used in the future and I don't want to re-scan large numbers of XML files to find a single record. For example, yesterday we found a potential bug. My bug analysis showed the source data was in record X of 450,000 records. Since I know which XML file held that record, I was able to get that file locally and use command-line tools to find surrounding information. My XML file might have 200 tags but normally I only need 45 of them. My XML is without hierarchy.
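The offset idea described above (knowing which file holds record X and jumping straight to it) might look something like the sketch below. It is an assumption-heavy illustration: it scans an in-memory byte string with a regular expression, which only works for flat, non-nested <RECORD> elements, and the names build_offset_index and read_record are invented for the example. A real multi-gigabyte file would need a streaming scan, and where the (start, length) pairs get stored is left open.

```python
import io
import re

# Matches one flat, non-nested RECORD element (assumed structure).
RECORD_RE = re.compile(rb"<RECORD>.*?</RECORD>", re.DOTALL)


def build_offset_index(data):
    """Return a list of (start, length) byte spans, one per RECORD, in file order."""
    return [(m.start(), m.end() - m.start()) for m in RECORD_RE.finditer(data)]


def read_record(stream, start, length):
    """Seek to a stored offset and read back exactly one record."""
    stream.seek(start)
    return stream.read(length)


SAMPLE = (
    b"<RECORDS>\n"
    b"<RECORD><KEY_FIELD>k1</KEY_FIELD><TAG>a</TAG></RECORD>\n"
    b"<RECORD><KEY_FIELD>k2</KEY_FIELD><TAG>b</TAG></RECORD>\n"
    b"</RECORDS>\n"
)

index = build_offset_index(SAMPLE)
second = read_record(io.BytesIO(SAMPLE), *index[1])
print(index)
print(second.decode("ascii"))
```

The list position doubles as the record number, so "record X" maps directly to index[X].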
-
Re: XML Storage - Accumulo or HDFS
Josh Elser 2012-06-07, 12:06

So, if your XML looks like the snippet you posted, it's extremely easy to fetch records based on the KEY_FIELD or TAG element. A (relatively) flat XML document is rather trivial to map into the wikipedia example. As I was saying previously, it gets trickier when you have a deep, or a deep and wide, structure.

If your requirement is to find the /n/ RECORDs before and after a given RECORD, then yes, the wiki example wouldn't make much sense; however, you could add an attribute to each RECORD to denote positional information in the original file, which would alleviate this problem.

From an application standpoint, it usually doesn't make sense to index documents purely off of their positional information in the source data (as you suggested with the byte file offsets) because that's not how you're going to want to query it in your application. I would assume you'd want to be querying off of KEY_FIELD or TAG.

- Josh
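One way to read the suggestion above (key each RECORD by KEY_FIELD and carry its original position as an extra attribute) is sketched below. The tuples stand in for (row, column family, column qualifier, value) entries and are not Accumulo API calls; the family names "field" and "meta", the zero-padded position format, and both function names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def record_entries(xml_text):
    """Return sorted (row, family, qualifier, value) tuples for a flat RECORDS document.

    Each record is keyed by its KEY_FIELD and gets one extra "position" entry
    recording where it appeared in the source file.
    """
    entries = []
    root = ET.fromstring(xml_text)
    for position, record in enumerate(root.iter("RECORD")):
        row = record.findtext("KEY_FIELD")
        for child in record:
            entries.append((row, "field", child.tag, child.text or ""))
        entries.append((row, "meta", "position", f"{position:09d}"))
    return sorted(entries)


def neighbors(entries, key, n):
    """Return the row ids of up to n records before and after `key`, in file order."""
    pos_by_row = {
        row: int(value)
        for row, family, qualifier, value in entries
        if (family, qualifier) == ("meta", "position")
    }
    row_by_pos = {pos: row for row, pos in pos_by_row.items()}
    center = pos_by_row[key]
    return [
        row_by_pos[p]
        for p in range(center - n, center + n + 1)
        if p in row_by_pos and p != center
    ]


SAMPLE = (
    "<RECORDS>"
    "<RECORD><KEY_FIELD>k1</KEY_FIELD><TAG>a</TAG></RECORD>"
    "<RECORD><KEY_FIELD>k2</KEY_FIELD><TAG>b</TAG></RECORD>"
    "<RECORD><KEY_FIELD>k3</KEY_FIELD><TAG>c</TAG></RECORD>"
    "</RECORDS>"
)

entries = record_entries(SAMPLE)
print(neighbors(entries, "k2", 1))
```

The lookup here walks every entry; in a real table the position-to-key mapping would more likely live in its own index, which this sketch does not attempt.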
-
Re: XML Storage - Accumulo or HDFS
Perko, Ralph J 2012-06-07, 14:48

My use-case is very similar to the Wikipedia example. I'm not sure what you mean by the inflated key. Can you expand on that? I am not really pulling out individual elements/attributes to simply store them apart from the XML. Any element I pull out is part of a larger analytic process and it is this result I store. I am doing some graph work based on relationships between elements.

Example:

<books>
  <book>
    <title>basket weaving</title>
    <author>bob</author>
    <toc>…</toc>
    <chapter number="1">lots of text here</chapter>
    <chapter number="2">even more text here</chapter>
    <citation>another book</citation>
  </book>
</books>

Each "book" is a record. The book title is the row id. The content is the XML <book>..</book>

My table then has other columns such as "word count" or "character count".

Table example:

Row: basket weaving
Col family: content
Col qual: xml
Value: <book>…</book>

Row: basket weaving
Col family: metrics
Col qual: word count
Value: 12345

Row: basket weaving
Col family: cites
Col qual: another book
Value: -- nothing meaningful

Row: another book
Col family: cited by
Col qual: basket weaving
Value: -- nothing meaningful

I use the "cites" and "cited by" qualifiers for graphs.
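As a rough illustration of the table layout described above, the sketch below turns one <book> into (row, column family, column qualifier, value) tuples, including the paired "cites" / "cited by" edges. It is plain Python, not Accumulo client code; the function name book_entries is made up, and counting only chapter text for "word count" is an assumption, since the thread does not say how that metric is computed.

```python
import xml.etree.ElementTree as ET


def book_entries(book_xml):
    """Return (row, family, qualifier, value) tuples for one <book> element."""
    book = ET.fromstring(book_xml)
    title = book.findtext("title")
    # Assumption: the word count covers chapter text only.
    words = sum(len((ch.text or "").split()) for ch in book.findall("chapter"))
    entries = [
        (title, "content", "xml", book_xml),
        (title, "metrics", "word count", str(words)),
    ]
    for citation in book.findall("citation"):
        cited = citation.text
        # One edge in each direction, so the graph can be walked from either row.
        entries.append((title, "cites", cited, ""))
        entries.append((cited, "cited by", title, ""))
    return entries


BOOK_XML = (
    "<book>"
    "<title>basket weaving</title>"
    "<author>bob</author>"
    '<chapter number="1">lots of text here</chapter>'
    '<chapter number="2">even more text here</chapter>'
    "<citation>another book</citation>"
    "</book>"
)

entries = book_entries(BOOK_XML)
for row, family, qualifier, value in entries[1:]:
    print(repr(row), repr(family), repr(qualifier), repr(value))
```

Leaving the value empty on the edge entries mirrors the "nothing meaningful" values in the table example: the relationship is carried entirely by the row and qualifier.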
-
Re: XML Storage - Accumulo or HDFS
Josh Elser 2012-06-08, 01:57

For my "inflated key" comment, I'll pull from Eric Newton's comment on the "Table design" thread:

"Accumulo will accommodate keys that are very large (like 100K) but I don't recommend it. It makes indexes big and slows down just about every operation"

As applied to your example, you might generate the following keys if you took the wikisearch approach:

# Represent your document as such: the row "4" being an arbitrary bucket, and the CF "1234abcd" being some unique identifier for your document (a hash of <book> for example)
4 1234abcd:title\x00basket weaving
4 1234abcd:author\x00bob
4 1234abcd:toc\x00stuff
4 1234abcd:citation\x00another book

# Then some indices inside the same row (bucket), creating an in-partition index over the fields of your data. You could also shove the tokenized content from your chapters in here.
4 fi\x00title:basket weaving\x001234abcd
4 fi\x00author:bob\x001234abcd
4 fi\x00toc:stuff\x001234abcd
4 fi\x00citation:another book\x001234abcd

# For those big chapters, store them off to the side, perhaps in their own locality group. Will keep this data in separate files.
4 chapters:1234abcd\x001 Value:byte[chapter one data]
4 chapters:1234abcd\x002 Value:byte[chapter two data]

# Then perhaps some records pointing to data you expect users to query on in a separate table (inverted index)
basket weaving title:4\x001234abcd
bob author:4\x001234abcd
another book citation:4\x001234abcd

- Josh
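To make the key layout above easier to follow, here is a rough Python sketch that generates the same four groups of keys for one <book>. It only builds (row, column family, column qualifier, value) tuples and does not use the Accumulo or wikisearch code; the function name, the SHA-256-based document id, the modulo bucket choice, and num_buckets are all assumptions for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

NUL = "\x00"


def wikisearch_style_keys(book_xml, num_buckets=8):
    """Return (shard_entries, index_entries) as (row, family, qualifier, value) tuples."""
    book = ET.fromstring(book_xml)
    # Assumed id scheme: first 8 hex digits of a hash of the <book> text.
    doc_id = hashlib.sha256(book_xml.encode("utf-8")).hexdigest()[:8]
    bucket = str(int(doc_id, 16) % num_buckets)  # arbitrary bucket, used as the row

    shard, index = [], []
    for child in book:
        value = child.text or ""
        if child.tag == "chapter":
            # Large text goes in the value, under its own column family.
            shard.append((bucket, "chapters", doc_id + NUL + child.get("number"), value))
        else:
            # Document entry: field name and value live in the key.
            shard.append((bucket, doc_id, child.tag + NUL + value, ""))
            # In-partition field index inside the same row.
            shard.append((bucket, "fi" + NUL + child.tag, value + NUL + doc_id, ""))
            # Inverted index entry for a separate table, keyed by the term.
            index.append((value, child.tag, bucket + NUL + doc_id, ""))
    return shard, index


BOOK_XML = (
    "<book>"
    "<title>basket weaving</title>"
    "<author>bob</author>"
    '<chapter number="1">lots of text here</chapter>'
    '<chapter number="2">even more text here</chapter>'
    "<citation>another book</citation>"
    "</book>"
)

shard, index = wikisearch_style_keys(BOOK_XML)
for entry in shard + index:
    print(entry)
```

Note how only the chapter entries carry a non-empty value: the short fields sit in the key where they can be indexed, while the bulky text stays out of the key, which is the point of the "inflated key" warning.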
-
Re: XML Storage - Accumulo or HDFS
Eric Newton 2012-06-08, 02:27

Minor correction:

On Thu, Jun 7, 2012 at 9:57 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> # For those big chapters, store them off to the side, perhaps in their own
> locality group. Will keep this data in separate files.

It is useful to think that Accumulo stores locality groups in separate files, but it doesn't. They are stored in a separate section of the same file. The difference is that we don't create more files for the NameNode to manage.
-
Re: XML Storage - Accumulo or HDFS
Josh Elser 2012-06-08, 02:30

I learned something new today :D. Thanks, Eric.
-
Re: XML Storage - Accumulo or HDFS
Perko, Ralph J 2012-06-08, 14:47

Thank you for the help!
|