There's also the concern of elements of the document that are too large by
themselves. A general-purpose streaming solution would include support for
any kind of object passed in, not just XML with small elements. I think
the fact that it is an XML document is probably a red herring in this case.
In the past, what we have done is solve this on the application side by
breaking up large objects into chunks and then using a key structure that
groups and maintains the order of the chunks. This usually means that we
append a sequence number to the column qualifier using an integer encoding.
The filedata example that Billie referred to does this. Accumulo would
benefit from some sort of general-purpose fragmentation solution for
streaming large objects, and an InputStream/OutputStream solution might be
good for that. Sounds like a fun project!
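
A rough sketch of the chunking idea described above, using only the JDK. This is an illustration under stated assumptions, not the code from the filedata example: the class and method names (ChunkSketch, qualifier, writeChunks, readChunks) are made up, the 4-byte big-endian sequence number is just one possible integer encoding, and a TreeMap sorted by unsigned byte order stands in for an Accumulo table.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

public class ChunkSketch {

    /** Hypothetical qualifier layout: prefix bytes followed by a 4-byte big-endian sequence number. */
    public static byte[] qualifier(String prefix, int seq) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        // Fixed-width big-endian encoding keeps non-negative sequence numbers
        // in numeric order when the keys are sorted byte by byte.
        return ByteBuffer.allocate(p.length + 4).put(p).putInt(seq).array();
    }

    /** Reads the stream chunkSize bytes at a time, so only one chunk is buffered at once. */
    public static void writeChunks(InputStream in, int chunkSize, String prefix,
                                   Map<byte[], byte[]> table) throws IOException {
        byte[] buf = new byte[chunkSize];
        int seq = 0;
        int n;
        while ((n = in.readNBytes(buf, 0, chunkSize)) > 0) {
            // The map is an assumption standing in for a table write.
            table.put(qualifier(prefix, seq++), Arrays.copyOf(buf, n));
        }
    }

    /** Streams the chunks back out in key order, which is also sequence order. */
    public static void readChunks(Map<byte[], byte[]> table, OutputStream out) throws IOException {
        for (byte[] chunk : table.values()) {
            out.write(chunk);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = new byte[1000];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) i;
        }

        // Unsigned lexicographic byte order, as a stand-in for sorted keys.
        Comparator<byte[]> byteOrder = Arrays::compareUnsigned;
        Map<byte[], byte[]> table = new TreeMap<byte[], byte[]>(byteOrder);

        writeChunks(new ByteArrayInputStream(original), 64, "xml~", table);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        readChunks(table, out);
        System.out.println(table.size() + " chunks, round trip ok: "
                + Arrays.equals(original, out.toByteArray()));
    }
}
```

A real version would swap the map for table writes and a scan over the chunk qualifiers, and could hide both directions behind an InputStream/OutputStream pair.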
On Mon, Jun 18, 2012 at 2:06 PM, Marc P. <[EMAIL PROTECTED]> wrote:
> I'm sorry, I must be missing something.
> Why does the schema matter? If you were to build keys from all
> attributes and elements, you could, at any point, rebuild the XML
> document. You could store the hierarchy, by virtue of your keys.
> If you were to do that, the previous suggestions would be applicable.
> Realistically, if you stored the entire XML file into a given
> key/value pair, your heap elements will be borne upon Thrift reception
> (at the client); therefore, streaming would only add complexity and
> additional memory overhead. It wouldn't give you what you want.
> Splitting the file amongst keys can maintain hierarchy, allow you to
> rebuild the XML doc, and store large records into the value.
> On Mon, Jun 18, 2012 at 2:00 PM, David Medinets
> <[EMAIL PROTECTED]> wrote:
> > Thanks for the offer. I'm thinking of a situation where I don't know the
> > schema ahead of time. For example, a JMS queue whose XML I simply want to
> > store somewhere and let some other program parse. This is
> > a thought experiment.
> > On Sun, Jun 17, 2012 at 1:06 PM, Jim Klucar <[EMAIL PROTECTED]> wrote:
> >> David,
> >> Can you give a taste of the schema of the XML? With that we may be
> >> able to help break the XML file up into keys and help create an index
> >> for it. IMHO that's the power you would get from Accumulo. If you just
> >> want it as one big lump, and don't need to search it or only retrieve
> >> portions of the file, then putting it in Accumulo is just adding
> >> overhead to HDFS.
> >> Sent from my iPhone
> >> On Jun 17, 2012, at 9:54 AM, David Medinets <[EMAIL PROTECTED]> wrote:
> >>> Some of the XML records that I work with are over 50M. I was hoping to
> >>> store them inside of Accumulo instead of the text-based HDFS XML super
> >>> file currently being used. However, since they are so large I can't
> >>> create a Value object without running out of memory. Storing values
> >>> this large may simply be using the wrong tool; please let me know.