
Josh Elser 2012-06-07, 12:06
Re: XML Storage - Accumulo or HDFS
To explain my "inflated key" comment, I'll pull from Eric Newton's comment
on the "Table design" thread:

"Accumulo will accommodate keys that are very large (like 100K) but I
don't recommend it. It makes indexes big and slows down just about
everything."

As applied to your example, you might generate the following keys if you
took the wikisearch approach:

# Represent your document as such: the row "4" being an arbitrary
bucket, and the CF "1234abcd" being some unique identifier for your
document (a hash of <book> for example)

4   1234abcd:title\x00basket weaving
4   1234abcd:author\x00bob
4   1234abcd:toc\x00stuff
4   1234abcd:citation\x00another book
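To make the layout concrete, here's an illustrative Python sketch (not an Accumulo client call) that builds those (row, column family, column qualifier) triples; `document_entries` and `doc_id_of` are hypothetical helper names, and the md5-based id stands in for whatever hashing scheme you actually use:

```python
import hashlib

NULL = "\x00"  # null-byte separator, as in the key examples above

def doc_id_of(book_xml: str) -> str:
    # Hypothetical: derive a unique document id from the <book> content
    return hashlib.md5(book_xml.encode("utf-8")).hexdigest()[:8]

def document_entries(bucket: str, doc_id: str, fields: dict) -> list:
    """Document portion of the layout: row = bucket, CF = doc id,
    CQ = field name \x00 field value."""
    return [(bucket, doc_id, f"{name}{NULL}{value}")
            for name, value in fields.items()]

entries = document_entries("4", "1234abcd",
                           {"title": "basket weaving", "author": "bob"})
```

`entries` then contains tuples like `("4", "1234abcd", "title\x00basket weaving")`, matching the lines above.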

# Then some indices inside the same row (bucket), creating an
in-partition index over the fields of your data. You could also shove
the tokenized content from your chapters in here.
4   fi\x00title:basket weaving\x001234abcd
4   fi\x00author:bob\x001234abcd
4   fi\x00toc:stuff\x001234abcd
4   fi\x00citation:another book\x001234abcd
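The "fi" entries follow the same pattern with the value and doc id swapped into the qualifier; a sketch, again with a hypothetical helper name:

```python
NULL = "\x00"

def field_index_entries(bucket: str, doc_id: str, fields: dict) -> list:
    """In-partition ("fi") index: row = bucket (same row as the document),
    CF = "fi" \x00 field name, CQ = field value \x00 doc id."""
    return [(bucket, f"fi{NULL}{name}", f"{value}{NULL}{doc_id}")
            for name, value in fields.items()]
```

Because these land in the same row as the document itself, a scan within one bucket can intersect index entries and then fetch the matching documents without leaving the tablet.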

# For those big chapters, store them off to the side, perhaps in their
own locality group, which will keep this data in separate files.
4 chapters:1234abcd\x001    Value:byte[chapter one data]
4 chapters:1234abcd\x002    Value:byte[chapter two data]
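The locality group can be set up from the Accumulo shell with the table.group.* properties; the table name "books" here is hypothetical, and the compaction rewrites existing data into the new group's files:

```
config -t books -s table.group.chapters=chapters
config -t books -s table.groups.enabled=chapters
compact -t books
```

After that, scans that don't request the "chapters" family never have to read the chapter data off disk.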

# Then perhaps some records pointing to data you expect users to query
on in a separate table (inverted index)
basket weaving    title:4\x001234abcd
bob    author:4\x001234abcd
another book    citation:4\x001234abcd
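And the global inverted-index entries for that separate table, as a sketch (`inverted_index_entries` is a name I'm introducing for illustration):

```python
NULL = "\x00"

def inverted_index_entries(bucket: str, doc_id: str, fields: dict) -> list:
    """Global inverted index (separate table): row = field value,
    CF = field name, CQ = bucket \x00 doc id (a pointer back into
    the document table)."""
    return [(value, name, f"{bucket}{NULL}{doc_id}")
            for name, value in fields.items()]
```

A lookup on "bob" in this table tells you which bucket/doc pairs to fetch from the main table, so you only scan the partitions that actually contain hits.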

- Josh

On 6/7/2012 10:48 AM, Perko, Ralph J wrote:
> My use-case is very similar to the Wikipedia example. I'm not sure what
> you mean by the inflated key.  Can you expand on that?  I am not really
> pulling out individual elements/attributes to simply store them apart from
> the XML.  Any element I pull out is part of a larger analytic process and
> it is this result I store.  I am doing some graph work based on
> relationships between elements.
> Example:
> <books>
>    <book>
>      <title>basket weaving</title>
>      <author>bob</author>
>      <toc>...</toc>
>      <chapter number="1">lots of text here</chapter>
>      <chapter number="2">even more text here</chapter>
>      <citation>another book</citation>
>    </book>
> </books>
> Each "book" is a record.  The book title is the row id.  The content is
> the XML<book>..</book>
> My table then has other columns such as "word count" or "character count"
> stored in the table.
> Table example:
> Row: basket weaving
> Col family: content
> Col qual: xml
> Value:<book>...</book>
> Row: basket weaving
> Col family: metrics
> Col qual: word count
> Value: 12345
> Row: basket weaving
> Col family:cites
> Col qual: another book
> Value: -- nothing meaningful
> Row: another book
> Col family:cited by
> Col qual: basket weaving
> Value: -- nothing meaningful
> I use the "cites" and "cited by" qualifiers for graphs
> On 6/6/12 7:50 PM, "Josh Elser"<[EMAIL PROTECTED]>  wrote:
>> +1, Bill. Assuming you aren't doing anything crazy in your XML files,
>> the wikipedia example should get you pretty far. That being said, the
>> structure used in the wikipedia example doesn't handle large lists of
>> elements -- short explanation: an attribute of a document is stored as
>> one key-value pair, so if you have a lot of large lists, you inflate the
>> key which does bad things. That in mind, there are small changes you can
>> make to the table structure to store those lists more efficiently and
>> still maintain the semantic representation (Bill's graph comment).
>> David, ignoring any issues of data locality of the blocks in your large
>> XML files, storing byte offsets into a hierarchical data structure (XML)
>> seems like a sub-optimal solution to me. Aside from losing the hierarchy
>> knowledge, if you have a skewed distribution of elements in the XML
>> document, you can't get good locality in your query/analytic. What was
>> your idea behind storing the offsets?
>> - Josh
>> On 6/6/2012 10:19 PM, William Slacum wrote: