Re: XML Storage - Accumulo or HDFS
My "inflated key" comment, I'll pull from Eric Newton's comment on the
"Table design" thread:

"Accumulo will accomodate keys that are very large (like 100K) but I
don't recommend it. It makes indexes big and slows down just about every
operation"

As applied to your example, you might generate the following keys if you
took the wikisearch approach:

# Represent your document like so: the row "4" is an arbitrary
bucket, and the CF "1234abcd" is some unique identifier for your
document (a hash of the <book> element, for example)

4   1234abcd:title\x00basket weaving
4   1234abcd:author\x00bob
4   1234abcd:toc\x00stuff
4   1234abcd:citation\x00another book

# Then some indices inside the same row (bucket), creating an
in-partition index over the fields of your data. You could also shove
the tokenized content from your chapters in here.
4   fi\x00title:basket weaving\x001234abcd
4   fi\x00author:bob\x001234abcd
4   fi\x00toc:stuff\x001234abcd
4   fi\x00citation:another book\x001234abcd
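
If it helps to see this concretely, here's a rough sketch of building
those entries with the Java client. I'm writing against the 1.4-era
API; the "docs" table name, the Connector setup, and the empty values
are my own assumptions, not anything pulled from the wikisearch code:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// assume 'conn' is a Connector you already built from a ZooKeeperInstance
Value empty = new Value(new byte[0]);

// document entries: row = bucket, CF = doc id, CQ = field\x00value
Mutation m = new Mutation(new Text("4"));
m.put(new Text("1234abcd"), new Text("title\u0000basket weaving"), empty);
m.put(new Text("1234abcd"), new Text("author\u0000bob"), empty);

// in-partition index entries, same row: CF = fi\x00field, CQ = value\x00doc id
m.put(new Text("fi\u0000title"), new Text("basket weaving\u00001234abcd"), empty);
m.put(new Text("fi\u0000author"), new Text("bob\u00001234abcd"), empty);

// maxMemory / maxLatency / thread counts here are arbitrary
BatchWriter bw = conn.createBatchWriter("docs", 1000000L, 1000L, 2);
bw.addMutation(m);
bw.close();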

# For those big chapters, store them off to the side, perhaps in their
own locality group. That will keep this data in separate files.
4 chapters:1234abcd\x001    Value:byte[chapter one data]
4 chapters:1234abcd\x002    Value:byte[chapter two data]
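
Setting up that locality group is a one-time table configuration call
(sketching again; the "chapterData" group name and the "docs" table are
made up, and the checked exceptions are omitted):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.io.Text;

// put the 'chapters' column family in its own locality group so the big
// chapter values get written to separate sections of the table's files
Map<String, Set<Text>> groups = new HashMap<String, Set<Text>>();
groups.put("chapterData", Collections.singleton(new Text("chapters")));
conn.tableOperations().setLocalityGroups("docs", groups);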

# Then perhaps some records pointing to data you expect users to query
on in a separate table (inverted index)
basket weaving    title:4\x001234abcd
bob    author:4\x001234abcd
another book    citation:4\x001234abcd
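
Writing those is the same pattern against the second table (the
"docIndex" name is another assumption; 'conn' and 'empty' are as in the
first sketch):

// global inverted index: row = field value, CF = field, CQ = bucket\x00doc id
Mutation idx = new Mutation(new Text("basket weaving"));
idx.put(new Text("title"), new Text("4\u00001234abcd"), empty);

BatchWriter iw = conn.createBatchWriter("docIndex", 1000000L, 1000L, 2);
iw.addMutation(idx);
iw.close();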

- Josh

On 6/7/2012 10:48 AM, Perko, Ralph J wrote:
> My use-case is very similar to the Wikipedia example. I'm not sure what
> you mean by the inflated key.  Can you expand on that?  I am not really
> pulling out individual elements/attributes simply to store them apart from
> the XML.  Any element I pull out is part of a larger analytic process, and
> it is this result I store.  I am doing some graph work based on
> relationships between elements.
>
> Example:
>
> <books>
>    <book>
>      <title>basket weaving</title>
>      <author>bob</author>
>      <toc>...</toc>
>      <chapter number="1">lots of text here</chapter>
>      <chapter number="2">even more text here</chapter>
>      <citation>another book</citation>
>    </book>
> </books>
>
>
> Each "book" is a record.  The book title is the row id.  The content is
> the XML<book>..</book>
>
> My table then has other columns such as "word count" or "character count"
> stored in the table.
>
> Table example:
>
> Row: basket weaving
> Col family: content
> Col qual: xml
> Value: <book>...</book>
>
>
> Row: basket weaving
> Col family: metrics
> Col qual: word count
> Value: 12345
>
> Row: basket weaving
> Col family:cites
> Col qual: another book
> Value: -- nothing meaningful
>
>
> Row: another book
> Col family:cited by
> Col qual: basket weaving
> Value: -- nothing meaningful
>
> I use the "cites" and "cited by" qualifiers for graphs
>
>
>
> On 6/6/12 7:50 PM, "Josh Elser" <[EMAIL PROTECTED]> wrote:
>
>> +1, Bill. Assuming you aren't doing anything crazy in your XML files,
>> the wikipedia example should get you pretty far. That being said, the
>> structure used in the wikipedia example doesn't handle large lists of
>> elements -- short explanation: an attribute of a document is stored as
>> one key-value pair, so if you have a lot of large lists, you inflate the
>> key, which does bad things. With that in mind, there are small changes you can
>> make to the table structure to store those lists more efficiently and
>> still maintain the semantic representation (Bill's graph comment).
>>
>> David, ignoring any issues of data locality of the blocks in your large
>> XML files, storing byte offsets into a hierarchical data structure (XML)
>> seems like a sub-optimal solution to me. Aside from losing the hierarchy
>> knowledge, if you have a skewed distribution of elements in the XML
>> document, you can't get good locality in your query/analytic. What was
>> your idea behind storing the offsets?
>>
>> - Josh
>>
>> On 6/6/2012 10:19 PM, William Slacum wrote: