Accumulo >> mail # user >> XML Storage - Accumulo or HDFS


Thread:
  Perko, Ralph J    2012-06-06, 20:20
  William Slacum    2012-06-07, 01:57
  David Medinets    2012-06-07, 02:06
  William Slacum    2012-06-07, 02:19
  Josh Elser        2012-06-07, 02:50
  David Medinets    2012-06-07, 11:29
  Josh Elser        2012-06-07, 12:06
  Perko, Ralph J    2012-06-07, 14:48
Re: XML Storage - Accumulo or HDFS
To explain my "inflated key" comment, I'll pull from Eric Newton's comment on
the "Table design" thread:

"Accumulo will accommodate keys that are very large (like 100K) but I
don't recommend it. It makes indexes big and slows down just about every
operation."

As applied to your example, you might generate the following keys if you
took the wikisearch approach:

# Represent your document as such: the row "4" being an arbitrary
bucket, and the CF "1234abcd" being some unique identifier for your
document (a hash of <book> for example)

4   1234abcd:title\x00basket weaving
4   1234abcd:author\x00bob
4   1234abcd:toc\x00stuff
4   1234abcd:citation\x00another book

# Then some indices inside the same row (bucket), creating an
in-partition index over the fields of your data. You could also shove
the tokenized content from your chapters in here.
4   fi\x00title:basket weaving\x001234abcd
4   fi\x00author:bob\x001234abcd
4   fi\x00toc:stuff\x001234abcd
4   fi\x00citation:another book\x001234abcd

# For those big chapters, store them off to the side, perhaps in their
own locality group, which will keep this data in separate files.
4 chapters:1234abcd\x001    Value:byte[chapter one data]
4 chapters:1234abcd\x002    Value:byte[chapter two data]

# Then perhaps some records pointing to data you expect users to query
on in a separate table (inverted index)
basket weaving    title:4\x001234abcd
bob    author:4\x001234abcd
another book    citation:4\x001234abcd
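Putting the pieces together, here's a rough Python sketch of how those entries
might be generated (the wikisearch example itself is Java; the shard count, the
hash-based document id, and the helper name are just illustrative assumptions,
not the actual wikisearch code):

```python
import hashlib

NUM_SHARDS = 10  # arbitrary bucket count; documents are spread across rows 0..9


def doc_entries(fields, chapters):
    """Return (row, colfam, colqual, value) tuples for one document, plus
    the entries destined for a separate inverted-index table."""
    # Unique id for the document, e.g. a hash of the serialized <book> element
    doc_id = hashlib.md5(repr(sorted(fields.items())).encode()).hexdigest()[:8]
    row = str(int(doc_id, 16) % NUM_SHARDS)  # the bucket, "4" in the example

    entries = []
    for field, value in fields.items():
        # Document entry: CF = doc id, CQ = field\x00value
        entries.append((row, doc_id, f"{field}\x00{value}", b""))
        # In-partition "fi" index: CF = fi\x00field, CQ = value\x00doc id
        entries.append((row, f"fi\x00{field}", f"{value}\x00{doc_id}", b""))
    # Big chapters off to the side, in their own locality group
    for num, text in chapters.items():
        entries.append((row, "chapters", f"{doc_id}\x00{num}", text.encode()))
    # Inverted index in a separate table: row = value, CF = field,
    # CQ = shard\x00doc id
    index = [(value, field, f"{row}\x00{doc_id}", b"")
             for field, value in fields.items()]
    return entries, index
```

In a real ingest job each tuple would become an Accumulo Mutation written
through a BatchWriter; the point here is only the key layout.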

- Josh

On 6/7/2012 10:48 AM, Perko, Ralph J wrote:
> My use-case is very similar to the Wikipedia example. I'm not sure what
> you mean by the inflated key.  Can you expand on that?  I am not really
> pulling out individual elements/attributes to simply store them apart from
> the XML.  Any element I pull out is part of a larger analytic process and
> it is this result I store.  I am doing some graph work based on
> relationships between elements.
>
> Example:
>
> <books>
>    <book>
>      <title>basket weaving</title>
>      <author>bob</author>
>      <toc>…</toc>
>      <chapter number="1">lots of text here</chapter>
>      <chapter number="2">even more text here</chapter>
>      <citation>another book</citation>
>    </book>
> </books>
>
>
> Each "book" is a record.  The book title is the row id.  The content is
> the XML<book>..</book>
>
> My table then has other columns such as "word count" or "character count"
> stored in the table.
>
> Table example:
>
> Row: basket weaving
> Col family: content
> Col qual: xml
> Value:<book>…</book>
>
>
> Row: basket weaving
> Col family: metrics
> Col qual: word count
> Value: 12345
>
> Row: basket weaving
> Col family:cites
> Col qual: another book
> Value: -- nothing meaningful
>
>
> Row: another book
> Col family:cited by
> Col qual: basket weaving
> Value: -- nothing meaningful
>
> I use the "cites" and "cited by" qualifiers for graphs.
>
>
>
> On 6/6/12 7:50 PM, "Josh Elser"<[EMAIL PROTECTED]>  wrote:
>
>> +1, Bill. Assuming you aren't doing anything crazy in your XML files,
>> the wikipedia example should get you pretty far. That being said, the
>> structure used in the wikipedia example doesn't handle large lists of
>> elements -- short explanation: an attribute of a document is stored as
>> one key-value pair, so if you have a lot of large lists, you inflate the
>> key which does bad things. That in mind, there are small changes you can
>> make to the table structure to store those lists more efficiently and
>> still maintain the semantic representation (Bill's graph comment).
>>
>> David, ignoring any issues of data locality of the blocks in your large
>> XML files, storing byte offsets into a hierarchical data structure (XML)
>> seems like a sub-optimal solution to me. Aside from losing the hierarchy
>> knowledge, if you have a skewed distribution of elements in the XML
>> document, you can't get good locality in your query/analytic. What was
>> your idea behind storing the offsets?
>>
>> - Josh
>>
>> On 6/6/2012 10:19 PM, William Slacum wrote:
  Eric Newton       2012-06-08, 02:27
  Josh Elser        2012-06-08, 02:30
  Perko, Ralph J    2012-06-08, 14:47