Accumulo >> mail # user >> XML Storage - Accumulo or HDFS
Re: XML Storage - Accumulo or HDFS
So, if your XML looks like the snippet you posted, it's extremely easy
to fetch records based on the KEY_FIELD or TAG element. A (relatively)
flat XML document is fairly trivial to map onto the wikipedia example.
As I said previously, it gets trickier when the structure is deep, or
both deep and wide.
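
To make the "flat is easy" point concrete, here is a rough Python sketch of
turning each flat RECORD into row/column/value cells keyed on KEY_FIELD.
The <RECORDS> wrapper, the sample values, the "field" column family and the
function names are all my own assumptions for illustration; this is not the
wikipedia example's actual schema, and it does not touch the Accumulo client
API at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the element names come from the thread,
# the values and the <RECORDS> wrapper are invented for illustration.
SAMPLE = """<RECORDS>
  <RECORD><KEY_FIELD>k1</KEY_FIELD><TAG>red</TAG></RECORD>
  <RECORD><KEY_FIELD>k2</KEY_FIELD><TAG>blue</TAG></RECORD>
</RECORDS>"""


def record_to_cells(record):
    """Turn one flat RECORD element into (row, family, qualifier, value) tuples.

    The row is the KEY_FIELD text; every other child element becomes one cell.
    The "field" column family is an assumed name, not a required one.
    """
    row = record.findtext("KEY_FIELD")
    return [
        (row, "field", child.tag, child.text or "")
        for child in record
        if child.tag != "KEY_FIELD"
    ]


def flatten(xml_text):
    """Parse the document and collect the cells for every RECORD in order."""
    root = ET.fromstring(xml_text)
    cells = []
    for record in root.iter("RECORD"):
        cells.extend(record_to_cells(record))
    return cells
```

With a deep structure you would have to decide how to encode the path to
each element in the qualifier, which is where it stops being trivial.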

If your requirement is to find the /n/ RECORDs before and after a given
RECORD, then yes, the wikipedia example on its own wouldn't make much
sense; however, you could add an attribute to each RECORD that records
its position in the original file, which would alleviate this problem.
From an application standpoint, it usually doesn't make sense to index
documents purely on their positional information in the source data
(as you suggested with the byte file offsets) because that's not how
you're going to want to query it in your application. I would assume
you'd want to be querying on KEY_FIELD or TAG.
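
As a rough illustration of the positional-attribute idea, the Python sketch
below tags each RECORD with its ordinal position and source file name, then
looks up the /n/ neighbours of a record found by KEY_FIELD. The in-memory
dict and list stand in for whatever table you would really use, and the
sample values, the <RECORDS> wrapper and the function names are assumptions
of mine, not anything from the wikipedia example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: element names are from the thread,
# values and the <RECORDS> wrapper are invented for illustration.
SAMPLE = """<RECORDS>
  <RECORD><KEY_FIELD>a</KEY_FIELD><TAG>t1</TAG></RECORD>
  <RECORD><KEY_FIELD>b</KEY_FIELD><TAG>t2</TAG></RECORD>
  <RECORD><KEY_FIELD>c</KEY_FIELD><TAG>t3</TAG></RECORD>
  <RECORD><KEY_FIELD>d</KEY_FIELD><TAG>t4</TAG></RECORD>
  <RECORD><KEY_FIELD>e</KEY_FIELD><TAG>t5</TAG></RECORD>
</RECORDS>"""


def index_with_position(xml_text, source_name):
    """Index records by KEY_FIELD, keeping each record's position in the file."""
    root = ET.fromstring(xml_text)
    by_key = {}        # primary lookup: KEY_FIELD -> entry
    by_position = []   # secondary lookup: ordinal position -> entry
    for pos, record in enumerate(root.iter("RECORD")):
        entry = {
            "key": record.findtext("KEY_FIELD"),
            "tag": record.findtext("TAG"),
            "source": source_name,   # which XML file the record came from
            "position": pos,         # 0-based ordinal within that file
        }
        by_key[entry["key"]] = entry
        by_position.append(entry)
    return by_key, by_position


def neighbors(by_key, by_position, key, n):
    """Return (up to n records before, up to n records after) the given key."""
    pos = by_key[key]["position"]
    start = max(0, pos - n)
    return by_position[start:pos], by_position[pos + 1:pos + 1 + n]
```

The query still starts from KEY_FIELD; the position is only an extra
attribute used to walk to the surrounding records afterwards.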

- Josh

On 6/7/12 7:29 AM, David Medinets wrote:
> On Wed, Jun 6, 2012 at 10:50 PM, Josh Elser<[EMAIL PROTECTED]>  wrote:
>>   Aside from losing the hierarchy
>> knowledge, if you have a skewed distribution of elements in the XML
>> document, you can't get good locality in your query/analytic. What was your
>> idea behind storing the offsets?
>   <RECORD>
>    <KEY_FIELD/>
>    <TAG/>
>   </RECORD>
>   <RECORD>
>    <KEY_FIELD/>
>    <TAG/>
>   </RECORD>
> My XML looks like that. I don't know how the information in the XML
> will be used in the future and I don't want to re-scan large numbers
> of XML files to find a single record. For example, yesterday we found a
> potential bug. My bug analysis showed the source data was in record X
> of 450,000 records. Since I know which XML file held that record, I
> was able to get that file locally and use command-line tools to find
> surrounding information. My XML file might have 200 tags but normally
> I only need 45 of them. My XML is without hierarchy.