Re: sorting in Accumulo
You could ingest this data into Accumulo using the following "schema":

row:       timestamp
colfam:  "record"
colqual: md5(JSON)
value:   JSON record

Accumulo would sort this for you in lexicographic order by timestamp
(stored as a string). If all of your epoch timestamps have the same
number of digits, lexicographic order equals numeric order. If that is
not the case for you, you could convert your timestamps to strings
using the following template (with each field zero padded to its max
length):

${year}${month}${day}${hour}${minute}${second}
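
To make the padding point concrete, here is a minimal sketch in Python (my choice of language, not something from the original question). The helper names `padded_epoch` and `row_key` are made up for illustration, and I am assuming the timestamps should be interpreted as UTC.

```python
from datetime import datetime, timezone

def padded_epoch(epoch_seconds: int, width: int = 10) -> str:
    """Zero-pad raw epoch seconds so string order matches numeric order."""
    return str(epoch_seconds).zfill(width)

def row_key(epoch_seconds: int) -> str:
    """Render epoch seconds as yyyyMMddHHmmss (UTC), each field zero padded."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y%m%d%H%M%S")

timestamps = [1000000000, 999999999, 1331036100]

# Decimal strings of different lengths do NOT sort numerically:
# "999999999" lands after "1331036100" because '9' > '1'.
unpadded = sorted(str(t) for t in timestamps)

# Either fixed-width encoding sorts the same way the integers do.
by_padded = sorted(timestamps, key=padded_epoch)
by_row_key = sorted(timestamps, key=row_key)
```

Either encoding would work as the row; the template form is just easier to read when you eyeball the table.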

The md5(JSON) is there because I assume some of your events could have
the same timestamp.  If you could have events that are exactly the same
(and you need to track this), you may want to append a one-up counter
to the md5 just to guarantee that you won't overwrite duplicates.
Without the md5 (or another similar mechanism), Accumulo would
overwrite any previously stored values with the exact same [row,
colfam, colqual, colvis].
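
A rough sketch of how the keys could be built, again in Python. This does not use the Accumulo client API: a plain dict stands in for the table so the overwrite behaviour is visible, column visibility is left out, and `make_key` is a hypothetical helper. The in-memory counter is also an assumption that only holds for a single ingest process.

```python
import hashlib
from collections import Counter

COLFAM = "record"

def make_key(row: str, json_record: str, seen: Counter) -> tuple:
    """Build (row, colfam, colqual) with colqual = md5(JSON) plus a one-up counter."""
    digest = hashlib.md5(json_record.encode("utf-8")).hexdigest()
    base = (row, COLFAM, digest)
    n = seen[base]          # how many identical records we have already keyed
    seen[base] += 1
    return (row, COLFAM, f"{digest}-{n:04d}")

# Three events in the same second; the first and third are exact duplicates.
records = [
    ("20120306121500", '{"event": "a"}'),
    ("20120306121500", '{"event": "b"}'),
    ("20120306121500", '{"event": "a"}'),
]

# Keyed by row alone, later events replace earlier ones.
row_only = {}
for row, rec in records:
    row_only[row] = rec

# Keyed by row + md5 + counter, every event survives.
table = {}
seen = Counter()
for row, rec in records:
    table[make_key(row, rec, seen)] = rec
```

With the row alone only one record is left; with the md5 and the counter all three are kept, including the exact duplicate.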

Iterating in temporal order would just be a simple full table scan.
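
To illustrate what that scan gives you, here is a small simulation in Python. The dict and the `scan` generator are stand-ins for a real table and scanner (they are not Accumulo calls), the colqual values are placeholders rather than real md5 digests, and `handle` is a hypothetical placeholder for whatever function you want to run on the stream.

```python
import json

def scan(table: dict):
    """Yield (key, value) pairs in sorted key order, as a full table scan would."""
    for key in sorted(table):
        yield key, table[key]

# Entries inserted out of order; keys are (row, colfam, colqual).
table = {
    ("20120306121502", "record", "c3"): '{"ts": 1331036102}',
    ("20120306121500", "record", "a1"): '{"ts": 1331036100}',
    ("20120306121501", "record", "b2"): '{"ts": 1331036101}',
}

seen_ts = []

def handle(record: dict) -> None:
    """Placeholder for the function to run on the simulated stream."""
    seen_ts.append(record["ts"])

for _key, value in scan(table):
    handle(json.loads(value))
```

Because the rows are fixed-width timestamps, the records reach `handle` in temporal order regardless of the order they were written in.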

I hope this helps.

--Jason

On Tue, Mar 6, 2012 at 12:15 PM, John R. Frank <[EMAIL PROTECTED]> wrote:
> Accumulo Experts,
>
> Is there an example of working with a time-ordered stream in Accumulo?
>
>
> Given:
>        ~500M JSON records each about 30kb
>        each record has a timestamp field (seconds since the epoch)
>
>
> Goal:
>        iterate over all records in temporal order
>        run some function on this simulated stream
>
>
> Thanks for any pointers or advice!
>
> John