John R. Frank 2012-03-06, 17:15
Accumulo Experts,
Is there an example of working with a time-ordered stream in Accumulo?

Given:
  ~500M JSON records, each about 30 KB
  each record has a timestamp field (seconds since the epoch)

Goal:
  iterate over all records in temporal order
  run some function on this simulated stream

Thanks for any pointers or advice!
John
-
Re: sorting in Accumulo
Jason Trost 2012-03-06, 18:06
You could ingest this data into Accumulo using the following "schema":

  row: timestamp
  colfam: "record"
  colqual: md5(JSON)
  value: JSON record
Accumulo would sort this for you in lexicographical order by timestamp (stored as a string). Depending on the range your data comes from, if all the epoch timestamps are the same length, then lexicographical sorting should equal numeric sorting. If this is not the case for you, then you could convert your timestamps to a string using the following template (with each field zero padded to its max length):

  ${year}${month}${day}${hour}${minute}${second}

The md5(JSON) is there b/c I assume some of your events could have the same timestamp. If you could have events that are exactly the same (and you need to track this), you may want to append a one-up counter to the md5 just to guarantee that you won't overwrite duplicates. Without the md5 (or another similar mechanism), Accumulo would overwrite any previously stored values with the exact same [row, colfam, colqual, colvis].
Iterating in temporal order would just be a simple full table scan.
I hope this helps.
--Jason
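
A minimal sketch of building the key parts for the schema above, in plain Java with no Accumulo dependency. The class and method names (RowKeySketch, rowFor, md5Hex) are hypothetical, and the 10-digit width is an assumption that covers epoch seconds up to 9999999999; adjust it to your data.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Accumulo: builds the row and colqual strings.
final class RowKeySketch {

    // Zero-pad epoch seconds to a fixed width (assumed: 10 digits) so that
    // lexicographic order of the strings matches numeric order.
    static String rowFor(long epochSeconds) {
        if (epochSeconds < 0) {
            throw new IllegalArgumentException("negative timestamp: " + epochSeconds);
        }
        return String.format("%010d", epochSeconds);
    }

    // Hex-encoded MD5 of the JSON text, used as the column qualifier so that
    // distinct records sharing a timestamp get distinct keys.
    static String md5Hex(String json) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(json.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String json = "{\"timestamp\": 999999999}";
        System.out.println("row:     " + rowFor(999999999L));
        System.out.println("colfam:  record");
        System.out.println("colqual: " + md5Hex(json));
    }
}
```

Without the padding, the string "999999999" would sort after "1000000000"; with it, the earlier second sorts first.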
-
Re: sorting in Accumulo
Keith Turner 2012-03-06, 18:26
If you want to sort in descending order, you can make the row (Long.MAX_VALUE - timestamp). Still make this fixed width.
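
A rough sketch of the descending-order row, assuming non-negative timestamps. The class and method names are hypothetical. Long.MAX_VALUE has 19 decimal digits, so padding to 19 keeps the width fixed.

```java
// Hypothetical helper, not part of Accumulo: newest records sort first.
final class DescendingKeySketch {

    // Subtracting from Long.MAX_VALUE inverts the order; zero-padding to
    // 19 digits (the length of Long.MAX_VALUE) keeps the strings fixed width.
    static String descendingRow(long timestamp) {
        if (timestamp < 0) {
            throw new IllegalArgumentException("negative timestamp: " + timestamp);
        }
        return String.format("%019d", Long.MAX_VALUE - timestamp);
    }

    public static void main(String[] args) {
        // The later timestamp yields the lexicographically smaller row.
        System.out.println(descendingRow(1331054100L));
        System.out.println(descendingRow(1331054101L));
    }
}
```

A scan over rows built this way returns the most recent records first.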
-
Re: sorting in Accumulo
David Medinets 2012-03-06, 18:45
> The md5(JSON) is there b/c I assume some of your events could have the
> same timestamp.
So the MD5 acts as a uuid? What is the chance of a value collision? Is the chance calculable?
-
Re: sorting in Accumulo
Keith Turner 2012-03-06, 19:00
Another way around the duplicate issue that Jason pointed out is to modify the Versioning iterator to keep more than one version. You could set max versions to MAX_LONG. Do this instead of putting the md5 in the key. This way, even if the timestamp is the same you will still keep the data.
The only problem with this is that if you insert the exact same column/value in a mutation twice, only one will be kept, as described in ACCUMULO-227. Otherwise all versions of a key will be kept.
-
Re: sorting in Accumulo
Billie J Rinaldi 2012-03-06, 19:01
On Tuesday, March 6, 2012 1:45:52 PM, "David Medinets" <[EMAIL PROTECTED]> wrote:
> So the MD5 acts as a uuid? What is the chance of a value collision? Is
> the chance calculable?
Essentially the timestamp and md5 together are acting as a UUID. You would have to evaluate for your data set the chance of a collision for any given timestamp. If it's unacceptable, a different hash could be used.
Billie
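
On whether the chance is calculable: one common estimate is the birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)), which assumes the hash output behaves like uniformly random bits. The sketch below uses hypothetical names. Since collisions only matter among records that share a timestamp, plugging in the whole 500M-record corpus for n gives a pessimistic figure.

```java
// Hypothetical helper: birthday-bound estimate of a hash collision.
final class CollisionSketch {

    // Approximate probability that at least two of n items collide under a
    // uniformly distributed hash of the given bit length.
    static double collisionProbability(double n, int bits) {
        double exponent = -(n * (n - 1.0)) / (2.0 * Math.pow(2.0, bits));
        // expm1 keeps precision when the probability is extremely small.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        // All 500M records treated as if they shared one timestamp (pessimistic).
        System.out.println(collisionProbability(500e6, 128));
    }
}
```

For a 128-bit hash such as MD5 and n = 500M this comes out on the order of 1e-22. That figure covers accidental collisions only, not deliberately constructed ones.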
-
Re: sorting in Accumulo
John R. Frank 2012-03-09, 14:52
On Tue, Mar 6, 2012 at 1:06 PM, Jason Trost <[EMAIL PROTECTED]> wrote:
> You could ingest this data into accumulo using the following "schema"
>
> row: timestamp
> colfam: "record"
> colqual: md5(JSON)
> value: JSON record

We do have records with the same timestamp, so yes, collisions occur at that level.

We also have a "stream_id" field which is a unique ID constructed from the integer timestamp and the md5 of the abs_url from which the content was fetched -- for our corpus that is sufficiently unique that collisions occur with essentially zero probability.

  stream_id = 123456789-AAAABBBBCCCCDDDDEEEEFFFF0000
              ^^^^^^^^^
              timestamp
I could convert the stream_id to be zero padded to the left to ensure that the integer is always fixed length. If we do that, do we need colqual?
Sounds like this schema would be sufficient for sorting in temporal order with no meaningful order within a given second -- that would be fine for our purposes.

  row: stream_id
  colfam: "record"
  value: JSON record

Thanks for all the responses!
jrf
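
A small sketch of the left zero-padding described above, assuming the stream_id always has the form <integer timestamp>-<hex digest> and that 10 digits is wide enough for the timestamps in the corpus. The class and method names are hypothetical.

```java
// Hypothetical helper: makes the timestamp prefix of a stream_id fixed width.
final class StreamIdSketch {

    // Splits at the first '-', zero-pads the timestamp to 10 digits
    // (assumed width), and reattaches the rest unchanged.
    static String padStreamId(String streamId) {
        int dash = streamId.indexOf('-');
        if (dash < 0) {
            throw new IllegalArgumentException("no '-' in stream_id: " + streamId);
        }
        long timestamp = Long.parseLong(streamId.substring(0, dash));
        return String.format("%010d", timestamp) + streamId.substring(dash);
    }

    public static void main(String[] args) {
        System.out.println(padStreamId("123456789-AAAABBBBCCCCDDDDEEEEFFFF0000"));
    }
}
```

Used as the row, the padded stream_id sorts by second first and then by digest, which gives an arbitrary but stable order within a second.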
-
Re: sorting in Accumulo
Billie J Rinaldi 2012-03-09, 16:55
On Friday, March 9, 2012 9:52:11 AM, "John R. Frank" <[EMAIL PROTECTED]> wrote:
> I could convert the stream_id to be zero padded to the left to ensure
> that the integer is always fixed length. If we do that, do we need colqual?
Yes, if the unique ID is in the row you could leave the column qualifier empty.
Billie