Yes. It will be less useful if you can't scan only the newest data, as you'll be recombining the same pieces of data on subsequent runs.
On Fri, May 16, 2014 at 1:54 PM, David Medinets <[EMAIL PROTECTED]> wrote:
Yes, the data has not yet been ingested. I can control the table structure; hopefully by integrating (or extending) the D4M schema.
I'm leaning towards using https://github.com/addthis/stream-lib as part of the ingest process. On startup, existing tables would be analyzed to find their cardinality. Then, as records are ingested, the cardinality would be adjusted as needed. I don't yet know how to store the cardinality information so that restarting the ingest process doesn't require re-processing all the data. Still researching.
On Fri, May 16, 2014 at 4:19 PM, Corey Nolet <[EMAIL PROTECTED]> wrote:
Whoops, sorry for the empty response, but I'm new to e-mail. The bitset within HLL supports union and intersection. You should be able to estimate cardinality without re-reading the data. In effect, you can segment your estimation and keep the error below roughly 2%.
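As a rough illustration of the point about segmenting the estimate, here is a minimal sketch of a HyperLogLog-style structure written from scratch in Python. It does not use stream-lib, and the names `TinyHLL`, `union`, `to_bytes`, and `intersection_estimate` are hypothetical, chosen only for this example. It assumes SHA-1 truncated to 64 bits as the hash and 2^12 registers. It shows that two separately built sketches can be merged by taking the register-wise maximum, that the registers can be saved as bytes and restored without re-reading the data, and that an intersection can be estimated as |A| + |B| - |A UNION B|.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog with 2**p one-byte registers (illustrative only)."""

    def __init__(self, p=12, registers=None):
        self.p = p                      # assumes p >= 7 so the alpha formula applies
        self.m = 1 << p
        self.registers = bytearray(registers) if registers is not None else bytearray(self.m)
        if len(self.registers) != self.m:
            raise ValueError("register count does not match precision")

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash value
        idx = h >> (64 - self.p)                        # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1    # position of the first 1 bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def union(self, other):
        """Merge two sketches: the register-wise max equals the sketch of the union."""
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = bytes(max(a, b) for a, b in zip(self.registers, other.registers))
        return TinyHLL(self.p, merged)

    def cardinality(self):
        alpha = 0.7213 / (1.0 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)    # small-range correction
        return raw

    def to_bytes(self):
        """Registers as bytes, e.g. to store as a cell value and reload later."""
        return bytes(self.registers)


def intersection_estimate(a, b):
    """Inclusion-exclusion: |A| + |B| - |A UNION B|."""
    return a.cardinality() + b.cardinality() - a.union(b).cardinality()


# Two ingest "segments" sketched independently (synthetic integer keys).
a, b = TinyHLL(), TinyHLL()
for i in range(40000):
    a.add(i)
for i in range(20000, 60000):
    b.add(i)

# Simulate a restart: restore the first sketch from its saved bytes.
restored = TinyHLL(12, a.to_bytes())
union_est = restored.union(b).cardinality()
inter_est = intersection_estimate(restored, b)
print(round(union_est), round(inter_est))
```

The true values here are 60,000 for the union and 20,000 for the intersection. The intersection estimate carries the error of three separate estimates, so its error is relative to the size of the union and grows when the overlap is small.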
Union is straightforward, whereas intersection is estimated as |FIELD_1| + |FIELD_2| - |FIELD_1 UNION FIELD_2|.
On Fri, May 16, 2014 at 9:17 PM, Marc Parisi <[EMAIL PROTECTED]> wrote:
I'm thinking maybe your mappings could be like this: group=anything, type=NAME, name=John(etc...)
Perhaps a ColumnQualifierGrouping iterator could be applied at scan time to add up the cardinalities for the qualifiers over the time range being scanned, with cardinalities across different time units aggregated client side.
On Fri, May 16, 2014 at 5:19 PM, David Medinets <[EMAIL PROTECTED]> wrote:
This project is something to occupy me in my spare time, and it's intended to explore aspects of Accumulo that I haven't needed to use yet. In the past, I simply ran a map-reduce job using the word-counting technique.
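For comparison, the word-counting approach can be sketched in a few lines. This is a single-process stand-in for a map-reduce job, with made-up sample records and a hypothetical helper `count_keys`. The point is that the exact method keeps one entry per unique key.

```python
from collections import Counter


def count_keys(records):
    """Emit each key once per record and sum the counts, as in word count."""
    counts = Counter()
    for key in records:
        counts[key] += 1
    return counts


records = ["John", "Mary", "John", "Ann", "Mary", "John"]  # sample data only
counts = count_keys(records)
exact_cardinality = len(counts)  # number of distinct keys seen
print(exact_cardinality)
```

Memory and runtime grow with the number of distinct keys, which is why this approach fits a batch job better than an on-the-fly lookup.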
tl;dr - The expected number of unique keys would be in the millions, too large to calculate on the fly for a web application.
On Fri, May 16, 2014 at 6:04 PM, Corey Nolet <[EMAIL PROTECTED]> wrote: