Dealing with large data sets in client
Hello,

I have timeseries data; most rows have anywhere from 10 to a few thousand
columns, but outliers can have a million or more.  Each column holds some
integer value (a counter), and the qualifier is an integer identifier.  On
the client side, I want to scan from startDate to endDate, add up the total
values for each identifier, sort the aggregated values, and return the top
X (pagination).  We do this using a map since many identifiers may
intersect, but not all will.  This works fine for the majority of our
users, but for those outliers we end up running out of memory.  Since we
know the columns are sorted in each row, we could save memory by stepping
through the columns for each returned row together, and keep a list of the
top X as we add them up.  The problem with this is that the Scan API does
not give us access to the data in this way: you must always get the next
row, then batch through the columns for that row, then move on to the
next row.
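
For reference, the aggregation pattern described above looks roughly like this
(a minimal sketch against the HBase 1.x client API; the table, family, and
row-key handling are placeholders, not our actual schema):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NaiveAggregation {

        // Sum counter values per integer identifier (column qualifier) across
        // all rows between startKey and stopKey, entirely in client memory.
        static Map<Integer, Long> aggregate(Table table, byte[] startKey,
                                            byte[] stopKey, byte[] family)
                throws IOException {
            Scan scan = new Scan();
            scan.setStartRow(startKey);   // e.g. derived from startDate
            scan.setStopRow(stopKey);     // e.g. derived from endDate
            scan.addFamily(family);

            Map<Integer, Long> totals = new HashMap<>();
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // Every cell of the row is materialized here; for an outlier
                    // row with a million-plus columns this is where memory blows up.
                    for (Cell cell : row.rawCells()) {
                        int identifier = Bytes.toInt(CellUtil.cloneQualifier(cell));
                        long value = Bytes.toInt(CellUtil.cloneValue(cell));
                        totals.merge(identifier, value, Long::sum);
                    }
                }
            }
            return totals;  // caller then sorts and keeps the top X
        }
    }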

Has anyone dealt with this kind of use case, and is there any way we can
implement the above read pattern with the current API, or otherwise step
through the data?  I imagine it isn't a great idea to create a ton of scans
(one for each row), which is the only way I can think of to do the above with
what we have.
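
For what it's worth, the per-row column batching mentioned above can be
expressed with Scan.setBatch() (a sketch only, same placeholder names as
before).  It bounds how many cells the client holds per Result, but it still
walks one row at a time, and the per-identifier totals map can still grow with
the number of distinct qualifiers, which is exactly the limitation in question:

    import org.apache.hadoop.hbase.client.Scan;

    public class BatchedScanConfig {

        // Same aggregation loop as the sketch above; only the Scan setup changes.
        static Scan buildBatchedScan(byte[] startKey, byte[] stopKey, byte[] family) {
            Scan scan = new Scan();
            scan.setStartRow(startKey);
            scan.setStopRow(stopKey);
            scan.addFamily(family);
            // Cap the number of cells delivered per Result: a row with a million
            // columns arrives as a series of partial Results of at most 10,000
            // cells each, rather than one Result holding the entire row.
            scan.setBatch(10_000);
            return scan;
        }
    }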

Thanks,

Bryan