I don't know the inner workings of RFiles well enough, but I was wondering if there is a faster way to get a unique list of columns in Accumulo (short of doing a full MapReduce). Is there some way to skip past all the values and just get to the next column?
There's not a single good way that I am aware of, but there are a couple ways that will get you close.
First, you can use the SortedKeyIterator to drop the values and potentially save yourself a lot of data transfer. Second, each RFile header block tracks the columns it contains, up to 1000 (possibly configurable). Check out PrintInfo to see what is recorded.
Nope, you're stuck with enumerating the data in the common case because Accumulo doesn't limit the number of colfams that you can create. But it is relatively easy to create a new table that tracks these for you (you can extend this to keeping occurrence counts too, for more fun).
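To make the tracking-table idea above concrete, here is a minimal sketch in plain Python. It does not use the Accumulo API: `TrackedTable`, `put`, and `families` are hypothetical names, and two in-memory structures stand in for the data table and the tracking table. In a real deployment the second write would presumably go to a separate Accumulo table, with the counts possibly aggregated by a combiner such as SummingCombiner, but that part is an assumption and is not shown here.

```python
from collections import Counter


class TrackedTable:
    """Hypothetical in-memory stand-in for a data table plus a tracking table."""

    def __init__(self):
        self.data = {}                  # (row, family, qualifier) -> value
        self.family_counts = Counter()  # family -> number of writes seen

    def put(self, row, family, qualifier, value):
        # Write the entry itself.
        self.data[(row, family, qualifier)] = value
        # Record the column family on every write. This is a blind
        # increment, so overwrites of the same key are counted again.
        self.family_counts[family] += 1

    def families(self):
        # The unique column families, read from the small tracking
        # structure without touching the data.
        return sorted(self.family_counts)


table = TrackedTable()
table.put("r1", "attr", "color", "red")
table.put("r1", "meta", "ts", "100")
table.put("r2", "attr", "color", "blue")
print(table.families())
print(table.family_counts["attr"])
```

The point of the design is that listing the columns reads only the tracking structure, whose size grows with the number of distinct families rather than with the amount of data.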
You can seek efficiently past each column in every row, but even done as efficiently as possible, that is typically still too slow.
On Feb 22, 2014 9:25 AM, "Arshak Navruzyan" <[EMAIL PROTECTED]> wrote:
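To illustrate the skip approach from the reply above, here is a minimal sketch that simulates it in plain Python rather than against Accumulo. A sorted list of (row, family, qualifier) tuples stands in for the sorted key space, and `bisect_left` stands in for a seek; `unique_families` and `example_keys` are hypothetical names. Python compares strings by code point, whereas Accumulo compares bytes, so treat the ordering details as an approximation.

```python
from bisect import bisect_left


def unique_families(keys):
    """Collect the column families in a sorted list of (row, family, qualifier).

    Returns the sorted unique families and the number of seeks performed.
    """
    seen = set()
    seeks = 0
    i = 0
    while i < len(keys):
        row, family, _ = keys[i]
        seen.add(family)
        # Appending "\0" gives the smallest string that sorts after
        # `family`, so this jumps past every remaining entry of
        # (row, family) to the next family or the next row.
        i = bisect_left(keys, (row, family + "\0", ""), i + 1)
        seeks += 1
    return sorted(seen), seeks


example_keys = sorted([
    ("r1", "attr", "a"), ("r1", "attr", "b"), ("r1", "attr", "c"),
    ("r1", "meta", "x"),
    ("r2", "attr", "a"),
    ("r2", "meta", "x"), ("r2", "meta", "y"),
    ("r2", "tags", "t"),
])

families, seeks = unique_families(example_keys)
print(families, seeks)
```

Note that the sketch still performs one seek per (row, family) pair, so the work grows with the number of rows. That is consistent with the remark that skipping helps but is typically still too slow on a large table.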
I can't help but wonder if maybe the problem you're trying to solve could be done in a different way (for example, when your RFiles are generated). What kinds of things are you trying to do with the enumeration of columns? If you're trying to do something like show these in a drop-down box in a web interface, the list could potentially be quite exhaustive... too big for even one machine to handle, in the general case. Except in very specific use cases, I can't imagine enumerating every column would be very useful. Perhaps yours is such a use case, but I wonder...