Accumulo >> mail # user >> filter on value ranges

Re: filter on value ranges
The best way is to build a separate numerical index on the salary field. The Accumulo table would look like this:

rowID    colfam  colqual    value
0040000  salary  employeeY  [blank]
0041000  salary  employeeJ  [blank]
0042000  salary  employeeV  [blank]
0043000  salary  employeeB  [blank]
0044000  salary  employeeR  [blank]
0045000  salary  employeeG  [blank]

where each column qualifier (e.g. 'employeeY') is the rowID of that employee in your main table.

A numerical index may need to deal with negative numbers and with arbitrarily large numbers. Depending on your needs, you'll have to transform your numbers into strings whose lexicographic sort order matches the numerical sort order you require. The above example uses zero-padding to seven digits, which handles neither negative numbers nor arbitrarily large ones (numbers over 9,999,999 will not sort correctly).

Let's call your transform function trans().
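
As a rough sketch of what trans() could look like for values that fit in a signed 64-bit long: flip the sign bit and print the result as 16 zero-padded hex digits. The class name SortableLongEncoder is made up for illustration and nothing here is Accumulo API. It covers negative numbers and anything up to Long.MAX_VALUE, but not truly arbitrary-precision numbers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of Accumulo.
class SortableLongEncoder {

    // Flipping the sign bit makes negative values come before positive ones
    // when the 64 bits are read as an unsigned number. Printing that number
    // as exactly 16 hex digits gives strings whose lexicographic order
    // matches the numerical order of the original longs.
    static String trans(long value) {
        return String.format("%016x", value ^ Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (long salary : new long[] {45000L, -12L, 0L, 10000000L, 40000L}) {
            rows.add(trans(salary));
        }
        // Plain String sorting, the same kind of ordering a sorted table applies to rowIDs.
        Collections.sort(rows);
        for (String row : rows) {
            System.out.println(row);
        }
    }
}
```

The trade-off is that these rowIDs are no longer human-readable the way 0042000 is, so you may want a matching decode function for debugging.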

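Then you can answer your query with a single scan of the index table, starting at trans(X) and ending at trans(Y). If employee names are used as the rowIDs of your main table, you're done: the column qualifiers returned are the names, and the index rowIDs are the (transformed) salaries.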

If the employee names are instead stored as values under a field in your main table, extract the column qualifiers from the keys returned, turn them into a list of ranges, and pass that list to a BatchScanner configured to scan your main table and fetch only the column family:qualifier under which the employee name is stored.
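
To make the two steps concrete, here is a small in-memory sketch. It does not use the Accumulo client API: a TreeMap stands in for the sorted index table, a plain Map stands in for the main table, and the class name, the "name" column, and the sample employee names are all invented for illustration. With real tables, step 1 would be a Scanner restricted to the range and step 2 a BatchScanner.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for the index table and the main table (hypothetical names throughout).
class SalaryRangeQuery {

    // Same 7-digit zero padding as the example index; only valid for 0..9,999,999.
    static String trans(int salary) {
        if (salary < 0 || salary > 9_999_999) {
            throw new IllegalArgumentException("salary out of range for 7-digit padding");
        }
        return String.format(Locale.ROOT, "%07d", salary);
    }

    // Step 1: range scan over the index from trans(x) to trans(y), inclusive.
    // Returns the column qualifiers, i.e. the rowIDs of the main table.
    static List<String> scanIndex(NavigableMap<String, List<String>> index, int x, int y) {
        List<String> rowIds = new ArrayList<>();
        if (x > y) {
            return rowIds; // empty range
        }
        for (List<String> qualifiers : index.subMap(trans(x), true, trans(y), true).values()) {
            rowIds.addAll(qualifiers);
        }
        return rowIds;
    }

    // Step 2: point lookups in the main table, fetching only the "name" column.
    static List<String> fetchNames(Map<String, Map<String, String>> mainTable, List<String> rowIds) {
        List<String> names = new ArrayList<>();
        for (String rowId : rowIds) {
            Map<String, String> columns = mainTable.get(rowId);
            if (columns != null && columns.containsKey("name")) {
                names.add(columns.get("name"));
            }
        }
        return names;
    }

    // Index rows from the example above: salary rowID -> employee rowIDs.
    static NavigableMap<String, List<String>> sampleIndex() {
        NavigableMap<String, List<String>> index = new TreeMap<>();
        String[] ids = {"employeeY", "employeeJ", "employeeV", "employeeB", "employeeR", "employeeG"};
        for (int i = 0; i < ids.length; i++) {
            index.computeIfAbsent(trans(40000 + 1000 * i), k -> new ArrayList<>()).add(ids[i]);
        }
        return index;
    }

    // Made-up main table: employee rowID -> columns.
    static Map<String, Map<String, String>> sampleMainTable() {
        String[][] rows = {
            {"employeeY", "Yusuf"}, {"employeeJ", "Jana"}, {"employeeV", "Vera"},
            {"employeeB", "Boris"}, {"employeeR", "Rosa"}, {"employeeG", "Greta"}
        };
        Map<String, Map<String, String>> mainTable = new HashMap<>();
        for (String[] row : rows) {
            Map<String, String> columns = new HashMap<>();
            columns.put("name", row[1]);
            mainTable.put(row[0], columns);
        }
        return mainTable;
    }

    public static void main(String[] args) {
        List<String> rowIds = scanIndex(sampleIndex(), 41500, 44000);
        System.out.println(fetchNames(sampleMainTable(), rowIds));
    }
}
```

One difference to keep in mind: this sketch returns names in salary order, whereas a real BatchScanner returns entries in no guaranteed order, so carry the salary along yourself if you need it next to each name.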

This is, admittedly, a pain. But it's doable and it scales.

On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote:

> In 1.4, is there a way to use built-in iterators to run the following query :
>   “get the name and salary of all employees where the salary is between X and Y”
> Assuming a straightforward schema where name and salary are both cq.
> I’d like both the cq restriction and the range predicate applied on the tservers.
> I see that Scanner.setColumnQualifierRegex would take care of the cq restriction. But I don’t know of a built-in iterator for the range predicate and I don’t know of how to compose those two iterators.
> Thanks,
> -Ameet Kini