|
|
Kini, Ameet M. 2012-03-09, 14:07
In 1.4, is there a way to use built-in iterators to run the following query : "get the name and salary of all employees where the salary is between X and Y"
Assuming a straightforward schema where name and salary are both cq.
I'd like both the cq restriction and the range predicate applied on the tservers.
I see that Scanner.setColumnQualifierRegex would take care of the cq restriction. But I don't know of a built-in iterator for the range predicate and I don't know of how to compose those two iterators.
Thanks, -Ameet Kini
-
Re: filter on value ranges
Aaron Cordova 2012-03-09, 14:18
The best way is to build a separate numerical index on the salary field. The Accumulo table would look like this:
rowID colfam colqual value
0040000 salary employeeY [blank] 0041000 salary employeeJ [blank] 0042000 salary employeeV [blank] 0043000 salary employeeB [blank] 0044000 salary employeeR [blank] 0045000 salary employeeG [blank] where 'employeeY' refers to the rowID of your main table.
A numerical index may need to deal with negative numbers, and with arbitrarily large numbers. Depending on your needs, you'll have to transform your numbers into strings that, when sorted lexicographically, reflect the proper numerical sort order you require. The above example uses 0-padding, which doesn't account for negative numbers or arbitrarily large numbers (i.e. numbers over 9,999,999 will not sort correctly).
Let's call your transform function trans()
Then you can answer your query via a single scan, starting at trans(X) and ending at trans(Y). If employee names are used as the rowID, you're done.
If the employee names are stored as values under a field in your main table, you extract the column qualifiers from the keys returned, and pass them in a List to a BatchScanner that is configured to scan your main table and retrieve the employee names, i.e. configured to retrieve just the column family:qualifier under which employee name is stored.
This is, admittedly, a pain. But it's doable and it scales.
On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote:
> > In 1.4, is there a way to use built-in iterators to run the following query : > “get the name and salary of all employees where the salary is between X and Y” > > Assuming a straightforward schema where name and salary are both cq. > > I’d like both the cq restriction and the range predicate applied on the tservers. > > I see that Scanner.setColumnQualifierRegex would take care of the cq restriction. But I don’t know of a built-in iterator for the range predicate and I don’t know of how to compose those two iterators. > > Thanks, > -Ameet Kini >
-
Re: filter on value ranges
Aaron Cordova 2012-03-09, 14:20
To answer your question, I would not use built-in iterators for this.
But if you were determined, you could use what is known as 'document sharding' as opposed to 'term sharding' and use an intersecting iterator.
Instructions on how to do this should be added to the manual ... On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote:
> > In 1.4, is there a way to use built-in iterators to run the following query : > “get the name and salary of all employees where the salary is between X and Y” > > Assuming a straightforward schema where name and salary are both cq. > > I’d like both the cq restriction and the range predicate applied on the tservers. > > I see that Scanner.setColumnQualifierRegex would take care of the cq restriction. But I don’t know of a built-in iterator for the range predicate and I don’t know of how to compose those two iterators. > > Thanks, > -Ameet Kini >
-
RE: filter on value ranges
Kini, Ameet M. 2012-03-09, 18:11
Thanks for the comments.
I'm ok with rolling my own iterator/filter but not sure how to go about doing it (see next para), so it'd be great to get pointers on it. I'd prefer keeping the schema to how it is today where each employee is represented by a row in the table with a properties cf containing name and salary cq. Here's how it looks today
rowID colfam colqual value
abc properties name john abc properties salary 10000 def properties name alice def properties salary 20000
Part of my confusion lies in not knowing how to implement this range filter class, because my query needs to get both the name as well as salary based on a particular salary. What I would like to do is something like a Filter equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value v) was provided the entire row in the Value argument alongwith appropriate encodeRow/decodeRow as in WholeRowIterator. If the accept method returns true, the whole row is returned to the client. Then I could extend this class by writing a MyRangeFilter which would look inside the row and make row level accept/reject decisions based on values of particular cq.
Maybe this WholeRowFilter is already there in some form?
-Ameet Kini
From: Aaron Cordova [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]> Sent: Friday, March 09, 2012 9:20 AM To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: Re: filter on value ranges
To answer your question, I would not use built-in iterators for this.
But if you were determined, you could use what is known as 'document sharding' as opposed to 'term sharding' and use an intersecting iterator.
Instructions on how to do this should be added to the manual ... On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote:
In 1.4, is there a way to use built-in iterators to run the following query : "get the name and salary of all employees where the salary is between X and Y"
Assuming a straightforward schema where name and salary are both cq.
I'd like both the cq restriction and the range predicate applied on the tservers.
I see that Scanner.setColumnQualifierRegex would take care of the cq restriction. But I don't know of a built-in iterator for the range predicate and I don't know of how to compose those two iterators.
Thanks, -Ameet Kini
-
Re: filter on value ranges
Keith Turner 2012-03-09, 19:48
The WholeRowIterator can filter rows, just override it and implement the filter function.
Also new in 1.4 is org.apache.accumulo.core.iterators.user.RowFilter. If provides similar functionality, but does not require reading the entire row into memory.
Keith
On Fri, Mar 9, 2012 at 1:11 PM, Kini, Ameet M. <[EMAIL PROTECTED]> wrote: > > > > > Thanks for the comments. > > > > I’m ok with rolling my own iterator/filter but not sure how to go about > doing it (see next para), so it’d be great to get pointers on it. I’d > prefer keeping the schema to how it is today where each employee is > represented by a row in the table with a properties cf containing name and > salary cq. Here’s how it looks today > > > > rowID colfam colqual value > > > > abc properties name john > > abc properties salary 10000 > > def properties name alice > > def properties salary 20000 > > > > Part of my confusion lies in not knowing how to implement this range filter > class, because my query needs to get both the name as well as salary based > on a particular salary. What I would like to do is something like a Filter > equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value > v) was provided the entire row in the Value argument alongwith appropriate > encodeRow/decodeRow as in WholeRowIterator. If the accept method returns > true, the whole row is returned to the client. Then I could extend this > class by writing a MyRangeFilter which would look inside the row and make > row level accept/reject decisions based on values of particular cq. > > > > Maybe this WholeRowFilter is already there in some form? > > > > -Ameet Kini > > > > From: Aaron Cordova [mailto:[EMAIL PROTECTED]] > Sent: Friday, March 09, 2012 9:20 AM > To: [EMAIL PROTECTED] > Subject: Re: filter on value ranges > > > > To answer your question, I would not use built-in iterators for this. > > > > But if you were determined, you could use what is known as 'document > sharding' as opposed to 'term sharding' and use an intersecting iterator. > > > > Instructions on how to do this should be added to the manual ... > > > > > > On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote: > > > > > > In 1.4, is there a way to use built-in iterators to run the following query > : > > “get the name and salary of all employees where the salary is between X > and Y” > > > > Assuming a straightforward schema where name and salary are both cq. > > > > I’d like both the cq restriction and the range predicate applied on the > tservers. > > > > I see that Scanner.setColumnQualifierRegex would take care of the cq > restriction. But I don’t know of a built-in iterator for the range predicate > and I don’t know of how to compose those two iterators. > > > > Thanks, > > -Ameet Kini > > > >
-
Re: filter on value ranges
Aaron Cordova 2012-03-09, 20:31
If I may make one more argument for not using a filter and using a separate table as a secondary index instead, keep in mind you'll have to scan over the entire table to perform this query, since the rows containing the values you're after may appear anywhere in the table, i.e. all your queries will take a long time.
For most users of Accumulo time is more precious than storage space (which keeps getting cheaper and more plentiful, unlike time), so creating a secondary index is the path usually chosen over full table scans. On Mar 9, 2012, at 2:48 PM, Keith Turner wrote:
> The WholeRowIterator can filter rows, just override it and implement > the filter function. > > Also new in 1.4 is org.apache.accumulo.core.iterators.user.RowFilter. > If provides similar functionality, but does not require reading the > entire row into memory. > > Keith > > On Fri, Mar 9, 2012 at 1:11 PM, Kini, Ameet M. <[EMAIL PROTECTED]> wrote: >> >> >> >> >> Thanks for the comments. >> >> >> >> I’m ok with rolling my own iterator/filter but not sure how to go about >> doing it (see next para), so it’d be great to get pointers on it. I’d >> prefer keeping the schema to how it is today where each employee is >> represented by a row in the table with a properties cf containing name and >> salary cq. Here’s how it looks today >> >> >> >> rowID colfam colqual value >> >> >> >> abc properties name john >> >> abc properties salary 10000 >> >> def properties name alice >> >> def properties salary 20000 >> >> >> >> Part of my confusion lies in not knowing how to implement this range filter >> class, because my query needs to get both the name as well as salary based >> on a particular salary. What I would like to do is something like a Filter >> equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value >> v) was provided the entire row in the Value argument alongwith appropriate >> encodeRow/decodeRow as in WholeRowIterator. If the accept method returns >> true, the whole row is returned to the client. Then I could extend this >> class by writing a MyRangeFilter which would look inside the row and make >> row level accept/reject decisions based on values of particular cq. >> >> >> >> Maybe this WholeRowFilter is already there in some form? >> >> >> >> -Ameet Kini >> >> >> >> From: Aaron Cordova [mailto:[EMAIL PROTECTED]] >> Sent: Friday, March 09, 2012 9:20 AM >> To: [EMAIL PROTECTED] >> Subject: Re: filter on value ranges >> >> >> >> To answer your question, I would not use built-in iterators for this. >> >> >> >> But if you were determined, you could use what is known as 'document >> sharding' as opposed to 'term sharding' and use an intersecting iterator. >> >> >> >> Instructions on how to do this should be added to the manual ... >> >> >> >> >> >> On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote: >> >> >> >> >> >> In 1.4, is there a way to use built-in iterators to run the following query >> : >> >> “get the name and salary of all employees where the salary is between X >> and Y” >> >> >> >> Assuming a straightforward schema where name and salary are both cq. >> >> >> >> I’d like both the cq restriction and the range predicate applied on the >> tservers. >> >> >> >> I see that Scanner.setColumnQualifierRegex would take care of the cq >> restriction. But I don’t know of a built-in iterator for the range predicate >> and I don’t know of how to compose those two iterators. >> >> >> >> Thanks, >> >> -Ameet Kini >> >> >> >>
-
Re: filter on value ranges
Keith Turner 2012-03-09, 20:49
Using a filter vs building an index depends on the circumstances. Are you going to ask the question multiple times? How much data will the query match? Etc. If 50% of the rows will match the filter criteria, do a full table scan is ok and you probably want to do that scan using map reduce. The map reduce job can use a filter to push computation to the tserver.
Keith
On Fri, Mar 9, 2012 at 3:31 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote: > If I may make one more argument for not using a filter and using a separate table as a secondary index instead, keep in mind you'll have to scan over the entire table to perform this query, since the rows containing the values you're after may appear anywhere in the table, i.e. all your queries will take a long time. > > For most users of Accumulo time is more precious than storage space (which keeps getting cheaper and more plentiful, unlike time), so creating a secondary index is the path usually chosen over full table scans. > > > On Mar 9, 2012, at 2:48 PM, Keith Turner wrote: > >> The WholeRowIterator can filter rows, just override it and implement >> the filter function. >> >> Also new in 1.4 is org.apache.accumulo.core.iterators.user.RowFilter. >> If provides similar functionality, but does not require reading the >> entire row into memory. >> >> Keith >> >> On Fri, Mar 9, 2012 at 1:11 PM, Kini, Ameet M. <[EMAIL PROTECTED]> wrote: >>> >>> >>> >>> >>> Thanks for the comments. >>> >>> >>> >>> I’m ok with rolling my own iterator/filter but not sure how to go about >>> doing it (see next para), so it’d be great to get pointers on it. I’d >>> prefer keeping the schema to how it is today where each employee is >>> represented by a row in the table with a properties cf containing name and >>> salary cq. Here’s how it looks today >>> >>> >>> >>> rowID colfam colqual value >>> >>> >>> >>> abc properties name john >>> >>> abc properties salary 10000 >>> >>> def properties name alice >>> >>> def properties salary 20000 >>> >>> >>> >>> Part of my confusion lies in not knowing how to implement this range filter >>> class, because my query needs to get both the name as well as salary based >>> on a particular salary. What I would like to do is something like a Filter >>> equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value >>> v) was provided the entire row in the Value argument alongwith appropriate >>> encodeRow/decodeRow as in WholeRowIterator. If the accept method returns >>> true, the whole row is returned to the client. Then I could extend this >>> class by writing a MyRangeFilter which would look inside the row and make >>> row level accept/reject decisions based on values of particular cq. >>> >>> >>> >>> Maybe this WholeRowFilter is already there in some form? >>> >>> >>> >>> -Ameet Kini >>> >>> >>> >>> From: Aaron Cordova [mailto:[EMAIL PROTECTED]] >>> Sent: Friday, March 09, 2012 9:20 AM >>> To: [EMAIL PROTECTED] >>> Subject: Re: filter on value ranges >>> >>> >>> >>> To answer your question, I would not use built-in iterators for this. >>> >>> >>> >>> But if you were determined, you could use what is known as 'document >>> sharding' as opposed to 'term sharding' and use an intersecting iterator. >>> >>> >>> >>> Instructions on how to do this should be added to the manual ... >>> >>> >>> >>> >>> >>> On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote: >>> >>> >>> >>> >>> >>> In 1.4, is there a way to use built-in iterators to run the following query >>> : >>> >>> “get the name and salary of all employees where the salary is between X >>> and Y” >>> >>> >>> >>> Assuming a straightforward schema where name and salary are both cq. >>> >>> >>> >>> I’d like both the cq restriction and the range predicate applied on the >>> tservers. >>> >>> >>> >>> I see that Scanner.setColumnQualifierRegex would take care of the cq >>> restriction. But I don’t know of a built-in iterator for the range predicate
-
RE: filter on value ranges
Kini, Ameet M. 2012-03-09, 23:46
Keith/Aaron,
All good points. For the short term, I'm ok with a full scan for this query. The main requirement at this point is to push all filtering down to the tablet server.
Not sure if I understand why map reduce is needed - how does the Scanner + Custom Filter approach fall short over map reduce?
-Ameet Kini -----Original Message----- From: Keith Turner [mailto:[EMAIL PROTECTED]] Sent: Friday, March 09, 2012 3:50 PM To: [EMAIL PROTECTED] Subject: Re: filter on value ranges
Using a filter vs building an index depends on the circumstances. Are you going to ask the question multiple times? How much data will the query match? Etc. If 50% of the rows will match the filter criteria, do a full table scan is ok and you probably want to do that scan using map reduce. The map reduce job can use a filter to push computation to the tserver.
Keith
On Fri, Mar 9, 2012 at 3:31 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote: > If I may make one more argument for not using a filter and using a separate table as a secondary index instead, keep in mind you'll have to scan over the entire table to perform this query, since the rows containing the values you're after may appear anywhere in the table, i.e. all your queries will take a long time. > > For most users of Accumulo time is more precious than storage space (which keeps getting cheaper and more plentiful, unlike time), so creating a secondary index is the path usually chosen over full table scans. > > > On Mar 9, 2012, at 2:48 PM, Keith Turner wrote: > >> The WholeRowIterator can filter rows, just override it and implement >> the filter function. >> >> Also new in 1.4 is org.apache.accumulo.core.iterators.user.RowFilter. >> If provides similar functionality, but does not require reading the >> entire row into memory. >> >> Keith >> >> On Fri, Mar 9, 2012 at 1:11 PM, Kini, Ameet M. <[EMAIL PROTECTED]> wrote: >>> >>> >>> >>> >>> Thanks for the comments. >>> >>> >>> >>> I'm ok with rolling my own iterator/filter but not sure how to go about >>> doing it (see next para), so it'd be great to get pointers on it. I'd >>> prefer keeping the schema to how it is today where each employee is >>> represented by a row in the table with a properties cf containing name and >>> salary cq. Here's how it looks today >>> >>> >>> >>> rowID colfam colqual value >>> >>> >>> >>> abc properties name john >>> >>> abc properties salary 10000 >>> >>> def properties name alice >>> >>> def properties salary 20000 >>> >>> >>> >>> Part of my confusion lies in not knowing how to implement this range filter >>> class, because my query needs to get both the name as well as salary based >>> on a particular salary. What I would like to do is something like a Filter >>> equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value >>> v) was provided the entire row in the Value argument alongwith appropriate >>> encodeRow/decodeRow as in WholeRowIterator. If the accept method returns >>> true, the whole row is returned to the client. Then I could extend this >>> class by writing a MyRangeFilter which would look inside the row and make >>> row level accept/reject decisions based on values of particular cq. >>> >>> >>> >>> Maybe this WholeRowFilter is already there in some form? >>> >>> >>> >>> -Ameet Kini >>> >>> >>> >>> From: Aaron Cordova [mailto:[EMAIL PROTECTED]] >>> Sent: Friday, March 09, 2012 9:20 AM >>> To: [EMAIL PROTECTED] >>> Subject: Re: filter on value ranges >>> >>> >>> >>> To answer your question, I would not use built-in iterators for this. >>> >>> >>> >>> But if you were determined, you could use what is known as 'document >>> sharding' as opposed to 'term sharding' and use an intersecting iterator. >>> >>> >>> >>> Instructions on how to do this should be added to the manual ... >>> >>> >>> >>> >>> >>> On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote: >>> >>> >>> >>> >>> >>> In 1.4, is there a way to use built-in iterators to run the following query
-
Re: filter on value ranges
Keith Turner 2012-03-10, 00:15
The scanner is reading the data sequentially and pushing the filtering out to tablet servers sequentially. You can use a batch scanner to initiate parallel filtering on tablets server, however if a lot data would be returned use map reduce.
On Fri, Mar 9, 2012 at 6:46 PM, Kini, Ameet M. <[EMAIL PROTECTED]> wrote: > > > Keith/Aaron, > > All good points. For the short term, I'm ok with a full scan for this query. The main requirement at this point is to push all filtering down to the tablet server. > > Not sure if I understand why map reduce is needed - how does the Scanner + Custom Filter approach fall short over map reduce? > > -Ameet Kini > > > -----Original Message----- > From: Keith Turner [mailto:[EMAIL PROTECTED]] > Sent: Friday, March 09, 2012 3:50 PM > To: [EMAIL PROTECTED] > Subject: Re: filter on value ranges > > Using a filter vs building an index depends on the circumstances. Are > you going to ask the question multiple times? How much data will the > query match? Etc. If 50% of the rows will match the filter > criteria, do a full table scan is ok and you probably want to do that > scan using map reduce. The map reduce job can use a filter to push > computation to the tserver. > > Keith > > On Fri, Mar 9, 2012 at 3:31 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote: >> If I may make one more argument for not using a filter and using a separate table as a secondary index instead, keep in mind you'll have to scan over the entire table to perform this query, since the rows containing the values you're after may appear anywhere in the table, i.e. all your queries will take a long time. >> >> For most users of Accumulo time is more precious than storage space (which keeps getting cheaper and more plentiful, unlike time), so creating a secondary index is the path usually chosen over full table scans. >> >> >> On Mar 9, 2012, at 2:48 PM, Keith Turner wrote: >> >>> The WholeRowIterator can filter rows, just override it and implement >>> the filter function. >>> >>> Also new in 1.4 is org.apache.accumulo.core.iterators.user.RowFilter. >>> If provides similar functionality, but does not require reading the >>> entire row into memory. >>> >>> Keith >>> >>> On Fri, Mar 9, 2012 at 1:11 PM, Kini, Ameet M. <[EMAIL PROTECTED]> wrote: >>>> >>>> >>>> >>>> >>>> Thanks for the comments. >>>> >>>> >>>> >>>> I'm ok with rolling my own iterator/filter but not sure how to go about >>>> doing it (see next para), so it'd be great to get pointers on it. I'd >>>> prefer keeping the schema to how it is today where each employee is >>>> represented by a row in the table with a properties cf containing name and >>>> salary cq. Here's how it looks today >>>> >>>> >>>> >>>> rowID colfam colqual value >>>> >>>> >>>> >>>> abc properties name john >>>> >>>> abc properties salary 10000 >>>> >>>> def properties name alice >>>> >>>> def properties salary 20000 >>>> >>>> >>>> >>>> Part of my confusion lies in not knowing how to implement this range filter >>>> class, because my query needs to get both the name as well as salary based >>>> on a particular salary. What I would like to do is something like a Filter >>>> equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value >>>> v) was provided the entire row in the Value argument alongwith appropriate >>>> encodeRow/decodeRow as in WholeRowIterator. If the accept method returns >>>> true, the whole row is returned to the client. Then I could extend this >>>> class by writing a MyRangeFilter which would look inside the row and make >>>> row level accept/reject decisions based on values of particular cq. >>>> >>>> >>>> >>>> Maybe this WholeRowFilter is already there in some form? >>>> >>>> >>>> >>>> -Ameet Kini >>>> >>>> >>>> >>>> From: Aaron Cordova [mailto:[EMAIL PROTECTED]] >>>> Sent: Friday, March 09, 2012 9:20 AM >>>> To: [EMAIL PROTECTED] >>>> Subject: Re: filter on value ranges >>>> >>>> >
|
|