-
WholeRowIterator, BatchScanner, and fetchColumnFamily don't play well together?
John Armstrong 2012-07-09, 16:42
Hi everybody.
I've run across an unexpected behavior when using WholeRowIterator on a BatchScanner. In case it matters, we're using cloudbase-1.3.4.
When I tell it to fetchColumnFamily(new Text("foo")) I get no results back, though there are definitely records in that column family and in the row ranges I'm scanning. This doesn't happen when I use a scanner on that column family, though in that case I'm scanning over the entire table.
To be more explicit, some constants:
    List<Range> ranges = new ArrayList<Range>();
    ranges.add(new Range(new Text("bar")));
    Text CF = new Text("foo");
getNewScanner() and getNewBatchScanner() create scanners for the appropriate table name, authorization, and number of threads.
    BatchScanner batchScanner = getNewBatchScanner();
    batchScanner.fetchColumnFamily(CF);
    batchScanner.setRanges(ranges);
returns all the entries in row "bar" and column family "foo".
    BatchScanner batchScanner = getNewBatchScanner();
    batchScanner.fetchColumnFamily(CF);
    batchScanner.setScanIterators(1, WholeRowIterator.class.getName(), UUID.randomUUID().toString());
    batchScanner.setRanges(ranges);
returns nothing.
    BatchScanner batchScanner = getNewBatchScanner();
    batchScanner.setScanIterators(1, WholeRowIterator.class.getName(), UUID.randomUUID().toString());
    batchScanner.setRanges(ranges);
returns an encoded entry containing all the entries in row "bar".
    Scanner scanner = getNewScanner();
    scanner.fetchColumnFamily(CF);
    scanner.setScanIterators(1, WholeRowIterator.class.getName(), UUID.randomUUID().toString());
returns encoded entries containing all the entries in column family "foo", one for each row that contains anything in that column family.
So, why does the second case return nothing?
TIA
-
Re: WholeRowIterator, BatchScanner, and fetchColumnFamily don't play well together?
John Armstrong 2012-07-09, 16:54
On 07/09/2012 12:42 PM, John Armstrong wrote:
> I've run across an unexpected behavior when using WholeRowIterator on a
> BatchScanner.
addendum: This does not seem to happen when using the MockCloudbase framework, so it didn't show up in my unit tests. If this is intentional behavior, MockCloudbase should probably behave the same way as the real thing, no?
-
Re: WholeRowIterator, BatchScanner, and fetchColumnFamily don't play well together?
John Vines 2012-07-09, 17:00
We have an open bug report about how Mock handles the reuse of Values. I'm wondering if perhaps there is a conflict there with BatchScanner in the same vein. More likely though, it's probably a case of BatchScanner behaving like a Scanner in Mock due to the non-distributed nature of Mock. I'm not really sure how we could expand the MockAccumulo framework to be MockParallel though.
John
-
Re: WholeRowIterator, BatchScanner, and fetchColumnFamily don't play well together?
John Armstrong 2012-07-09, 17:14
On 07/09/2012 01:00 PM, John Vines wrote:
> We have an open bug report about how Mock handles the reuse of Values.
> I'm wondering if perhaps there is a conflict there with BatchScanner in
> the same vein. More likely though, it's probably a case of BatchScanner
> behaving like a Scanner in Mock due to the non-distributed nature of
> Mock. I'm not really sure how we could expand the MockAccumulo framework
> to be MockParallel though.
Okay, I understand if that's hard to get exactly the same behavior in mock and production environments. But is the behavior I'm seeing in production the expected one?
-
Re: WholeRowIterator, BatchScanner, and fetchColumnFamily don't play well together?
Billie J Rinaldi 2012-07-09, 18:00
On Monday, July 9, 2012 1:14:09 PM, "John Armstrong" <[EMAIL PROTECTED]> wrote:
> On 07/09/2012 01:00 PM, John Vines wrote:
> > We have an open bug report about how Mock handles the reuse of Values.
> > I'm wondering if perhaps there is a conflict there with BatchScanner in
> > the same vein. More likely though, it's probably a case of BatchScanner
> > behaving like a Scanner in Mock due to the non-distributed nature of
> > Mock. I'm not really sure how we could expand the MockAccumulo framework
> > to be MockParallel though.
>
> Okay, I understand if that's hard to get exactly the same behavior in
> mock and production environments. But is the behavior I'm seeing in
> production the expected one?
That does seem unusual. Adam and Keith are looking into it.
Billie
-
Re: WholeRowIterator, BatchScanner, and fetchColumnFamily don't play well together?
Adam Fuchs 2012-07-09, 18:59
John,
This was a fun one, but we figured it out. Thanks for providing code -- that helped a lot. The quick workaround is to set the priority of the WholeRowIterator to 21, above the VersioningIterator. Turns out the two iterators are not commutative, so order matters.
Solution: when you set up your WholeRowIterator, use priority 21 or greater:

    batchScanner.setScanIterators(21 /* voila */, WholeRowIterator.class.getName(), UUID.randomUUID().toString());
Here's what's happening:
First, a little background on notation. When you scan for a range including the row "bar", that range can be notated (bear with me):

    [bar : [] 9223372036854775807 false,bar%00; : [] 9223372036854775807 false)

This shows two complete keys, separated by a comma, with empty column family, column qualifier, column visibility, a MAX_LONG timestamp, and delete flag set to false. The second key has a row that is the same as the first, but with a 0 byte value added on (%00;). In this notation, the [ means the left key is inclusive, and the ) means that the right key is exclusive.
In your query, you added a column family filter, so we (the Accumulo client library) got tricky and narrowed your range in addition to doing filtering. Here's what the narrowed range looks like:

    [bar foo: [] 9223372036854775807 false,bar foo%00;: [] 9223372036854775807 false)

You can see the column family foo specified on the left, and foo%00; specified on the right side. This selects only the entries in the foo column family within the row bar.
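The effect of that narrowing can be illustrated with a toy model. The sketch below does not use the Accumulo or Cloudbase API: `ToyKey`, `ToyRange`, `wholeRow`, and `narrowToFamily` are hypothetical names, and keys are reduced to (row, column family) compared lexicographically, which is a simplifying assumption that ignores qualifier, visibility, and timestamp.

```java
// Toy model only: these are NOT Accumulo's Key and Range classes.
final class NarrowedRangeSketch {

    // A key reduced to (row, column family); a real key has more fields.
    static final class ToyKey implements Comparable<ToyKey> {
        final String row;
        final String family;

        ToyKey(String row, String family) {
            this.row = row;
            this.family = family;
        }

        @Override
        public int compareTo(ToyKey other) {
            int byRow = row.compareTo(other.row);
            return byRow != 0 ? byRow : family.compareTo(other.family);
        }
    }

    // Half-open range [start, end), matching the "[ ... )" notation.
    static final class ToyRange {
        final ToyKey start;
        final ToyKey end;

        ToyRange(ToyKey start, ToyKey end) {
            this.start = start;
            this.end = end;
        }

        boolean contains(ToyKey key) {
            return start.compareTo(key) <= 0 && key.compareTo(end) < 0;
        }
    }

    // [row, row + 0 byte): every key in the row.
    static ToyRange wholeRow(String row) {
        return new ToyRange(new ToyKey(row, ""), new ToyKey(row + "\0", ""));
    }

    // [row family, row family + 0 byte): only one column family of the row.
    static ToyRange narrowToFamily(String row, String family) {
        return new ToyRange(new ToyKey(row, family), new ToyKey(row, family + "\0"));
    }

    public static void main(String[] args) {
        ToyRange narrowed = narrowToFamily("bar", "foo");
        // An entry in column family "foo" of row "bar".
        System.out.println(narrowed.contains(new ToyKey("bar", "foo")));
        // A key in row "bar" with an empty column family.
        System.out.println(narrowed.contains(new ToyKey("bar", "")));
        // The same empty-family key against the un-narrowed row range.
        System.out.println(wholeRow("bar").contains(new ToyKey("bar", "")));
    }
}
```

In this model a key in row "bar" with an empty column family lies inside the whole-row range but sorts before the start of the narrowed range, which is the detail that matters for the rest of the explanation.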
When the VersioningIterator seeks to a range it does a couple of interesting things. First, it widens the range to include all of the possible versions of a key by setting the left-hand side timestamp to MAX_LONG. This is done to get an accurate count of the versions so that it knows which versions to skip. Second, it scans through the versions, skipping any key that sorts before the start of the range it was given. This way, you can seek directly to the nth version of a key and maintain a consistent last version. Skipping keys that don't fit in the range works great until we throw in an iterator that transforms keys, modifying their columns.
Enter the WholeRowIterator. The WholeRowIterator groups and encodes all key/value pairs in a row into a single key/value pair to guarantee isolation. This new key looks like:

    bar : [] 9223372036854775807 false

Effectively, we're taking the key in column family "foo" and moving it to column family "". This breaks the second interesting behavior of the VersioningIterator, which will skip over everything that is not in the narrowed range (including this key).
So, the conflict is actually the confluence of the WholeRowIterator, the VersioningIterator, and restricting the scan to a single-row range with a column filter (resulting in a range that is narrower than one row). This is also not specific to the BatchScanner. If you set the range of your Scanner to (new Range(new Text("bar"))), just like the BatchScanner, the Scanner will display the same behavior.
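The order dependence described above can be sketched as a toy pipeline. This is not the real iterator stack: `skipBeforeStart` and `encodeWholeRows` are hypothetical stand-ins for the relevant steps of the VersioningIterator and WholeRowIterator, keys are reduced to {row, column family} pairs, and the VersioningIterator priority of 20 is an assumption inferred from the "21 or greater" advice.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not the real Accumulo iterator stack.
final class IteratorOrderSketch {

    // Assumed priority of the VersioningIterator (implied by "21 or greater").
    static final int VERSIONING_PRIORITY = 20;

    // Compare {row, family} pairs lexicographically.
    static int compare(String[] a, String[] b) {
        int byRow = a[0].compareTo(b[0]);
        return byRow != 0 ? byRow : a[1].compareTo(b[1]);
    }

    // Stand-in for the VersioningIterator's skip step: drop every key that
    // sorts before the start of the range (a filter is equivalent to the
    // skip loop here because the input is sorted).
    static List<String[]> skipBeforeStart(List<String[]> keys, String[] start) {
        List<String[]> kept = new ArrayList<>();
        for (String[] key : keys) {
            if (compare(key, start) >= 0) {
                kept.add(key);
            }
        }
        return kept;
    }

    // Stand-in for the WholeRowIterator: collapse each row's entries into a
    // single key whose column family is empty.
    static List<String[]> encodeWholeRows(List<String[]> keys) {
        List<String[]> rows = new ArrayList<>();
        String lastRow = null;
        for (String[] key : keys) {
            if (!key[0].equals(lastRow)) {
                rows.add(new String[] {key[0], ""});
                lastRow = key[0];
            }
        }
        return rows;
    }

    // Number of results for row "bar", family "foo", given where the
    // whole-row step sits relative to the versioning step.
    static int resultCount(int wholeRowPriority) {
        List<String[]> data = new ArrayList<>();
        data.add(new String[] {"bar", "foo"});
        data.add(new String[] {"bar", "foo"});
        String[] narrowedStart = {"bar", "foo"};

        List<String[]> out;
        if (wholeRowPriority < VERSIONING_PRIORITY) {
            // Whole-row encoding runs first; its empty-family key is then skipped.
            out = skipBeforeStart(encodeWholeRows(data), narrowedStart);
        } else {
            // Skipping runs first on the original keys; encoding happens afterwards.
            out = encodeWholeRows(skipBeforeStart(data, narrowedStart));
        }
        return out.size();
    }

    public static void main(String[] args) {
        System.out.println("priority 1:  " + resultCount(1));
        System.out.println("priority 21: " + resultCount(21));
    }
}
```

In this model, priority 1 yields no results and priority 21 yields one encoded row, mirroring the reported behavior and the workaround.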
Cheers,
Adam
-
Re: WholeRowIterator, BatchScanner, and fetchColumnFamily don't play well together?
John Armstrong 2012-07-09, 19:03
On 07/09/2012 02:59 PM, Adam Fuchs wrote:
> This was a fun one, but we figured it out. Thanks for providing code --
> that helped a lot. The quick workaround is to set the priority of the
> WholeRowIterator to 21, above the VersioningIterator. Turns out the two
> iterators are not commutative, so order matters.
Thanks; that's enough for me to see what must be going on inside given what else I know, but it looks like a really great write-up for anyone else who runs into the same problem.
And I had a feeling that the Scanner might end up doing the same thing, but as I had a workaround already (though yours is clearly a better one) I figured I may as well fire off what I'd seen to the list for any deeper analysis.
Thanks again for the quick work.