Accumulo >> mail # user >> Intersecting Iterators [SEC=UNCLASSIFIED]


Williamson, Luke MR 1 2013-08-14, 01:58
David Medinets 2013-08-14, 02:44
Williamson, Luke MR 1 2013-08-14, 04:50
Re: Intersecting Iterators [SEC=UNCLASSIFIED]
Usually the intersecting iterator is used when you're modeling a
document-partitioned table. That is, you have relatively few row values
compared to the number of documents you're storing (on the order of hundreds
to millions of documents in a single row). It looks like you have a single
row for each document, with field indices stored in the same row as the
document.

What I might suggest is something like:

Row: date
ColumnFamily (a): fi||field||data
ColumnQualifier (a): document-id
ColumnFamily (b): document-id
ColumnQualifier (b): field||data
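
A minimal sketch of how one document's fields could be turned into entries under this layout. It is plain Java with no Accumulo dependency; `LayoutSketch`, `Entry`, and `entriesFor` are hypothetical names used only for illustration, and the `noun`/`hill` field is made-up sample data. The `fi` prefix and `||` separator are taken from the thread. In a real ingest, each triple would presumably become one put on a Mutation for the row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: models the suggested key layout with plain strings.
class LayoutSketch {

    // One (row, column family, column qualifier) triple.
    static final class Entry {
        final String row, cf, cq;
        Entry(String row, String cf, String cq) { this.row = row; this.cf = cf; this.cq = cq; }
        @Override public String toString() { return row + " " + cf + " " + cq; }
    }

    // Builds the index entry (a) and the document entry (b) for every field.
    static List<Entry> entriesFor(String date, String docId, Map<String, String> fields) {
        List<Entry> out = new ArrayList<>();
        for (Map.Entry<String, String> f : fields.entrySet()) {
            String fieldData = f.getKey() + "||" + f.getValue();
            out.add(new Entry(date, "fi||" + fieldData, docId)); // (a) field index
            out.add(new Entry(date, docId, fieldData));          // (b) document data
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("verb", "run");
        fields.put("noun", "hill"); // made-up sample field
        for (Entry e : entriesFor("20130814", "550e8400-e29b-41d4-a716-446655440000", fields)) {
            System.out.println(e);
        }
    }
}
```

All documents for a given date share one row here, so the row count stays small relative to the document count.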

I believe that having a 1:1 mapping between shards/rows and document IDs can
cause significant overhead when it comes to scanning, because the scan will be
constantly seeking within the same RFile blocks.
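
To illustrate why the row should hold many documents: the intersection is computed within a row, looking for document IDs that appear under every requested term. The following is a plain-Java model of that per-row computation, not the Accumulo iterator itself; `IntersectionSketch` and `intersect` are hypothetical names and the sample terms and document IDs are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of a per-row term intersection; not the Accumulo iterator.
class IntersectionSketch {

    // index: row -> (term column family -> sorted document IDs)
    static Map<String, TreeSet<String>> intersect(
            Map<String, Map<String, TreeSet<String>>> index, List<String> terms) {
        Map<String, TreeSet<String>> hits = new TreeMap<>();
        for (Map.Entry<String, Map<String, TreeSet<String>>> row : index.entrySet()) {
            TreeSet<String> docs = null;
            for (String term : terms) {
                TreeSet<String> ids = row.getValue().getOrDefault(term, new TreeSet<>());
                if (docs == null) {
                    docs = new TreeSet<>(ids); // start from the first term's documents
                } else {
                    docs.retainAll(ids);       // keep only documents that also have this term
                }
            }
            if (docs != null && !docs.isEmpty()) {
                hits.put(row.getKey(), docs);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Map<String, TreeSet<String>>> index = new TreeMap<>();
        Map<String, TreeSet<String>> day = new TreeMap<>();
        day.put("fi||verb||run", new TreeSet<>(Arrays.asList("doc1", "doc2")));
        day.put("fi||noun||hill", new TreeSet<>(Arrays.asList("doc2", "doc3")));
        index.put("20130814", day);
        System.out.println(intersect(index, Arrays.asList("fi||verb||run", "fi||noun||hill")));
    }
}
```

With one row per document, every row contributes at most one candidate, so the work is spread over as many rows as there are documents instead of a handful of large rows.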
On Wed, Aug 14, 2013 at 12:50 AM, Williamson, Luke MR 1 <
[EMAIL PROTECTED]> wrote:

> UNCLASSIFIED
>
> I have tried increasing the number of threads and it seems to guarantee
> that it will return before it hits the timeout but it is taking approx. 7
> minutes to complete. Looking at the accumulo manager page it appears that
> all the tablet servers get equally hit (around 16 per node) and start to
> return but a couple of tablet servers take longer than the others. This
> behaviour was indicated to potentially happen in the doco but I was hoping
> it wouldn't be taking this long.
>
> ________________________________
>
> From: David Medinets [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, 14 August 2013 12:45
> To: accumulo-user
> Subject: Re: Intersecting Iterators [SEC=UNCLASSIFIED]
>
>
> I'm wondering about the 20 threads in the BatchScanner. Have you played
> with increasing it? I've seen that number go above 15 per accumulo node.
> Are you seeing the scans in the Accumulo monitor? Are the scans progressing
> through the Accumulo nodes?
>
>
> On Tue, Aug 13, 2013 at 9:58 PM, Williamson, Luke MR 1 <
> [EMAIL PROTECTED]> wrote:
>
>
>         UNCLASSIFIED
>
>         Hi,
>
>         I have field indexes that look something like
>
>         Row Id: <date>-<UUID>
>         CF: fi||<type>||<value>
>         CQ: <date>-<UUID>
>
>         For example:
>
>         20130814-550e8400-e29b-41d4-a716-446655440000 fi||verb||run 20130814-550e8400-e29b-41d4-a716-446655440000
>         20130814-550e8400-e29b-41d4-a716-446655440000 page||58 line||16 "the boy can run up the hill"
>
>         From what I could determine from the doco and API I am executing
> the following code to perform an intersecting query on two values...
>
>         Set<Range> shards = new HashSet<Range>();
>
>         Text[] terms = {new Text("fi||<type>||<value>"), new Text("fi||<type>||<value>")};
>
>         BatchScanner bs = conn.createBatchScanner(table, auths, 20);
>         bs.setTimeout(360, TimeUnit.SECONDS);
>
>         IteratorSetting iter = new IteratorSetting(20, "ii", IntersectingIterator.class);
>         IntersectingIterator.setColumnFamilies(iter, terms);
>         bs.addScanIterator(iter);
>
>         bs.setRanges(Collections.singleton(new Range()));
>
>         for (Entry<Key,Value> entry : bs) {
>             shards.add(new Range(entry.getKey().getColumnQualifier()));
>         }
>
>         I then perform a second batch scan using the set of ranges
> returned by the above to get my actual results.
>
>         My issue is that the intersecting query takes several minutes to
> return if at all (in some cases it times out). Is this expected? Is there
> some way to improve performance? Is there a better way to do this sort of
> query?
>
>         Any guidance would be much appreciated.
>
>         Thanks
>
>         Luke
>
>
>         IMPORTANT: This email remains the property of the Department of
> Defence and is subject to the jurisdiction of section 70 of the Crimes Act
> 1914. If you have received this email in error, you are requested to
> contact the sender and delete the email.
>