Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Accumulo >> mail # user >> How to efficiently compute a list-ranking of rows in Accumulo?


Copy link to this message
-
Re: How to efficiently compute a list-ranking of rows in Accumulo?
I think you can do this in one map round, with something like the following.

class MyMapper{
  Text mapperId; // an id that uniquley identifies this mapper
  Text firstRow;
  long count = 0;

  //map called once per row
  void mapper(Text row,  Context context){

     Mutation m = new Mutation(row);
     m.put("count",mapperId, count.toString());
     count++;

     context.write(m);

     if(firstRow == null){
        firstRow = row;
     }

  }

  close(){
     //write to a secondary table that keeps a little bit of info
about each map range
     BatchWriter bw = conn.createBatchWriter("mapInfoTable");
     Mutation m = new Mutation(mapperId);

     m.put("rangeInfo","totalCount",count.toString());
     m.put("rangeInfo","firstRow", firstRow);

    bw.add(m);
    bw.close();
  }
}
After running the above you can do some post processing on the
mapInfoTable and determine what the absolute offset of each mapper
was.  When you scan the table you can lookup up the absolute offset of
a mapper id and add this to the relative count stored in the table.
You can cache the lookups into the mapInfoTable.  This approach makes
no assumptions about underlying tablets.  Its completely based on the
ranges used to start mappers.

Keith

On Fri, Oct 12, 2012 at 11:38 AM, Jeff Kubina <[EMAIL PROTECTED]> wrote:
> I have an Accumulo table and I want to efficiently compute a
> list-ranking of the rowIds, more precisely, let r(i) for i=0 to n-1 be
> the ordered (sorted) rowids in the table, I want to create a column,
> say numId, such that for each row r(i) the value in column numId is i.
>
> I would like to do this in one map round. To do so I need to get the
> total number of rows in each tablet and an id for the tablet that has
> an ordering (sorting) consistent with the rowIds ordering. Assume
> there are t tablets and (tabId(j),size(j)) for j=0 to t-1 is the
> tablet id and row size, then I can compute a prefix sum of size(j),
> such as (tabId(0), 0), (tabId(1), size(0), (tabId(2),
> size(0)+size(1)), and so on; let offset(j) be the offset computed for
> tabId(j). So I can then use t mappers to set the numId in the rows to
> offset(j), offset(j) + 1, …
>
> I can do all of this in two map rounds by using the first rowId in
> each tablet as the tabId and simply counting the number of rows, but I
> was hoping to avoid having to do this.
>
> Can I get the size of each tablet and an id for it consistent with the
> row ordering? If so, how?
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB