HBase >> mail # user >> Map Reduce with multiple scans


Re: Map Reduce with multiple scans
Nick, if he doesn't specify a startKey and endKey on the Scan object and
instead delegates that to a Filter, won't this scan be sent to *all* regions
in the cluster instead of just one or two?
On Tue, Feb 26, 2013 at 10:12 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> Hi Paul,
>
> You want to run multiple scans so that you can filter the previous scan
> results? Am I correct in my understanding of your objective?
>
> First, I suggest you use the PrefixFilter [0] instead of constructing the
> rowkey prefix manually. This looks something like:
>
> byte[] md5Key = Utils.md5( "2013-01-07" );
> Scan scan = new Scan(md5Key);
> scan.setFilter(new PrefixFilter(md5Key));
>
> Yes, that's a bit redundant, but setting the startkey explicitly will save
> you some unnecessary processing.
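[Editor's note: the start/stop-row arithmetic behind this advice can be sketched in plain Java, with no HBase dependency. `PrefixRange` and `prefixStopRow` are hypothetical names for illustration, not HBase API; `md5` stands in for the `Utils.md5` used in the thread.]

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class PrefixRange {

    // MD5 of a UTF-8 string (16 bytes) -- stands in for Utils.md5 in the thread.
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // Exclusive stop row for a prefix scan: copy the prefix and increment its
    // last byte. (A production helper must also carry when the prefix ends in
    // 0xFF; with MD5 prefixes that case is rare but possible.)
    static byte[] prefixStopRow(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        stop[stop.length - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = md5("2013-01-07");
        byte[] stop = prefixStopRow(start);
        // Every rowkey beginning with the 16-byte prefix sorts in [start, stop),
        // so a Scan bounded by (start, stop) touches only the relevant regions.
        System.out.println(start.length + " / " + stop.length);
    }
}
```

Setting the start row (and, as here, a stop row) is what lets HBase skip regions entirely; a PrefixFilter alone still visits every region.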
>
> This map reduce job works fine but this is just one scan job for this map
> > reduce task. What do I have to do to pass multiple scans?
>
>
> Do you mean processing on multiple dates? In that case, what you really
> want is a full (unbounded) table scan. Since date is the first part of your
> compound rowkey, there's no prefix and no need for a filter, just use new
> Scan().
>
> In general, you can use multiple filters in a given Scan (or Get). See the
> FilterList [1] for details.
>
> Does this help?
> Nick
>
> [0]: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PrefixFilter.html
> [1]: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html
>
> On Tue, Feb 26, 2013 at 5:41 AM, Paul van Hoven <
> [EMAIL PROTECTED]> wrote:
>
> > My rowkeys look something like this:
> >
> > md5( date ) + md5( ip address )
> >
> > So an example would be
> > md5( "2013-02-08") + md5( "192.168.187.2")
> >
> > For one particular date I got several rows. Now I'd like to query
> > different dates, for example "2013-01-01" and "2013-02-01" and some
> > other. Additionally I'd like to perform this or these scans in a map
> > reduce job.
> >
> > Currently my map reduce job looks like this:
> >
> > Configuration config = HBaseConfiguration.create();
> > Job job = new Job(config,"ToyJob");
> > job.setJarByClass( PlayWithMapReduce.class );
> >
> > byte[] md5Key = Utils.md5( "2013-01-07" );
> > int md5Length = 16;
> > int longLength = 8;
> >
> > byte[] startRow = Bytes.padTail( md5Key, longLength ); // append 8 zero bytes
> > byte[] endRow = Bytes.padTail( md5Key, longLength );
> > endRow[md5Length-1]++; // increment the last byte of the md5 prefix (exclusive end)
> >
> > Scan scan = new Scan( startRow, endRow );
> > scan.setCaching(500);
> > scan.setCacheBlocks(false);
> >
> > Filter f = new SingleColumnValueFilter( Bytes.toBytes("CF"),
> > Bytes.toBytes("creativeId"), CompareOp.EQUAL, Bytes.toBytes("100") );
> > scan.setFilter(f);
> >
> > String tableName = "ToyDataTable";
> > TableMapReduceUtil.initTableMapperJob( tableName, scan, Mapper.class,
> > null, null, job);
> >
> > This map reduce job works fine but this is just one scan job for this
> > map reduce task. What do I have to do to pass multiple scans? Or do
> > you have any other suggestions on how to achieve that goal? The
> > constraint would be that it must be possible to combine it with map
> > reduce.
> >
>
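[Editor's note: if the dates of interest are not contiguous, one pragmatic route (an assumption, not something confirmed in this thread) is to compute one [startRow, stopRow) pair per date and drive one Scan per pair, either as separate jobs or, in later HBase versions, via the List&lt;Scan&gt; overload of TableMapReduceUtil.initTableMapperJob backed by MultiTableInputFormat. The range computation itself is plain Java; `DateRanges` and `ranges` are hypothetical names for illustration.]

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DateRanges {

    // MD5 of a UTF-8 string (16 bytes) -- stands in for Utils.md5 in the thread.
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // One [startRow, stopRow) pair per date; each pair would back its own Scan.
    static List<byte[][]> ranges(List<String> dates) {
        List<byte[][]> out = new ArrayList<byte[][]>();
        for (String date : dates) {
            byte[] start = md5(date);                        // 16-byte date prefix
            byte[] stop = Arrays.copyOf(start, start.length);
            stop[stop.length - 1]++;                         // exclusive end (0xFF carry ignored)
            out.add(new byte[][] { start, stop });
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte[][] r : ranges(Arrays.asList("2013-01-01", "2013-02-01"))) {
            System.out.println(r[0].length + " / " + r[1].length);
        }
    }
}
```

Each pair bounds a Scan to a single date's rowkey range, so the job scans only the regions holding those dates rather than the whole table.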