Paul van Hoven 2013-02-26, 13:41
Stack 2013-02-26, 20:12
Nick Dimiduk 2013-02-26, 20:12
Paul van Hoven 2013-02-27, 16:17
Enis Söztutar 2013-02-28, 02:57
Re: Map Reduce with multiple scans
Nick, if he doesn't specify startKey and endKey in the Scan object and
instead delegates that to a Filter, the scan will be sent to *all* regions
in the system, instead of just one or two, no?
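
To make that point concrete, here is a rough sketch in plain Java (no HBase dependency). It assumes a table with four hypothetical regions split on the first key byte and counts which regions overlap a scan's [startRow, stopRow) range. The class name, the region boundaries and the helper methods are all made up for illustration; this is not how HBase locates regions internally, only the same overlap rule written out.

```java
import java.util.ArrayList;
import java.util.List;

class RegionPruningSketch {

    /** Unsigned lexicographic comparison, the ordering HBase uses for row keys. */
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    /**
     * Returns the indexes of regions whose key range overlaps [startRow, stopRow).
     * regionStartKeys must be sorted; region i covers
     * [regionStartKeys[i], regionStartKeys[i + 1]), and the last region is open-ended.
     * An empty startRow or stopRow means "unbounded" on that side.
     */
    static List<Integer> regionsTouched(byte[][] regionStartKeys, byte[] startRow, byte[] stopRow) {
        List<Integer> touched = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.length; i++) {
            byte[] regionStart = regionStartKeys[i];
            byte[] regionEnd = (i + 1 < regionStartKeys.length) ? regionStartKeys[i + 1] : new byte[0];
            // The region begins before the scan ends (or the scan has no end).
            boolean startsBeforeStop = stopRow.length == 0 || compareRows(regionStart, stopRow) < 0;
            // The region ends after the scan begins (or one of the two is open-ended).
            boolean endsAfterStart = regionEnd.length == 0 || startRow.length == 0
                    || compareRows(regionEnd, startRow) > 0;
            if (startsBeforeStop && endsAfterStart) {
                touched.add(i);
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        // Hypothetical regions: [, 0x40) [0x40, 0x80) [0x80, 0xC0) [0xC0, )
        byte[][] regions = { {}, {0x40}, {(byte) 0x80}, {(byte) 0xC0} };
        byte[] start = {0x45, 0x00};
        byte[] stop = {0x46, 0x00};
        System.out.println("bounded:   " + regionsTouched(regions, start, stop));
        System.out.println("unbounded: " + regionsTouched(regions, new byte[0], new byte[0]));
    }
}
```

With these made-up boundaries the bounded range falls inside a single region, while the scan with no start and stop row overlaps all four, whatever filter is attached to it.
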
On Tue, Feb 26, 2013 at 10:12 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> Hi Paul,
>
> You want to run multiple scans so that you can filter the previous scan
> results? Am I correct in my understanding of your objective?
>
> First, I suggest you use the PrefixFilter [0] instead of constructing the
> rowkey prefix manually. This looks something like:
>
> byte[] md5Key = Utils.md5( "2013-01-07" );
> Scan scan = new Scan(md5Key);
> scan.setFilter(new PrefixFilter(md5Key));
>
> Yes, that's a bit redundant, but setting the startkey explicitly will save
> you some unnecessary processing.
>
> > This map reduce job works fine but this is just one scan job for this map
> > reduce task. What do I have to do to pass multiple scans?
>
>
> Do you mean processing on multiple dates? In that case, what you really
> want is a full (unbounded) table scan. Since date is the first part of your
> compound rowkey, there's no prefix and no need for a filter, just use new
> Scan().
>
> In general, you can use multiple filters in a given Scan (or Get). See the
> FilterList [1] for details.
>
> Does this help?
> Nick
>
> [0]: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PrefixFilter.html
> [1]: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html
>
> On Tue, Feb 26, 2013 at 5:41 AM, Paul van Hoven <[EMAIL PROTECTED]> wrote:
>
> > My rowkeys look something like this:
> >
> > md5( date ) + md5( ip address )
> >
> > So an example would be
> > md5( "2013-02-08") + md5( "192.168.187.2")
> >
> > For one particular date I got several rows. Now I'd like to query
> > different dates, for example "2013-01-01" and "2013-02-01" and some
> > other. Additionally I'd like to perform this or these scans in a map
> > reduce job.
> >
> > Currently my map reduce job looks like this:
> >
> > Configuration config = HBaseConfiguration.create();
> > Job job = new Job(config,"ToyJob");
> > job.setJarByClass( PlayWithMapReduce.class );
> >
> > byte[] md5Key = Utils.md5( "2013-01-07" );
> > int md5Length = 16;
> > int longLength = 8;
> >
> > byte[] startRow = Bytes.padTail( md5Key, longLength ); //append "0 0 0 0 0 0 0 0"
> > byte[] endRow = Bytes.padTail( md5Key, longLength );
> > endRow[md5Length-1]++; //last byte gets counted up
> >
> > Scan scan = new Scan( startRow, endRow );
> > scan.setCaching(500);
> > scan.setCacheBlocks(false);
> >
> > Filter f = new SingleColumnValueFilter( Bytes.toBytes("CF"),
> > Bytes.toBytes("creativeId"), CompareOp.EQUAL, Bytes.toBytes("100") );
> > scan.setFilter(f);
> >
> > String tableName = "ToyDataTable";
> > TableMapReduceUtil.initTableMapperJob( tableName, scan, Mapper.class,
> > null, null, job);
> >
> > This map reduce job works fine but this is just one scan job for this
> > map reduce task. What do I have to do to pass multiple scans? Or do
> > you have any other suggestions on how to achieve that goal? The
> > constraint would be that it must be possible to combine it with map
> > reduce.
> >
>
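
For the multiple-dates question, one possible starting point is to compute one [startRow, stopRow) pair per date and build a bounded scan from each pair. The sketch below is plain Java with no HBase dependency. It assumes that the thread's Utils.md5 is an ordinary MD5 over the UTF-8 bytes of the string, which the thread does not confirm; the class and method names are hypothetical. Unlike endRow[md5Length-1]++ in the original job, the stop row here drops trailing 0xFF bytes before incrementing, so a prefix ending in 0xFF does not wrap around to a stop row that sorts before the start row.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DatePrefixRanges {

    /** Assumed stand-in for the thread's Utils.md5: MD5 of the UTF-8 bytes (16 bytes). */
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Smallest key that sorts after every key starting with prefix:
     * drop trailing 0xFF bytes, then increment the last remaining byte.
     * Returns an empty array (meaning "unbounded") if the prefix is all 0xFF.
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    /** One pair per date: element 0 is the start row, element 1 the exclusive stop row. */
    static List<byte[][]> rangesFor(List<String> dates) {
        List<byte[][]> ranges = new ArrayList<>();
        for (String date : dates) {
            byte[] prefix = md5(date);
            ranges.add(new byte[][] { prefix, stopRowForPrefix(prefix) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (byte[][] r : rangesFor(Arrays.asList("2013-01-01", "2013-02-01"))) {
            System.out.println(Arrays.toString(r[0]) + " -> " + Arrays.toString(r[1]));
        }
    }
}
```

Each pair could then be passed to new Scan( startRow, endRow ) as in the job above. Whether a single job can take several scans depends on the HBase version: as far as I know, later releases provide MultiTableInputFormat and an initTableMapperJob overload that accepts a list of Scan objects, so it is worth checking the API docs for the version in use.
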