HBase >> mail # user >> HBase parallel scanner performance


Re: HBase parallel scanner performance
Great thread for a real-world problem.

Michael, it sounds like the initial design was more of a traditional DB
solution, whereas with HBase (and NoSQL in general) the design is to
denormalize and build your row/CF structure to fit the use case.  Disks are
cheap and writes are fast, so build your index in order to scan for the
results you need.
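The "build your index" advice above can be sketched with a toy in-memory model. Plain Java collections stand in for the HBase index table here, and the `UrlIndex` name and its methods are illustrative, not from the thread: one extra, cheap write per tweet at ingest time turns the later url lookup into a single keyed get instead of a full-table scan.

```java
import java.util.*;

// Toy model of a denormalized index table: row key = url,
// value = the set of tweet row keys that contain that url.
// In HBase this would be a second table, written at ingest time.
class UrlIndex {
    private final Map<String, Set<String>> index = new TreeMap<>();

    // Called once per tweet at write time (the extra, cheap write).
    void put(String url, String tweetRowKey) {
        index.computeIfAbsent(url, k -> new TreeSet<>()).add(tweetRowKey);
    }

    // Replaces the inner table scan: one keyed lookup instead of
    // scanning every row for each url in the outer loop.
    Set<String> tweetsFor(String url) {
        return index.getOrDefault(url, Collections.emptySet());
    }
}
```

With an index like this, the 10k-row outer loop described later in the thread issues 10k keyed lookups instead of 10k full scans.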

On Thu, Apr 19, 2012 at 2:33 PM, Michael Segel <[EMAIL PROTECTED]> wrote:

> No problem.
>
> One of the hardest things to do is to try to be open to other design ideas
> and not become wedded to one.
>
> I think once you get that working you can start to look at your cluster.
>
> On Apr 19, 2012, at 1:26 PM, Narendra yadala wrote:
>
> > Michael,
> >
> > I will do the redesign and build the index. Thanks a lot for the
> insights.
> >
> > Narendra
> >
> > On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel <
> [EMAIL PROTECTED]> wrote:
> >
> >> Narendra,
> >>
> >> I think you are still missing the point.
> >> It takes 130 seconds to scan the table per iteration.
> >> Even if you have only 10K rows in the outer loop, that is
> >> 130 * 10^4 = 1.3*10^6 seconds, or ~361 hours.
> >>
> >> Compare that to 10K rows where you then select a single row in your sub
> >> select that has a list of all of the associated rows.
> >> You can then do  n number of get()s based on the data in the index. (If
> >> the data wasn't in the index itself)
> >>
> >> Assuming that the data was in the index, that's one get(). This is sub
> >> second.
> >> Just to keep things simple assume 1 second.
> >> That's 10K seconds vs 1.3 million seconds.  (roughly 3 hours vs 361
> >> hours)
> >> Actually it's more like 10ms per get(), so it's 100 seconds to run your
> >> code.  (So it's like 2 minutes or so.)
> >>
> >> Also since you're doing less work, you put less strain on the system.
> >>
> >> Look, you're asking for help. You're fighting to maintain a bad design.
> >> Building the index table shouldn't take you more than a day to think,
> >> design and implement.
> >>
> >> So you tell me, 2 minutes vs 361 hours. Which would you choose?
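Michael's back-of-the-envelope comparison can be checked with a few lines of arithmetic; the 130 s/scan and ~10 ms/get figures are the ones quoted in this thread, and the `CostEstimate` helper is just a sketch for the calculation:

```java
// Back-of-the-envelope costs from the thread's figures:
// 130 s per full-table scan, ~10 ms per indexed get().
class CostEstimate {
    // Scan-per-row design: one full 130 s scan per outer row.
    static long scanSeconds(int outerRows) {
        return 130L * outerRows;
    }
    static long scanHours(int outerRows) {
        return scanSeconds(outerRows) / 3600;
    }
    // Index design: one ~10 ms get() per outer row.
    static double getSeconds(int outerRows) {
        return outerRows * 0.010;
    }
}
```

For 10K outer rows this works out to ~361 hours of scanning versus about 100 seconds of gets, which is the "2 minutes vs 361 hours" comparison above.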
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >>
> >> On Apr 19, 2012, at 10:04 AM, Narendra yadala wrote:
> >>
> >>> Michael,
> >>>
> >>> Thanks for the response. This is a real problem and not a class
> project.
> >>> The boxes themselves cost 9k ;)
> >>>
> >>> I think there is some difference in understanding of the problem. The
> >> table
> >>> has 2m rows but I am looking at the latest 10k rows only in the outer
> for
> >>> loop. Only in the inner for loop I am trying to get all rows that
> contain
> >>> the url that is given by the row in the outer for loop. So pseudo code
> is
> >>> like this
> >>>
> >>> All scanners have a caching of 128.
> >>>
> >>> // This gets the entire row for each of the latest 10k tweets.
> >>> ResultScanner outerScanner = tweetTable.getScanner(new Scan());
> >>> for (int index = 0; index < 10000; index++) {
> >>>     Result tweet = outerScanner.next();
> >>>     NavigableMap<byte[], byte[]> linkFamilyMap =
> >>>         tweet.getFamilyMap(Bytes.toBytes("link"));
> >>>     // assuming only one link is there in the tweet
> >>>     String url = Bytes.toString(linkFamilyMap.firstKey());
> >>>     Scan linkScan = new Scan();
> >>>     // fetch only the matching column from the "link" family
> >>>     linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url));
> >>>     // this inner scan takes about 2 sec per scan
> >>>     ResultScanner linkScanner = tweetTable.getScanner(linkScan);
> >>>     for (Result linkResult = linkScanner.next(); linkResult != null;
> >>>          linkResult = linkScanner.next()) {
> >>>         // do something with the link
> >>>     }
> >>>     linkScanner.close();
> >>>
> >>>     // do a similar for loop for hashtags
> >>> }
> >>>
> >>> Each of my inner for loops is taking around 20 seconds (or more
> depending
> >> on
> >>> number of rows returned by that particular scanner) for each of the 10k
> >>> rows that I am processing and this is also triggering a lot of GC in
> >> turn.
> >>> So it is 10000*40 seconds (4 days) for each thread. But the problem is
> >> that
> >>> the batch process crashes before completion throwing IOException and