Re: Large-scale web analytics with Accumulo (and Nutch/Gora, Pig, and Storm)
Jason Trost 2012-11-04, 17:34

Keith,

In Nutch/Gora, each job specifies which fields should be filled in when reading data from the datastore (standard database projection). Many of these jobs perform full table scans but read only a few of the fields. Because of this, if you don't have locality groups set up, Accumulo has to scan through all of the data, including the fields the job never asked for.

For example, the CONTENT field is one of the largest fields but is really only retrieved during the ParserJob. There are typically many OUTLINKS and INLINKS, and they are only used in the DbUpdaterJob and the ParserJob. I believe putting CONTENT and HEADERS in their own locality group provides the biggest gain, but separating INLINKS and OUTLINKS from the others also helps.

See below for the fields and the jobs (note: some jobs use fields from the others, but these are the basic fields per job).

InjectorJob.java:
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.STATUS);

DbUpdaterJob.java:
    FIELDS.add(WebPage.Field.OUTLINKS);
    FIELDS.add(WebPage.Field.INLINKS);
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.PREV_SIGNATURE);
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.METADATA);
    FIELDS.add(WebPage.Field.RETRIES_SINCE_FETCH);
    FIELDS.add(WebPage.Field.FETCH_TIME);
    FIELDS.add(WebPage.Field.MODIFIED_TIME);
    FIELDS.add(WebPage.Field.FETCH_INTERVAL);
    FIELDS.add(WebPage.Field.PREV_FETCH_TIME);

GeneratorJob.java:
    FIELDS.add(WebPage.Field.FETCH_TIME);
    FIELDS.add(WebPage.Field.SCORE);
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.MARKERS);

IndexerJob.java:
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.PARSE_STATUS);
    FIELDS.add(WebPage.Field.SCORE);
    FIELDS.add(WebPage.Field.MARKERS);

FetcherJob.java:
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.REPR_URL);
    FIELDS.add(WebPage.Field.FETCH_TIME);

ParserJob.java:
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.CONTENT_TYPE);
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.PARSE_STATUS);
    FIELDS.add(WebPage.Field.OUTLINKS);
    FIELDS.add(WebPage.Field.METADATA);
    FIELDS.add(WebPage.Field.HEADERS);

Yeah, we use supervisord pretty heavily for most services that we deploy. We found that many (well-designed) services fail fast and recover when restarted, so for the most part this works pretty well. You definitely need to keep an eye on the supervisord logs to see which services are failing, how frequently, and why.

Thanks,

--Jason

On Sat, Nov 3, 2012 at 4:29 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 2, 2012 at 9:43 PM, Jason Trost <[EMAIL PROTECTED]> wrote:
> > I gave this talk at an Accumulo Meetup group co-located with
> > StrataConf/Hadoop World in NYC. I thought you all might be interested.
> >
> > Large-scale web analytics with Accumulo (and Nutch/Gora, Pig, and Storm)
> > http://www.slideshare.net/jasontrost/accumulo-at-endgame
> >
> > Let me know if you have any questions.
>
> Are you running all hadoop, zookeeper, accumulo, etc processes under
> supervisord? I have not heard of that. I just took a quick look at
> their web page, it looks interesting.
>
> I am curious about the bullet "Locality groups are your friend for
> Nutch/Gora". Can you elaborate?
>
> >
> > --Jason
> >
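
A rough sketch of the locality-group reasoning in the reply above, in case it helps to see it worked through. It inverts the per-job field lists from the message to show which jobs would touch each proposed group. The class name, the group names ("content", "links"), and the use of field names as stand-ins for column families are assumptions for illustration only; the real column families depend on the Gora-to-Accumulo mapping, which the message does not show.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch, Gora, or Accumulo.
public class LocalityGroupSketch {

    // Fields each job requests, copied from the lists in the message.
    static Map<String, List<String>> fieldsPerJob() {
        Map<String, List<String>> jobs = new LinkedHashMap<>();
        jobs.put("InjectorJob", Arrays.asList("MARKERS", "STATUS"));
        jobs.put("DbUpdaterJob", Arrays.asList("OUTLINKS", "INLINKS", "STATUS",
                "PREV_SIGNATURE", "SIGNATURE", "MARKERS", "METADATA",
                "RETRIES_SINCE_FETCH", "FETCH_TIME", "MODIFIED_TIME",
                "FETCH_INTERVAL", "PREV_FETCH_TIME"));
        jobs.put("GeneratorJob", Arrays.asList("FETCH_TIME", "SCORE", "STATUS", "MARKERS"));
        jobs.put("IndexerJob", Arrays.asList("SIGNATURE", "PARSE_STATUS", "SCORE", "MARKERS"));
        jobs.put("FetcherJob", Arrays.asList("MARKERS", "REPR_URL", "FETCH_TIME"));
        jobs.put("ParserJob", Arrays.asList("STATUS", "CONTENT", "CONTENT_TYPE",
                "SIGNATURE", "MARKERS", "PARSE_STATUS", "OUTLINKS", "METADATA", "HEADERS"));
        return jobs;
    }

    // Invert the mapping: field -> set of jobs that read it.
    static Map<String, Set<String>> jobsPerField(Map<String, List<String>> perJob) {
        Map<String, Set<String>> usage = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : perJob.entrySet()) {
            for (String field : e.getValue()) {
                usage.computeIfAbsent(field, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return usage;
    }

    // Grouping suggested in the message; the group names are made up here.
    static Map<String, Set<String>> suggestedGroups() {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        groups.put("content", new TreeSet<>(Arrays.asList("CONTENT", "HEADERS")));
        groups.put("links", new TreeSet<>(Arrays.asList("INLINKS", "OUTLINKS")));
        return groups;
    }

    // Jobs that request at least one field of the given group.
    static Set<String> jobsTouching(Set<String> groupFields, Map<String, Set<String>> usage) {
        Set<String> jobs = new TreeSet<>();
        for (String field : groupFields) {
            jobs.addAll(usage.getOrDefault(field, new TreeSet<>()));
        }
        return jobs;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> usage = jobsPerField(fieldsPerJob());
        for (Map.Entry<String, Set<String>> g : suggestedGroups().entrySet()) {
            System.out.println(g.getKey() + " " + g.getValue()
                    + " is read by " + jobsTouching(g.getValue(), usage));
        }
    }
}
```

With the lists as given, the "content" group is requested only by ParserJob, so every other full-table scan could skip it. A map of this shape (group name to a set of column families) is roughly what Accumulo's TableOperations.setLocalityGroups expects, with the families given as Text values rather than the field names used here.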