

Jason Trost 2012-11-03, 01:43
David Medinets 2012-11-03, 03:45
David Medinets 2012-11-03, 03:47
Jason Trost 2012-11-03, 12:36
Jason Trost 2012-11-03, 12:35
Keith Turner 2012-11-03, 20:29
Re: Large-scale web analytics with Accumulo (and Nutch/Gora, Pig, and Storm)
Keith,

In Nutch/Gora, each job specifies which fields should be filled in when
reading data from the datastore (a standard database projection).

Many of these jobs perform full table scans but read only a few of the
fields.  Because of this, if you don't have locality groups set up,
Accumulo has to scan through all the data anyway.  For example, the
CONTENT field is one of the largest fields but is really only retrieved
during the ParserJob.  There are typically many OUTLINKS and INLINKS, and
they are only used in the DbUpdaterJob and the ParserJob.
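To illustrate the projection idea in plain Java (the field names and the helper here are made up for illustration; in Nutch 2.x the projection is expressed through the per-job FIELDS sets listed further down and applied by Gora when reading from the datastore):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class Projection {
    // Keep only the requested fields of a record, the way a database
    // projection (or Gora's per-job field list) limits what is read.
    static Map<String, Object> project(Map<String, Object> record,
                                       Set<String> fields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            if (fields.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("status", 2);
        page.put("content", "<html>...</html>"); // large, rarely needed
        page.put("markers", "dist");
        // GeneratorJob-style read: CONTENT is never materialized
        System.out.println(project(page, Set.of("status", "markers")));
        // prints {status=2, markers=dist}
    }
}
```

The projection alone only saves deserialization; without locality groups the scan still has to page through every stored field, which is what the next point addresses.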

I believe putting the CONTENT and HEADERS in their own locality group
provides the biggest gain, but separating INLINKS and OUTLINKS from the
others also helps.
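For reference, a locality-group setup along those lines can be sketched in the Accumulo shell (the column family names and table name here are placeholders; use whatever your Gora-Accumulo mapping actually writes):

```
setgroups content=cnt,h links=il,ol -t webpage
compact -t webpage
```

The compact rewrites the existing files so the new groups apply to data already in the table, not just to new writes.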

See below for the fields each job reads (note: some jobs use fields from
the others, but these are the basic fields per job).

InjectorJob.java:
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.STATUS);

DbUpdaterJob.java:
    FIELDS.add(WebPage.Field.OUTLINKS);
    FIELDS.add(WebPage.Field.INLINKS);
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.PREV_SIGNATURE);
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.METADATA);
    FIELDS.add(WebPage.Field.RETRIES_SINCE_FETCH);
    FIELDS.add(WebPage.Field.FETCH_TIME);
    FIELDS.add(WebPage.Field.MODIFIED_TIME);
    FIELDS.add(WebPage.Field.FETCH_INTERVAL);
    FIELDS.add(WebPage.Field.PREV_FETCH_TIME);

GeneratorJob.java:
    FIELDS.add(WebPage.Field.FETCH_TIME);
    FIELDS.add(WebPage.Field.SCORE);
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.MARKERS);

IndexerJob.java:
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.PARSE_STATUS);
    FIELDS.add(WebPage.Field.SCORE);
    FIELDS.add(WebPage.Field.MARKERS);

FetcherJob.java:
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.REPR_URL);
    FIELDS.add(WebPage.Field.FETCH_TIME);

ParserJob.java:
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.CONTENT_TYPE);
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.PARSE_STATUS);
    FIELDS.add(WebPage.Field.OUTLINKS);
    FIELDS.add(WebPage.Field.METADATA);
    FIELDS.add(WebPage.Field.HEADERS);

Yeah, we use supervisord pretty heavily for most services that we deploy.
 We found that many well-designed services fail fast and recover when
restarted, so for the most part this works pretty well.  You definitely
need to keep an eye on the supervisord logs to see which services are
failing, how frequently, and why.
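To make that concrete, a supervisord program entry for a fail-fast service might look like this (paths, user, and retry counts are illustrative, not from the talk):

```ini
[program:accumulo-tserver]
command=/opt/accumulo/bin/accumulo tserver
directory=/opt/accumulo
user=accumulo
autostart=true
autorestart=true          ; restart whenever the process exits
startretries=10           ; give up after repeated immediate failures
stdout_logfile=/var/log/supervisor/tserver.out
stderr_logfile=/var/log/supervisor/tserver.err
```

supervisord's own log (supervisord.log) is where repeated restarts and eventual give-ups show up, which is the part worth watching.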

Thanks,

--Jason

On Sat, Nov 3, 2012 at 4:29 PM, Keith Turner <[EMAIL PROTECTED]> wrote:

> On Fri, Nov 2, 2012 at 9:43 PM, Jason Trost <[EMAIL PROTECTED]> wrote:
> > I gave this talk at an Accumulo Meetup group co-located with
> > StrataConf/Hadoop World in NYC.  I thought you all might be interested.
> >
> > Large-scale web analytics with Accumulo (and Nutch/Gora, Pig, and Storm)
> > http://www.slideshare.net/jasontrost/accumulo-at-endgame
> >
> > Let me know if you have any questions.
>
> Are you running all the Hadoop, ZooKeeper, Accumulo, etc. processes
> under supervisord?  I have not heard of that.  I just took a quick look
> at their web page; it looks interesting.
>
> I am curious about the bullet "Locality groups are your friend for
> Nutch/Gora".   Can you elaborate?
>
> >
> > --Jason
> >
>