HBase >> mail # user >> use hbase as distributed crawl's scheduler
Re: use hbase as distributed crawl's scheduler
Please take a look at our Apache incubator proposal, as I think that may
answer your questions: https://wiki.apache.org/incubator/PhoenixProposal
On Fri, Jan 3, 2014 at 11:47 PM, Li Li <[EMAIL PROTECTED]> wrote:

> so what's the relationship between Phoenix and HBase? Something like
> Hadoop and Hive?
>
>
> On Sat, Jan 4, 2014 at 3:43 PM, James Taylor <[EMAIL PROTECTED]>
> wrote:
> > Hi LiLi,
> > Phoenix isn't an experimental project. We're on our 2.2 release, and many
> > companies (including the company for which I'm employed, Salesforce.com)
> > use it in production today.
> > Thanks,
> > James
> >
> >
> > On Fri, Jan 3, 2014 at 11:39 PM, Li Li <[EMAIL PROTECTED]> wrote:
> >
> >> hi James,
> >>     Phoenix seems great but it's now only an experimental project. I
> >> want to use only HBase. Could you tell me the difference between
> >> Phoenix and HBase? If I use HBase only, how should I design the schema
> >> and the other things needed for my goal? Thank you
> >>
> >> On Sat, Jan 4, 2014 at 3:41 AM, James Taylor <[EMAIL PROTECTED]>
> >> wrote:
> >> > On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika <[EMAIL PROTECTED]>
> >> wrote:
> >> >
> >> >> Couple of notes:
> >> >> 1. When updating the status you essentially add a new rowkey into
> >> >> HBase; I would give it up altogether. The essential requirement seems
> >> >> to point at retrieving a list of urls in a certain order.
> >> >>
> >> > Not sure on this, but it seemed to me that setting the status field
> >> > is forcing the urls that have been processed to be at the end of the
> >> > sort order.
> >> >
> >> >> 2. Wouldn't salting ruin the required sort order (priority, date
> >> >> added)?
> >> >>
> >> > No, as Phoenix maintains returning rows in row key order even when
> >> > they're salted. We do parallel scans for each bucket and do a merge
> >> > sort on the client, so the cost is pretty low for this (we also
> >> > provide a way of turning this off if your use case doesn't need it).
> >> >
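The client-side merge sort James describes (one row-key-ordered stream per salt bucket, merged on the client) can be sketched with a priority queue. This is an illustration of the technique only: the class name, the in-memory lists standing in for per-bucket scans, and the sample URLs are all assumptions, not Phoenix internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class BucketMerge {
    // K-way merge of several individually sorted bucket lists into one
    // globally sorted list, using a min-heap over the current head of
    // each bucket. Heap entries are {bucketIndex, offsetWithinBucket}.
    static List<String> mergeSorted(List<List<String>> buckets) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> buckets.get(a[0]).get(a[1])
                             .compareTo(buckets.get(b[0]).get(b[1])));
        for (int i = 0; i < buckets.size(); i++) {
            if (!buckets.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();                 // smallest current head
            merged.add(buckets.get(top[0]).get(top[1]));
            if (top[1] + 1 < buckets.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1}); // advance that bucket
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Each inner list plays the role of one salt bucket's sorted scan.
        List<List<String>> buckets = Arrays.asList(
            Arrays.asList("a.com", "d.com"),
            Arrays.asList("b.com", "e.com"),
            Arrays.asList("c.com"));
        System.out.println(mergeSorted(buckets));
        // prints [a.com, b.com, c.com, d.com, e.com]
    }
}
```

The cost stays low because each step only compares the current head of each bucket, i.e. O(total rows × log buckets).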
> >> > Two years, JM? Now you're really going to have to start using
> >> > Phoenix :-)
> >> >
> >> >
> >> >> On Friday, January 3, 2014, James Taylor wrote:
> >> >>
> >> >> > Sure, no problem. One addition: depending on the cardinality of
> >> >> > your priority column, you may want to salt your table to prevent
> >> >> > hotspotting, since you'll have a monotonically increasing date in
> >> >> > the key. To do that, just add " SALT_BUCKETS=<n>" on to your query,
> >> >> > where <n> is the number of machines in your cluster. You can read
> >> >> > more about salting here:
> >> >> > http://phoenix.incubator.apache.org/salted.html
> >> >> >
> >> >> >
> >> >> > On Thu, Jan 2, 2014 at 11:36 PM, Li Li <[EMAIL PROTECTED]> wrote:
> >> >> >
> >> >> > > thank you. it's great.
> >> >> > >
> >> >> > > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor
> >> >> > > <[EMAIL PROTECTED]> wrote:
> >> >> > > > Hi LiLi,
> >> >> > > > Have a look at Phoenix (http://phoenix.incubator.apache.org/).
> >> >> > > > It's a SQL skin on top of HBase. You can model your schema and
> >> >> > > > issue your queries just like you would with MySQL. Something
> >> >> > > > like this:
> >> >> > > >
> >> >> > > > // Create table that optimizes for your most common query
> >> >> > > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd
> >> >> > > > // want your rows ordered)
> >> >> > > > CREATE TABLE url_db (
> >> >> > > >     status TINYINT,
> >> >> > > >     priority INTEGER NOT NULL,
> >> >> > > >     added_time DATE,
> >> >> > > >     url VARCHAR NOT NULL,
> >> >> > > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> >> >> > > >
> >> >> > > > int lastStatus = 0;
> >> >> > > > int lastPriority = 0;
> >> >> > > > Date lastAddedTime = new Date(0);
> >> >> > > > String lastUrl = "";
> >> >> > > >
> >> >> > > > while (true) {
> >> >> > > >     // Use row value constructor to page through results
> >> >> > > >     // in batches
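The quoted message breaks off at the paging loop. Below is a minimal, self-contained sketch of the row-value-constructor idea it was heading toward: each batch fetches the first n keys strictly greater than the last key seen, which is what a Phoenix query along the lines of `SELECT ... WHERE (status, priority, added_time, url) > (?, ?, ?, ?) LIMIT n` would do server-side. The in-memory key list, class name, and sample data are illustrative assumptions, not the author's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeysetPaging {
    // Return up to 'limit' keys from the sorted list that sort strictly
    // after 'lastKey' -- the keyset-pagination predicate a row value
    // constructor expresses in SQL.
    static List<String> nextBatch(List<String> sortedKeys, String lastKey, int limit) {
        List<String> batch = new ArrayList<>();
        for (String key : sortedKeys) {
            if (key.compareTo(lastKey) > 0) {
                batch.add(key);
                if (batch.size() == limit) break;
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("u1", "u2", "u3", "u4", "u5");
        String last = "";  // "" sorts before every key, like lastUrl = "" above
        while (true) {
            List<String> batch = nextBatch(keys, last, 2);
            if (batch.isEmpty()) break;            // no more rows: stop paging
            System.out.println(batch);
            last = batch.get(batch.size() - 1);    // remember last key seen
        }
        // prints [u1, u2] then [u3, u4] then [u5]
    }
}
```

The design point is that each batch's query restarts the scan just past the last row already consumed, so no offset counting (and no server-side cursor state) is needed between batches.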