Re: use hbase as distributed crawl's scheduler
Please take a look at our Apache incubator proposal, as I think that may
answer your questions: https://wiki.apache.org/incubator/PhoenixProposal
On Fri, Jan 3, 2014 at 11:47 PM, Li Li <[EMAIL PROTECTED]> wrote:

> So what's the relationship of Phoenix and HBase? Something like Hadoop and
> Hive?
>
>
> On Sat, Jan 4, 2014 at 3:43 PM, James Taylor <[EMAIL PROTECTED]> wrote:
> > Hi LiLi,
> > Phoenix isn't an experimental project. We're on our 2.2 release, and many
> > companies (including the company for which I'm employed, Salesforce.com)
> > use it in production today.
> > Thanks,
> > James
> >
> >
> > On Fri, Jan 3, 2014 at 11:39 PM, Li Li <[EMAIL PROTECTED]> wrote:
> >
> >> Hi James,
> >>     Phoenix seems great, but it's currently only an experimental project. I
> >> want to use only HBase. Could you tell me the difference between Phoenix
> >> and HBase? If I use HBase only, how should I design the schema and
> >> the other pieces needed for my goal? Thank you.
> >>
> >> On Sat, Jan 4, 2014 at 3:41 AM, James Taylor <[EMAIL PROTECTED]> wrote:
> >> > On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> A couple of notes:
> >> >> 1. When updating the status you essentially add a new rowkey into HBase; I
> >> >> would give it up altogether. The essential requirement seems to point at
> >> >> retrieving a list of urls in a certain order.
> >> >>
> >> > Not sure on this, but it seemed to me that setting the status field is forcing
> >> > the urls that have been processed to be at the end of the sort order.
> >> >
> >> >> 2. Wouldn't salting ruin the sort order required? Priority, date added?
> >> >>
> >> > No, as Phoenix maintains returning rows in row key order even when they're
> >> > salted. We do parallel scans for each bucket and do a merge sort on the
> >> > client, so the cost is pretty low for this (we also provide a way of
> >> > turning this off if your use case doesn't need it).
> >> >
> >> > Two years, JM? Now you're really going to have to start using Phoenix :-)
> >> >
> >> >
> >> >> On Friday, January 3, 2014, James Taylor wrote:
> >> >>
> >> >> > Sure, no problem. One addition: depending on the cardinality of your
> >> >> > priority column, you may want to salt your table to prevent hotspotting,
> >> >> > since you'll have a monotonically increasing date in the key. To do that,
> >> >> > just add " SALT_BUCKETS=<n>" to your CREATE TABLE statement, where <n> is the
> >> >> > number of machines in your cluster. You can read more about salting here:
> >> >> > http://phoenix.incubator.apache.org/salted.html
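
As a concrete illustration of the salting suggestion above, here is a minimal sketch of the DDL with salting added. It reuses the url_db schema quoted later in this thread; the SALT_BUCKETS value of 4 is only an assumed placeholder for the number of machines in the cluster:

    -- Same schema as the CREATE TABLE quoted later in this thread,
    -- with salting added. SALT_BUCKETS = 4 is an assumed value; set it
    -- to roughly the number of machines (region servers) in your cluster.
    CREATE TABLE url_db (
        status TINYINT,
        priority INTEGER NOT NULL,
        added_time DATE,
        url VARCHAR NOT NULL
        CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))
        SALT_BUCKETS = 4;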
> >> >> >
> >> >> >
> >> >> > On Thu, Jan 2, 2014 at 11:36 PM, Li Li <[EMAIL PROTECTED]> wrote:
> >> >> >
> >> >> > > Thank you, it's great.
> >> >> > >
> >> >> > > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <[EMAIL PROTECTED]> wrote:
> >> >> > > > Hi LiLi,
> >> >> > > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
> >> >> > > > skin on top of HBase. You can model your schema and issue your queries just
> >> >> > > > like you would with MySQL. Something like this:
> >> >> > > >
> >> >> > > > // Create table that optimizes for your most common query
> >> >> > > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want your rows ordered)
> >> >> > > > CREATE TABLE url_db (
> >> >> > > >     status TINYINT,
> >> >> > > >     priority INTEGER NOT NULL,
> >> >> > > >     added_time DATE,
> >> >> > > >     url VARCHAR NOT NULL
> >> >> > > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> >> >> > > >
> >> >> > > > int lastStatus = 0;
> >> >> > > > int lastPriority = 0;
> >> >> > > > Date lastAddedTime = new Date(0);
> >> >> > > > String lastUrl = "";
> >> >> > > >
> >> >> > > > while (true) {
> >> >> > > >     // Use row value constructor to page through results in batches
> >
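
Picking up on the row value constructor comment above, a minimal sketch of the kind of SELECT that loop body might issue on each iteration, assuming the url_db schema from this thread. The bind parameters would be filled from lastStatus, lastPriority, lastAddedTime, and lastUrl, and the LIMIT of 1000 is only an assumed batch size, not part of the original message:

    -- Fetch the next batch of urls in row key (priority) order, starting
    -- just after the last row returned by the previous batch. The four ?
    -- bind parameters correspond to lastStatus, lastPriority, lastAddedTime,
    -- and lastUrl; LIMIT 1000 is an assumed batch size.
    SELECT status, priority, added_time, url
    FROM url_db
    WHERE (status, priority, added_time, url) > (?, ?, ?, ?)
    ORDER BY status, priority, added_time, url
    LIMIT 1000;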