HBase, mail # dev - Re: [ANN]: HBaseWD: Distribute Sequential Writes in HBase


Re: [ANN]: HBaseWD: Distribute Sequential Writes in HBase
Alex Baranau 2011-05-13, 20:12
Thanks for the interest!

We are using it in production. It is simple and hence quite stable. Some
minor pieces are still missing (like
https://github.com/sematext/HBaseWD/issues/1), but this doesn't affect stability
or major functionality.

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
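
The quoted thread below discusses two ideas: detecting via reflection whether
Scan supports setAttribute(), and tagging every scan fired by one distributed
scan with a shared "sourceScan" ID. A minimal, self-contained sketch of both
follows. It does not use HBase classes: ScanWithAttributes and LegacyScan are
hypothetical stand-ins, the setAttribute(String, byte[]) signature is an
assumption based on the attribute map in the quoted write() code, and using a
UUID as the ID is only one possible choice.

```java
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class SourceScanSketch {
    /** Hypothetical stand-in for a Scan class that supports attributes. */
    public static class ScanWithAttributes {
        private final Map<String, byte[]> attributes = new HashMap<>();

        public void setAttribute(String name, byte[] value) {
            attributes.put(name, value);
        }

        public byte[] getAttribute(String name) {
            return attributes.get(name);
        }
    }

    /** Hypothetical stand-in for an older Scan class without attributes. */
    public static class LegacyScan {
    }

    /** Returns the public setAttribute(String, byte[]) method, or null if absent. */
    static Method findSetAttribute(Class<?> scanClass) {
        try {
            return scanClass.getMethod("setAttribute", String.class, byte[].class);
        } catch (NoSuchMethodException e) {
            return null;
        }
    }

    /** Sets the same "sourceScan" value on every scan that supports it; returns the count. */
    static int tagScans(String sourceScanId, Object... scans) {
        byte[] value = sourceScanId.getBytes(StandardCharsets.UTF_8);
        int tagged = 0;
        for (Object scan : scans) {
            Method setter = findSetAttribute(scan.getClass());
            if (setter == null) {
                continue; // attribute support missing: leave this scan untouched
            }
            try {
                setter.invoke(scan, "sourceScan", value);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("setAttribute call failed", e);
            }
            tagged++;
        }
        return tagged;
    }

    public static void main(String[] args) {
        // One ID per distributed scan, shared by all the scans it fires.
        String id = UUID.randomUUID().toString();
        int tagged = tagScans(id, new ScanWithAttributes(), new LegacyScan());
        System.out.println("tagged " + tagged + " of 2 scans");
    }
}
```

With this shape, scans that carry equal "sourceScan" values could be grouped on
the server side, and the value stays small regardless of how long the start
and stop rows are.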

On Fri, May 13, 2011 at 10:45 AM, Weishung Chung <[EMAIL PROTECTED]> wrote:

> What's the status on this package? Is it mature enough?
> I am using it in my project. I tried out the write method yesterday and am
> going to incorporate it into the read method tomorrow.
>
> On Wed, May 11, 2011 at 3:41 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>
> > > The start/end rows may be written twice.
> >
> > Yeah, I know. I meant that the size of the startRow+stopRow data is
> > "bearable" in an attribute value no matter how long the keys are, since we
> > are already OK with transferring them initially (i.e. we should be OK with
> > transferring twice as much).
> >
> > So, what about the sourceScan attribute value suggestion I mentioned? If
> > you can tell me why it isn't sufficient in your case, I'd have more info to
> > think about a better suggestion ;)
> >
> > > It is Okay to keep all versions of your patch in the JIRA.
> > > Maybe the second should be named HBASE-3811-v2.patch
> > > <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?
> >
> > np. Can do that. Just thought that they (patches) can be sorted by date to
> > find out the final one (aka "convention over naming-rules").
> >
> > Alex.
> >
> > On Wed, May 11, 2011 at 11:13 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> > > >> Though it might be ok, since we anyways "transfer" start/stop rows with
> > > >> Scan object.
> > > In write() method, we now have:
> > >     Bytes.writeByteArray(out, this.startRow);
> > >     Bytes.writeByteArray(out, this.stopRow);
> > > ...
> > >       for (Map.Entry<String, byte[]> attr : this.attributes.entrySet()) {
> > >         WritableUtils.writeString(out, attr.getKey());
> > >         Bytes.writeByteArray(out, attr.getValue());
> > >       }
> > > The start/end rows may be written twice.
> > >
> > > Of course, you have full control over how to generate the unique ID for
> > > "sourceScan" attribute.
> > >
> > > > It is Okay to keep all versions of your patch in the JIRA. Maybe the
> > > > second should be named HBASE-3811-v2.patch
> > > > <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?
> > >
> > > Thanks
> > >
> > >
> > > On Wed, May 11, 2011 at 1:01 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:
> > >
> > >> > Can you remove the first version ?
> > >> Isn't it ok to keep it in JIRA issue?
> > >>
> > >>
> > >> > In HBaseWD, can you use reflection to detect whether Scan supports
> > >> > setAttribute()?
> > >> > If it does, can you encode start row and end row as "sourceScan"
> > >> > attribute?
> > >>
> > >> Yeah, something like this is going to be implemented. Though I'd still
> > >> want to hear from the devs the story about Scan version.
> > >>
> > >>
> > >> > One consideration is that start row or end row may be quite long.
> > >>
> > >> Yeah, that was my thought too at first. Though it might be ok, since we
> > >> anyways "transfer" start/stop rows with Scan object.
> > >>
> > >> > What do you think ?
> > >>
> > >> I'd love to hear from you whether the variant I mentioned is what we are
> > >> looking at here:
> > >>
> > >>
> > >> > From what I understand, you want to distinguish scans fired by the same
> > >> > distributed scan, i.e. group scans which were fired by a single
> > >> > distributed scan. If that's what you want, the distributed scan can
> > >> > generate a unique ID and set, say, the "sourceScan" attribute to its
> > >> > value. This way we'll have <# of distinct "sourceScan" attribute values>
> > >> > = <number of distributed scans invoked by the client side> and two scans
> > >> > on the server side will have the same