Hi.
I am thinking about creating a Direct Reader for Accumulo.
It would be a library whose API is compatible with the Accumulo client, but which reads .rf files directly from HDFS, bypassing the tservers.
The motivation is:
1. To be able to quickly read stale data when the tserver is busy (re-balancing, reading logs, etc.) or has just gone down and its tablets have not been redistributed yet.
2. If the table is read-only or can tolerate eventual consistency, many readers can work in parallel without the tserver becoming a bottleneck. Also, the table's data becomes local to three servers (the number of HDFS replicas) instead of one.
3. Distribution of data: analysts can download .rf files (even to a laptop) and run their software locally.
Any suggestions?
Thanks.
-
Re: Accumulo Direct Reader
Marc Parisi 2012-10-17, 14:03
RFileOperations.getInstance() returns an instance of FileOperations, which lets you call the openReader method and open any arbitrary RFile. The issue might be locating the RFiles that hold a given row; however, this would be quite simple: go through the Metadata table and look for the RFiles associated with the tablet that contains the row. By doing this you bypass the entire iterator stack. I have an example of this on my GitHub, but in reality the methods I mentioned above are all you really need.
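The lookup step described above can be sketched roughly as follows. This is a hypothetical in-memory model, not Accumulo's API: the class TabletFileLookup, its methods, and the file paths are invented for illustration. It assumes each tablet is identified by an inclusive end row and that the last tablet has no end row. A real reader would fill the map from the Metadata table instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of "which files belong to the tablet holding this row".
// Nothing here is an Accumulo class; all names are assumptions for illustration.
class TabletFileLookup {
    // Tablets keyed by their inclusive end row, kept in sorted order.
    // Note: Java compares Strings by UTF-16 code unit, which is only an
    // approximation of a byte-wise row comparison.
    private final TreeMap<String, List<String>> filesByEndRow = new TreeMap<>();
    // Files of the last tablet, which is assumed to have no end row.
    private List<String> lastTabletFiles = new ArrayList<>();

    // Register a tablet; a null end row marks the last tablet.
    void addTablet(String endRow, List<String> files) {
        if (endRow == null) {
            lastTabletFiles = new ArrayList<>(files);
        } else {
            filesByEndRow.put(endRow, new ArrayList<>(files));
        }
    }

    // A row belongs to the tablet with the smallest end row >= row;
    // if there is none, it falls into the last tablet.
    List<String> filesForRow(String row) {
        Map.Entry<String, List<String>> entry = filesByEndRow.ceilingEntry(row);
        return entry != null ? entry.getValue() : lastTabletFiles;
    }

    public static void main(String[] args) {
        TabletFileLookup lookup = new TabletFileLookup();
        // Made-up paths, only to show the shape of the data.
        lookup.addTablet("g", Arrays.asList("/t-001/F0001.rf", "/t-001/C0002.rf"));
        lookup.addTablet("p", Arrays.asList("/t-002/F0003.rf"));
        lookup.addTablet(null, Arrays.asList("/default_tablet/F0004.rf"));

        System.out.println(lookup.filesForRow("apple"));
        System.out.println(lookup.filesForRow("zebra"));
    }
}
```

Once the file list for a row is known, each file can be opened on its own, which is the part the methods named above cover.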
On Wed, Oct 17, 2012 at 9:46 AM, Denis <[EMAIL PROTECTED]> wrote:
> Hi.
>
> I am thinking about creating a Direct Reader for Accumulo.
>
> A library which has API compatible with the Accumulo client but
> reads .rf-files directly from HDFS, bypassing tservers.
>
> Motivation is:
>
> 1. To have a possibility to quickly read stalled data when the
> tserver is busy (with re-balancing, reading logs, etc) or just went
> down and its tablets are not redistributed yet.
>
> 2. If the table is read-only or can afford eventual consistency,
> many readers can work in parallel with no bottleneck of tserver. Also,
> the table's data becomes local on three (number of HDFS replicas)
> servers instead of one.
>
> 3. Distribution of data: analytics can download .rf-files (even to
> a laptop) and run their software locally.
>
> Any suggestions ?
>
> Thanks.
-
Re: Accumulo Direct Reader
Eric Newton 2012-10-17, 14:57
See InputFormatBase#setScanOffline.
Clone a table, take it offline and then use it as your map/reduce input format. This will preserve a consistent view of the underlying files, without going through the tablet servers.
-Eric
-
Re: Accumulo Direct Reader
Keith Turner 2012-10-17, 15:13
On Wed, Oct 17, 2012 at 10:57 AM, Eric Newton <[EMAIL PROTECTED]> wrote:
> See InputFormatBase#setScanOffline.
This uses o.a.a.c.client.impl.OfflineScanner. OfflineScanner scans an offline table by going directly to the files. It does exactly what the tablet server does when reading a tablet's files. When I added setScanOffline to the M/R code, I was thinking of making OfflineScanner available through Connector somehow, but did not for some reason. If there is interest, we could revisit this.
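To give a feel for what reading a tablet's files directly involves: a tablet's data can be spread over several files, each sorted on its own, so a reader has to merge them into one sorted stream. Below is a minimal sketch of such a merge, not the OfflineScanner code. The class SortedFileMerge and its methods are hypothetical, each "file" is modelled as a sorted list of keys, and timestamps, deletes, and iterators are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified model: each "file" is a sorted list of keys. Not an Accumulo API.
class SortedFileMerge {
    // One open file plus the key it is currently positioned on.
    private static final class Cursor implements Comparable<Cursor> {
        String current;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.current = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return current.compareTo(other.current);
        }
    }

    // Merge several individually sorted key lists into one sorted list.
    static List<String> merge(List<List<String>> files) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> file : files) {
            Iterator<String> it = file.iterator();
            if (it.hasNext()) {          // skip empty files
                heap.add(new Cursor(it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // cursor holding the smallest key
            out.add(c.current);
            if (c.rest.hasNext()) {      // advance it and put it back
                c.current = c.rest.next();
                heap.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
                Arrays.asList("a", "d", "x"),
                Arrays.asList("b", "c"),
                Arrays.asList("e"));
        System.out.println(merge(files)); // prints [a, b, c, d, e, x]
    }
}
```

A heap keeps the cost per returned key logarithmic in the number of files, which matters when a tablet has many of them.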