When you say schema, do you mean key schema? If so, why are you repeating
the node id?

Locality groups would help if you have larger swaths of data you want to
group together and query separately from other locality groups. For
instance, I've seen key schemas where "in" and "out" edges are grouped.

At a system level, if you know something about the distribution of the row
values (in this case, it looks like node id and edge id), you can pre-split
the table by sampling that space for split points. This would distribute
the tablets across the tablet servers, making queries that use the batch
scanner faster by increasing parallelism. It would also increase the number
of input splits generated by the input format if you wanted to do batch
processing on the entire graph.
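
As a rough sketch of the sampling idea (not a definitive implementation):
given some sampled row ids, pick evenly spaced ones as split points. The
class and method names here (SplitPoints, fromSamples) are made up for
illustration, and I'm assuming ASCII row ids so that Java's String ordering
matches Accumulo's byte ordering. The result would still need converting to
a SortedSet<Text> before handing it to TableOperations.addSplits().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper: not part of Accumulo.
final class SplitPoints {

    // Picks up to (numTablets - 1) evenly spaced split points from sampled row ids.
    // Assumes ASCII row ids, so String order matches lexicographic byte order.
    static SortedSet<String> fromSamples(Collection<String> sampledRows, int numTablets) {
        if (numTablets < 1) {
            throw new IllegalArgumentException("numTablets must be at least 1");
        }
        // Sort and de-duplicate the samples.
        List<String> sorted = new ArrayList<>(new TreeSet<>(sampledRows));
        SortedSet<String> splits = new TreeSet<>();
        for (int i = 1; i < numTablets; i++) {
            int idx = (int) ((long) i * sorted.size() / numTablets);
            // Skip index 0: splitting at the smallest sample leaves an empty first tablet.
            if (idx > 0 && idx < sorted.size()) {
                splits.add(sorted.get(idx));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
            "N|0001", "N|0002", "N|0003", "N|0004",
            "N|0005", "N|0006", "N|0007", "N|0008");
        System.out.println(fromSamples(samples, 4));
    }
}
```

The more representative the sample is of your real row ids, the more evenly
the resulting tablets should fill. With the N| and E| prefixes in your
schema you would probably want to sample nodes and edges together so both
ranges get split.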

On Wed, Nov 6, 2013 at 9:19 AM, Michael Orr <[EMAIL PROTECTED]> wrote:
> I’m working on an application that needs fast read performance. I’ve been
> conducting some experiments starting with a single (pseudo-distributed)
> cluster with the intent of scaling out. However, prior to doing so, I
> wanted to get a good gauge for how fast a single tablet server can read.
> The application processes and stores graph data with the following schema:
> for nodes:
>   N|NodeID  ID:NodeID  EIN:EdgeID  EOUT:EdgeID  .. lots of other attributes
> there can be multiple EIN and EOUT CFs for each node
> for edges:
>   E|EdgeID  ID:NodeID  VIN:VertexID  EOUT:VertexID  .. lots of other attributes
> Scans into the system can be for entire graph or a subset of nodes and
> edges. We generally pull navigational information first, then other
> attributes later if needed. I’ve spent some time looking into using
> locality groups but was curious if there are recommendations on backend
> properties that could be set to improve read speed, particularly if memory
> and space were not a concern.
> Thanks for your help!