


What is the Communication and Time Complexity for Bulk Inserts?
Jeff Kubina 20121018, 14:49
I am deriving the time complexities for an algorithm I implemented in Hadoop using Accumulo and need to know the time complexity of bulk inserting m records evenly distributed across p nodes into an empty table with p tablet servers. Assuming B is the bandwidth of the network, would the communication complexity be O(m/B) and the computation complexity O(m/p * log(m/p))? If the table contained n records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
+
Jeff Kubina 20121018, 14:49

Re: What is the Communication and Time Complexity for Bulk Inserts?
Josh Elser 20121018, 15:20
Are you referring to "bulk inserts" as importing a presorted rfile of Key/Values or usinga BatchWriter?
On 10/18/12 10:49 AM, Jeff Kubina wrote: > I am deriving the time complexities for an algorithm I implemented in > Hadoop using Accumulo and need to know the time complexity of bulk > inserting m records evenly distributed across p nodes into an empty > table with p tablet servers. Assuming B is the bandwidth of the > network, would the communication complexity be O(m/B) and the > computation complexity O(m/p * log(m/p))? If the table contained n > records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
+
Josh Elser 20121018, 15:20

Re: What is the Communication and Time Complexity for Bulk Inserts?
Jeff Kubina 20121018, 15:37
BatchWriter, but I would be interested in the answer assuming a presorted rfile.
On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> wrote: > Are you referring to "bulk inserts" as importing a presorted rfile of > Key/Values or usinga BatchWriter? > > On 10/18/12 10:49 AM, Jeff Kubina wrote: >> >> I am deriving the time complexities for an algorithm I implemented in >> Hadoop using Accumulo and need to know the time complexity of bulk >> inserting m records evenly distributed across p nodes into an empty >> table with p tablet servers. Assuming B is the bandwidth of the >> network, would the communication complexity be O(m/B) and the >> computation complexity O(m/p * log(m/p))? If the table contained n >> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
+
Jeff Kubina 20121018, 15:37

Re: What is the Communication and Time Complexity for Bulk Inserts?
Eric Newton 20121024, 18:45
Adding a sorted file to accumulo (bulk loading) is essentially constant in the normal case. It is O(n) + O(p) for the worst case where the index must be read, and the file assigned to every tablet server. In this case, the (slow) RPCs will dominate over the (fast) read of the index, except for very small clusters or very large indexes.
Inserting with the BatchWriter is eventually dominated by compactions, which is a merge sort, or O(n log n).
Eric
On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]> wrote: > BatchWriter, but I would be interested in the answer assuming a > presorted rfile. > > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> wrote: >> Are you referring to "bulk inserts" as importing a presorted rfile of >> Key/Values or usinga BatchWriter? >> >> On 10/18/12 10:49 AM, Jeff Kubina wrote: >>> >>> I am deriving the time complexities for an algorithm I implemented in >>> Hadoop using Accumulo and need to know the time complexity of bulk >>> inserting m records evenly distributed across p nodes into an empty >>> table with p tablet servers. Assuming B is the bandwidth of the >>> network, would the communication complexity be O(m/B) and the >>> computation complexity O(m/p * log(m/p))? If the table contained n >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
+
Eric Newton 20121024, 18:45

Re: What is the Communication and Time Complexity for Bulk Inserts?
Jeff Kubina 20121024, 19:57
@eric, assuming the records are evenly distributed and network bandwidth is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))?
On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
> Adding a sorted file to accumulo (bulk loading) is essentially > constant in the normal case. It is O(n) + O(p) for the worst case > where the index must be read, and the file assigned to every tablet > server. In this case, the (slow) RPCs will dominate over the (fast) > read of the index, except for very small clusters or very large > indexes. > > Inserting with the BatchWriter is eventually dominated by compactions, > which is a merge sort, or O(n log n). > > Eric > > On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]> > wrote: > > BatchWriter, but I would be interested in the answer assuming a > > presorted rfile. > > > > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> > wrote: > >> Are you referring to "bulk inserts" as importing a presorted rfile of > >> Key/Values or usinga BatchWriter? > >> > >> On 10/18/12 10:49 AM, Jeff Kubina wrote: > >>> > >>> I am deriving the time complexities for an algorithm I implemented in > >>> Hadoop using Accumulo and need to know the time complexity of bulk > >>> inserting m records evenly distributed across p nodes into an empty > >>> table with p tablet servers. Assuming B is the bandwidth of the > >>> network, would the communication complexity be O(m/B) and the > >>> computation complexity O(m/p * log(m/p))? If the table contained n > >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)? >
+
Jeff Kubina 20121024, 19:57

Re: What is the Communication and Time Complexity for Bulk Inserts?
Adam Fuchs 20121024, 21:50
For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) * p), where n is the size of the file, P is the total number of tablets (proportional to tablet servers), and p is the number of tablets that get assigned that file?
For the BatchWriter case, there's a clientside lookup/binning that takes O(log(p)) per entry, so the latter would be O(n/p * (log(n/p) + log(p))) for each of p partitions. So, O(n*log(n)) in aggregate. Yes/no?
Adam On Wed, Oct 24, 2012 at 3:57 PM, Jeff Kubina <[EMAIL PROTECTED]> wrote:
> @eric, assuming the records are evenly distributed and network bandwidth > is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))? > > > On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[EMAIL PROTECTED]>wrote: > >> Adding a sorted file to accumulo (bulk loading) is essentially >> constant in the normal case. It is O(n) + O(p) for the worst case >> where the index must be read, and the file assigned to every tablet >> server. In this case, the (slow) RPCs will dominate over the (fast) >> read of the index, except for very small clusters or very large >> indexes. >> >> Inserting with the BatchWriter is eventually dominated by compactions, >> which is a merge sort, or O(n log n). >> >> Eric >> >> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]> >> wrote: >> > BatchWriter, but I would be interested in the answer assuming a >> > presorted rfile. >> > >> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> >> wrote: >> >> Are you referring to "bulk inserts" as importing a presorted rfile of >> >> Key/Values or usinga BatchWriter? >> >> >> >> On 10/18/12 10:49 AM, Jeff Kubina wrote: >> >>> >> >>> I am deriving the time complexities for an algorithm I implemented in >> >>> Hadoop using Accumulo and need to know the time complexity of bulk >> >>> inserting m records evenly distributed across p nodes into an empty >> >>> table with p tablet servers. Assuming B is the bandwidth of the >> >>> network, would the communication complexity be O(m/B) and the >> >>> computation complexity O(m/p * log(m/p))? If the table contained n >> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)? >> > >
+
Adam Fuchs 20121024, 21:50

