|
|
-
What is the Communication and Time Complexity for Bulk Inserts?
Jeff Kubina 2012-10-18, 14:49
I am deriving the time complexities for an algorithm I implemented in Hadoop using Accumulo and need to know the time complexity of bulk inserting m records evenly distributed across p nodes into an empty table with p tablet servers. Assuming B is the bandwidth of the network, would the communication complexity be O(m/B) and the computation complexity O(m/p * log(m/p))? If the table contained n records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
-
Re: What is the Communication and Time Complexity for Bulk Inserts?
Josh Elser 2012-10-18, 15:20
Are you referring to "bulk inserts" as importing a pre-sorted rfile of Key/Values or usinga BatchWriter?
On 10/18/12 10:49 AM, Jeff Kubina wrote: > I am deriving the time complexities for an algorithm I implemented in > Hadoop using Accumulo and need to know the time complexity of bulk > inserting m records evenly distributed across p nodes into an empty > table with p tablet servers. Assuming B is the bandwidth of the > network, would the communication complexity be O(m/B) and the > computation complexity O(m/p * log(m/p))? If the table contained n > records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
-
Re: What is the Communication and Time Complexity for Bulk Inserts?
Jeff Kubina 2012-10-18, 15:37
BatchWriter, but I would be interested in the answer assuming a pre-sorted rfile.
On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> wrote: > Are you referring to "bulk inserts" as importing a pre-sorted rfile of > Key/Values or usinga BatchWriter? > > On 10/18/12 10:49 AM, Jeff Kubina wrote: >> >> I am deriving the time complexities for an algorithm I implemented in >> Hadoop using Accumulo and need to know the time complexity of bulk >> inserting m records evenly distributed across p nodes into an empty >> table with p tablet servers. Assuming B is the bandwidth of the >> network, would the communication complexity be O(m/B) and the >> computation complexity O(m/p * log(m/p))? If the table contained n >> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
-
Re: What is the Communication and Time Complexity for Bulk Inserts?
Eric Newton 2012-10-24, 18:45
Adding a sorted file to accumulo (bulk loading) is essentially constant in the normal case. It is O(n) + O(p) for the worst case where the index must be read, and the file assigned to every tablet server. In this case, the (slow) RPCs will dominate over the (fast) read of the index, except for very small clusters or very large indexes.
Inserting with the BatchWriter is eventually dominated by compactions, which is a merge sort, or O(n log n).
-Eric
On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]> wrote: > BatchWriter, but I would be interested in the answer assuming a > pre-sorted rfile. > > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> wrote: >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of >> Key/Values or usinga BatchWriter? >> >> On 10/18/12 10:49 AM, Jeff Kubina wrote: >>> >>> I am deriving the time complexities for an algorithm I implemented in >>> Hadoop using Accumulo and need to know the time complexity of bulk >>> inserting m records evenly distributed across p nodes into an empty >>> table with p tablet servers. Assuming B is the bandwidth of the >>> network, would the communication complexity be O(m/B) and the >>> computation complexity O(m/p * log(m/p))? If the table contained n >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
-
Re: What is the Communication and Time Complexity for Bulk Inserts?
Jeff Kubina 2012-10-24, 19:57
@eric, assuming the records are evenly distributed and network bandwidth is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))?
On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
> Adding a sorted file to accumulo (bulk loading) is essentially > constant in the normal case. It is O(n) + O(p) for the worst case > where the index must be read, and the file assigned to every tablet > server. In this case, the (slow) RPCs will dominate over the (fast) > read of the index, except for very small clusters or very large > indexes. > > Inserting with the BatchWriter is eventually dominated by compactions, > which is a merge sort, or O(n log n). > > -Eric > > On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]> > wrote: > > BatchWriter, but I would be interested in the answer assuming a > > pre-sorted rfile. > > > > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> > wrote: > >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of > >> Key/Values or usinga BatchWriter? > >> > >> On 10/18/12 10:49 AM, Jeff Kubina wrote: > >>> > >>> I am deriving the time complexities for an algorithm I implemented in > >>> Hadoop using Accumulo and need to know the time complexity of bulk > >>> inserting m records evenly distributed across p nodes into an empty > >>> table with p tablet servers. Assuming B is the bandwidth of the > >>> network, would the communication complexity be O(m/B) and the > >>> computation complexity O(m/p * log(m/p))? If the table contained n > >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)? >
-
Re: What is the Communication and Time Complexity for Bulk Inserts?
Adam Fuchs 2012-10-24, 21:50
For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) * p), where n is the size of the file, P is the total number of tablets (proportional to tablet servers), and p is the number of tablets that get assigned that file?
For the BatchWriter case, there's a client-side lookup/binning that takes O(log(p)) per entry, so the latter would be O(n/p * (log(n/p) + log(p))) for each of p partitions. So, O(n*log(n)) in aggregate. Yes/no?
Adam On Wed, Oct 24, 2012 at 3:57 PM, Jeff Kubina <[EMAIL PROTECTED]> wrote:
> @eric, assuming the records are evenly distributed and network bandwidth > is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))? > > > On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[EMAIL PROTECTED]>wrote: > >> Adding a sorted file to accumulo (bulk loading) is essentially >> constant in the normal case. It is O(n) + O(p) for the worst case >> where the index must be read, and the file assigned to every tablet >> server. In this case, the (slow) RPCs will dominate over the (fast) >> read of the index, except for very small clusters or very large >> indexes. >> >> Inserting with the BatchWriter is eventually dominated by compactions, >> which is a merge sort, or O(n log n). >> >> -Eric >> >> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]> >> wrote: >> > BatchWriter, but I would be interested in the answer assuming a >> > pre-sorted rfile. >> > >> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> >> wrote: >> >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of >> >> Key/Values or usinga BatchWriter? >> >> >> >> On 10/18/12 10:49 AM, Jeff Kubina wrote: >> >>> >> >>> I am deriving the time complexities for an algorithm I implemented in >> >>> Hadoop using Accumulo and need to know the time complexity of bulk >> >>> inserting m records evenly distributed across p nodes into an empty >> >>> table with p tablet servers. Assuming B is the bandwidth of the >> >>> network, would the communication complexity be O(m/B) and the >> >>> computation complexity O(m/p * log(m/p))? If the table contained n >> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)? >> > >
|
|