Jeff Kubina 2012-10-18, 14:49
Josh Elser 2012-10-18, 15:20
Jeff Kubina 2012-10-18, 15:37
Eric Newton 2012-10-24, 18:45
Jeff Kubina 2012-10-24, 19:57
-Re: What is the Communication and Time Complexity for Bulk Inserts?
Adam Fuchs 2012-10-24, 21:50
For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) *
p), where n is the size of the file, P is the total number of tablets
(proportional to tablet servers), and p is the number of tablets that get
assigned that file?
For the BatchWriter case, there's a client-side lookup/binning that takes
O(log(p)) per entry, so the latter would be O(n/p * (log(n/p) + log(p)))
for each of p partitions. So, O(n*log(n)) in aggregate. Yes/no?
On Wed, Oct 24, 2012 at 3:57 PM, Jeff Kubina <[EMAIL PROTECTED]> wrote:
> @eric, assuming the records are evenly distributed and network bandwidth
> is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))?
> On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[EMAIL PROTECTED]>wrote:
>> Adding a sorted file to accumulo (bulk loading) is essentially
>> constant in the normal case. It is O(n) + O(p) for the worst case
>> where the index must be read, and the file assigned to every tablet
>> server. In this case, the (slow) RPCs will dominate over the (fast)
>> read of the index, except for very small clusters or very large
>> Inserting with the BatchWriter is eventually dominated by compactions,
>> which is a merge sort, or O(n log n).
>> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]>
>> > BatchWriter, but I would be interested in the answer assuming a
>> > pre-sorted rfile.
>> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]>
>> >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of
>> >> Key/Values or usinga BatchWriter?
>> >> On 10/18/12 10:49 AM, Jeff Kubina wrote:
>> >>> I am deriving the time complexities for an algorithm I implemented in
>> >>> Hadoop using Accumulo and need to know the time complexity of bulk
>> >>> inserting m records evenly distributed across p nodes into an empty
>> >>> table with p tablet servers. Assuming B is the bandwidth of the
>> >>> network, would the communication complexity be O(m/B) and the
>> >>> computation complexity O(m/p * log(m/p))? If the table contained n
>> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?