Accumulo >> mail # user >> What is the Communication and Time Complexity for Bulk Inserts?

Re: What is the Communication and Time Complexity for Bulk Inserts?
For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) *
p), where n is the size of the file, P is the total number of tablets
(proportional to tablet servers), and p is the number of tablets that get
assigned that file?
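As a toy illustration of that bound (hypothetical names, not Accumulo's actual bulk-import code): the index lookup that assigns one sorted file to tablets can be modeled as two binary searches over the P sorted tablet end rows, returning the p tablets whose ranges overlap the file's key range.

```python
# Illustrative sketch only: model tablet assignment for a bulk-loaded file.
# end_rows are the P sorted tablet end-row boundaries; tablet i covers
# keys up to end_rows[i], and the last tablet covers everything after.
import bisect

def tablets_for_file(first_row, last_row, end_rows):
    """Return indices of tablets whose ranges overlap [first_row, last_row]."""
    lo = bisect.bisect_left(end_rows, first_row)   # O(log P) search
    hi = bisect.bisect_left(end_rows, last_row)    # O(log P) search
    return list(range(lo, hi + 1))                 # the p overlapping tablets

tablets = tablets_for_file("c", "n", ["d", "h", "q"])
```

The O(p) factor then comes from contacting each of those p tablets, not from the lookup itself.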

For the BatchWriter case, there's a client-side lookup/binning that takes
O(log(p)) per entry, so the latter would be O(n/p * (log(n/p) + log(p)))
for each of p partitions. So, O(n*log(n)) in aggregate. Yes/no?
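A minimal sketch of that client-side binning step (illustrative names, not the BatchWriter's real internals): each entry is routed to a tablet by a binary search over the sorted tablet boundaries, which is the O(log(p))-per-entry cost described above.

```python
# Illustrative sketch only: bin n row keys into p+1 tablet bins by
# binary search over sorted tablet end-row boundaries, O(log p) each.
import bisect
from collections import defaultdict

def bin_mutations(rows, boundaries):
    """Group row keys by the tablet (bin index) that would receive them."""
    bins = defaultdict(list)
    for row in rows:                                  # n entries total
        i = bisect.bisect_left(boundaries, row)       # O(log p) lookup
        bins[i].append(row)
    return bins

bins = bin_mutations(["a", "c", "m", "x"], ["d", "q"])
```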

On Wed, Oct 24, 2012 at 3:57 PM, Jeff Kubina <[EMAIL PROTECTED]> wrote:

> @eric, assuming the records are evenly distributed and network bandwidth
> is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))?
> On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>> Adding a sorted file to accumulo (bulk loading) is essentially
>> constant in the normal case.  It is O(n) + O(p) for the worst case
>> where the index must be read, and the file assigned to every tablet
>> server.  In this case, the (slow) RPCs will dominate over the (fast)
>> read of the index, except for very small clusters or very large
>> indexes.
>> Inserting with the BatchWriter is eventually dominated by compactions,
>> which is a merge sort, or O(n log n).
>> -Eric
>> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]>
>> wrote:
>> > BatchWriter, but I would be interested in the answer assuming a
>> > pre-sorted rfile.
>> >
>> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]>
>> wrote:
>> >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of
>> >> Key/Values or using a BatchWriter?
>> >>
>> >> On 10/18/12 10:49 AM, Jeff Kubina wrote:
>> >>>
>> >>> I am deriving the time complexities for an algorithm I implemented in
>> >>> Hadoop using Accumulo and need to know the time complexity of bulk
>> >>> inserting m records evenly distributed across p nodes into an empty
>> >>> table with p tablet servers. Assuming B is the bandwidth of the
>> >>> network, would the communication complexity be O(m/B) and the
>> >>> computation complexity O(m/p * log(m/p))? If the table contained n
>> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
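The O(m/p * log(m/p)) term in the question, and the merge-sort compaction cost mentioned earlier in the thread, can be pictured with a small sketch (plain Python, illustrative only): each tablet ends up merging its k sorted runs into one sorted stream, and a heap-based k-way merge of m/p total entries costs O(m/p * log k).

```python
# Illustrative sketch only: a tablet compacting its sorted runs (the
# output of earlier flushes) into one sorted stream via a k-way merge.
# heapq.merge keeps a heap of k heads, so merging m/p entries across
# k runs costs O(m/p * log k); repeated compactions over the data's
# lifetime yield the O(m/p * log(m/p)) per-tablet term above.
import heapq

runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]   # sorted runs on one tablet
merged = list(heapq.merge(*runs))           # heap-based k-way merge
```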