Accumulo >> mail # user >> What is the Communication and Time Complexity for Bulk Inserts?

Jeff Kubina 2012-10-18, 14:49
Josh Elser 2012-10-18, 15:20
Jeff Kubina 2012-10-18, 15:37
Re: What is the Communication and Time Complexity for Bulk Inserts?
Adding a sorted file to Accumulo (bulk loading) is essentially
constant time in the normal case. In the worst case it is O(n) + O(p),
where the file's index must be read and the file assigned to every
tablet server. In that case the (slow) RPCs will dominate the (fast)
read of the index, except on very small clusters or for very large
files.
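The cost model above can be sketched numerically. This is an illustrative back-of-envelope model, not Accumulo code; the per-entry and per-RPC cost constants are invented for the sake of the example.

```python
# Illustrative worst-case bulk-load cost model (not real Accumulo code).
# Assumed constants: index_read_cost seconds per index entry,
# rpc_cost seconds per tablet-server assignment RPC.

def bulk_load_cost(index_entries, tablet_servers,
                   index_read_cost=1e-6, rpc_cost=1e-3):
    """Worst case: read the whole file index (O(n)) and issue one
    assignment RPC per tablet server (O(p))."""
    return index_entries * index_read_cost + tablet_servers * rpc_cost

# With these made-up constants, the RPC term (100 * 1e-3 = 0.1 s)
# dominates the index-read term (10_000 * 1e-6 = 0.01 s), matching
# the claim that RPCs dominate except for very large file indexes.
cost = bulk_load_cost(index_entries=10_000, tablet_servers=100)
```

Pushing `index_entries` up by a few orders of magnitude (a very large file) or dropping `tablet_servers` to a handful (a very small cluster) flips which term dominates.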

Inserting with the BatchWriter is eventually dominated by compactions,
which perform a merge sort, i.e. O(n log n).
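The O(n log n) compaction cost can be sketched as follows. This is a simplified model of merging compactions, not Accumulo's actual compaction logic; the `fanout` parameter (files merged per pass) is invented for illustration.

```python
def compaction_passes(n, fanout=10):
    """Number of merging passes needed to fully compact n records when
    each pass merges `fanout` sorted files into one (fanout is an
    invented parameter, used only to illustrate the logarithm)."""
    passes, run = 0, 1
    while run < n:
        run *= fanout  # each pass grows the sorted-run size by `fanout`
        passes += 1
    return passes

def total_compaction_work(n, fanout=10):
    """Each pass rewrites all n records, so total work is roughly
    n * log_fanout(n) -- the O(n log n) merge-sort cost noted above."""
    return n * compaction_passes(n, fanout)
```

For example, fully compacting a million records with a fanout of 10 takes 6 passes, so about 6 million record rewrites in this model.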


On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]> wrote:
> BatchWriter, but I would be interested in the answer assuming a
> pre-sorted rfile.
> On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> wrote:
>> Are you referring to "bulk inserts" as importing a pre-sorted rfile of
>> Key/Values or using a BatchWriter?
>> On 10/18/12 10:49 AM, Jeff Kubina wrote:
>>> I am deriving the time complexities for an algorithm I implemented in
>>> Hadoop using Accumulo and need to know the time complexity of bulk
>>> inserting m records evenly distributed across p nodes into an empty
>>> table with p tablet servers. Assuming B is the bandwidth of the
>>> network, would the communication complexity be O(m/B) and the
>>> computation complexity O(m/p * log(m/p))? If the table contained n
>>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
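To make the asymptotic terms in the question concrete, here is a hypothetical worked example; all numbers (m, p, B, n) are invented, and the terms are unitless operation/record counts rather than measured costs.

```python
import math

# Hypothetical inputs (not from the thread):
# m records inserted, p tablet servers, B network bandwidth in
# records/sec, n records already in the table.
m, p, B, n = 10**9, 100, 10**7, 10**8

# O(m/B): time to ship m records over the network.
communication = m / B            # 1e9 / 1e7 = 100.0 seconds

# O(m/p * log(m/p)): each node sorts its m/p-record share.
sort_per_node = (m / p) * math.log2(m / p)

# O(n/p): each node's share of the existing records that the new
# records must be merged with (the non-empty-table case).
merge_existing = n / p           # 1e8 / 100 = 1_000_000.0 records
```

The per-node computation in the non-empty-table case is then `sort_per_node + merge_existing`, matching the O(m/p * log(m/p) + n/p) form in the question.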
Jeff Kubina 2012-10-24, 19:57
Adam Fuchs 2012-10-24, 21:50