Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Accumulo >> mail # user >> What is the Communication and Time Complexity for Bulk Inserts?


+
Jeff Kubina 2012-10-18, 14:49
+
Josh Elser 2012-10-18, 15:20
+
Jeff Kubina 2012-10-18, 15:37
+
Eric Newton 2012-10-24, 18:45
+
Jeff Kubina 2012-10-24, 19:57
Copy link to this message
-
Re: What is the Communication and Time Complexity for Bulk Inserts?
For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) *
p), where n is the size of the file, P is the total number of tablets
(proportional to tablet servers), and p is the number of tablets that get
assigned that file?

For the BatchWriter case, there's a client-side lookup/binning that takes
O(log(p)) per entry, so the latter would be O(n/p * (log(n/p) + log(p)))
for each of p partitions. So, O(n*log(n)) in aggregate. Yes/no?

Adam
On Wed, Oct 24, 2012 at 3:57 PM, Jeff Kubina <[EMAIL PROTECTED]> wrote:

> @eric, assuming the records are evenly distributed and network bandwidth
> is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))?
>
>
> On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[EMAIL PROTECTED]>wrote:
>
>> Adding a sorted file to accumulo (bulk loading) is essentially
>> constant in the normal case.  It is O(n) + O(p) for the worst case
>> where the index must be read, and the file assigned to every tablet
>> server.  In this case, the (slow) RPCs will dominate over the (fast)
>> read of the index, except for very small clusters or very large
>> indexes.
>>
>> Inserting with the BatchWriter is eventually dominated by compactions,
>> which is a merge sort, or O(n log n).
>>
>> -Eric
>>
>> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]>
>> wrote:
>> > BatchWriter, but I would be interested in the answer assuming a
>> > pre-sorted rfile.
>> >
>> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]>
>> wrote:
>> >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of
>> >> Key/Values or usinga BatchWriter?
>> >>
>> >> On 10/18/12 10:49 AM, Jeff Kubina wrote:
>> >>>
>> >>> I am deriving the time complexities for an algorithm I implemented in
>> >>> Hadoop using Accumulo and need to know the time complexity of bulk
>> >>> inserting m records evenly distributed across p nodes into an empty
>> >>> table with p tablet servers. Assuming B is the bandwidth of the
>> >>> network, would the communication complexity be O(m/B) and the
>> >>> computation complexity O(m/p * log(m/p))? If the table contained n
>> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
>>
>
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB