HBase >> mail # user >> hadoop without append in the absence of puts


Re: hadoop without append in the absence of puts
> I agree that even though these are rare, they are not rare enough to take a
> risk. But they could be rare enough to justify a less efficient
> implementation of the WAL. Would it be reasonable to use an implementation
> of HLog that - at the price of performance - persists the WAL to HDFS
> without relying on append?

Since edits to .META. should be rare if you're using bulk import, why
not just pay the cost of appending to the WAL? My guess is that
performance would not suffer greatly. It would be nice to benchmark my
claim, though. :)

-Joey

On Wed, Jun 22, 2011 at 9:45 PM, Andreas Neumann <[EMAIL PROTECTED]> wrote:
> Thanks Andy for the clear response.
>
> We are indeed going to use bulk load only, and no puts, deletes, or
> increments. So the only puts we will have are those caused by changes in
> the table structure. I guess that includes region splits, but also
> reassignment of a region after its region server dies.
>
> I agree that even though these are rare, they are not rare enough to take a
> risk. But they could be rare enough to justify a less efficient
> implementation of the WAL. Would it be reasonable to use an implementation
> of HLog that - at the price of performance - persists the WAL to HDFS
> without relying on append?
>
> Cheers -Andreas.
>
>
> On Wed, Jun 22, 2011 at 4:36 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
>
>> > From: Andreas Neumann <[EMAIL PROTECTED]>
>> > If we only load data in bulk (that is, via doBulkLoad(), not using
>> > TableOutputFormat), do we still risk data loss? My understanding is
>> > that append is needed for the WAL, and the WAL is needed only
>> > for puts. But bulk loads bypass the WAL.
>>
>> Correct.
>>
>> If you are doing read-only serving of HFiles built by MR and loaded by
>> doBulkLoad, then you would not need append support.
>>
>> If you are adding new data to tables via the HBase API, then sooner or
>> later this will change table structure, which is recorded via Puts to
>> .META., which is self-hosted. Circumstances where those edits can be lost
>> without working append support in HDFS may be rare, but not rare enough
>> in my estimation. Losing edits to .META. is bad: it can lead to missing
>> regions and hung clients. Human intervention will be necessary, and the
>> time scale for administrative recovery is usually an availability problem.
>>
>> > For instance, when a region is split, the master must write
>> > the new meta data to the meta regions. Would that require a WAL
>> > or rely on append in some other way?
>>
>> See above.
>>
>> > Are there other situations where the WAL is needed (or append
>> > is needed) to avoid data loss?
>>
>> Deletes? Increments? For these operations you would not lose data per se if
>> you don't have append support, but the client may be incorrectly led to
>> believe they were successfully applied under the same low probability
>> failure conditions that can corrupt META.
>>
>>  - Andy
>>
>>
>

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
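
[Editor's note] The bulk-load path discussed in this thread can be sketched roughly as follows. This is an illustrative sketch against the HBase client API of that era (LoadIncrementalHFiles and HTable); the table name and HFile directory are hypothetical, and the code requires a running HBase cluster and the HBase jars, so treat it as an outline rather than a drop-in implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Directory of HFiles previously written by a MapReduce job using
    // HFileOutputFormat (hypothetical path for illustration).
    Path hfileDir = new Path("/user/hbase/bulk-output");

    // Hypothetical target table.
    HTable table = new HTable(conf, "mytable");

    // doBulkLoad moves the prepared HFiles directly into the table's
    // region directories. No Puts are issued for the data itself, so the
    // WAL (and therefore HDFS append support) is bypassed for the bulk
    // data -- which is why only .META. edits remain a concern, as
    // discussed above.
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
  }
}
```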