HBase user mailing list >> Explosion in datasize using HBase as a MR sink


Rob 2013-05-29, 15:28
Ted Yu 2013-05-29, 16:20
Rob 2013-05-29, 19:27
Ted Yu 2013-05-29, 19:32
Rob 2013-05-29, 20:44
Stack 2013-05-30, 02:51
Rob Verkuylen 2013-05-30, 19:52
Re: Explosion in datasize using HBase as a MR sink
At your data set size, I would go with HFileOutputFormat and then bulk load into HBase. Why go through the Put flow at all (memstore, flush, WAL), especially when you have the input at your disposal for a re-try if something fails?
Sounds faster to me anyway.
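For illustration, a minimal sketch of that bulk-load flow against the 0.94-era API; the job name, table name "T2.1", and output path are assumptions, not details from this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "parse-to-hfiles");
        job.setJarByClass(BulkLoadSketch.class);
        // ... set the parse mapper here; it should emit ImmutableBytesWritable
        // row keys with Put (or KeyValue) values, as the existing job does ...
        HTable table = new HTable(conf, "T2.1");
        // Wires in the partitioner/reducer so the output HFiles line up with
        // the table's existing region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        Path out = new Path("/tmp/t21-hfiles");
        FileOutputFormat.setOutputPath(job, out);
        if (job.waitForCompletion(true)) {
          // Moves the finished HFiles into the table, bypassing memstore and WAL.
          new LoadIncrementalHFiles(conf).doBulkLoad(out, table);
        }
      }
    }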

On May 30, 2013, at 10:52 PM, Rob Verkuylen <[EMAIL PROTECTED]> wrote:

>
> On May 30, 2013, at 4:51, Stack <[EMAIL PROTECTED]> wrote:
>
>> Triggering a major compaction does not alter the overall 217.5GB size?
>
> A major compaction reduces the size from the original 219GB to 217.5GB, so barely a reduction.
> 80% of the regions are 1.4GB before and after. I haven't merged the smaller regions,
> but even that would not bring the size down to the 2.5-5GB or so I would expect given T2's size.
>
>> You have speculative execution turned on in your MR job, so it's possible you
>> write many versions?
>
> I've turned off speculative execution (through conf.set) just for the mappers, since we're not using reducers. Should we turn it off for the reducers as well?
> I will triple-check the actual job settings in the job tracker, since I need to make the settings at the job level.
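For reference, a minimal sketch of those per-job settings, assuming the Hadoop 1.x property names of the era:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Turn off speculative execution for a single job.
    public final class SpeculationOff {
      static void disableSpeculation(Job job) {
        Configuration conf = job.getConfiguration();
        // A speculative duplicate of a map task would re-issue the same Puts.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        // Harmless with zero reducers, but explicit is safer.
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
      }
    }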
>
>> Does your MR job fail many tasks (a task that fails will still have written some subset of its
>> output before failing, hence bloating your versions)?
>
> We've had problems with failing mappers because of ZooKeeper timeouts on large inserts;
> we increased the ZooKeeper timeout and hbase.hstore.blockingStoreFiles to accommodate, and now we don't
> get failures. This job writes to a freshly created table with versions set to 1, so I assume there shouldn't be
> extra versions(?).
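As a sanity check, a sketch of how versions get pinned at table-creation time with the 0.94-era Java API; the table name "T2.1" and column family "d" are illustrative assumptions. Note that even with max versions = 1, superseded cells linger in the store files until a major compaction rewrites them:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateT21 {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTableDescriptor td = new HTableDescriptor("T2.1");  // hypothetical table name
        HColumnDescriptor cf = new HColumnDescriptor("d");   // hypothetical family name
        cf.setMaxVersions(1);  // keep only the newest cell per row/column after compaction
        td.addFamily(cf);
        new HBaseAdmin(conf).createTable(td);
      }
    }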
>
>> You are putting everything into protobufs?  Could that be bloating your
>> data?  Can you take a smaller subset and dump a string version of the pb
>> to the log?  Use TextFormat:
>> https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/TextFormat#shortDebugString(com.google.protobuf.MessageOrBuilder)
>
> The protobufs reduce the size to roughly 40% of the original XML data in T1.
> The MR parser is a port of the Python parse code we use going from T1 to T2.
> I've done manual comparisons on 20-30 records from T2.1 and T2 and they are nearly identical,
> with only minute differences because of slightly different parsing. I've done these in the hbase shell;
> I will try log dumping them too.
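A sketch of what that log dump could look like, following the TextFormat pointer above; the generated message class "Record", family "d", qualifier "pb", and the row-key argument are all assumptions:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;
    import com.google.protobuf.TextFormat;

    public class DumpPb {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "T2.1");
        Result r = table.get(new Get(Bytes.toBytes(args[0])));  // row key from the command line
        byte[] raw = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("pb"));
        Record rec = Record.parseFrom(raw);  // "Record" is the hypothetical generated class
        // shortDebugString renders the whole message on a single log-friendly line.
        System.out.println(TextFormat.shortDebugString(rec));
        table.close();
      }
    }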
>
>> It can be informative to look at the HFile content.  It could give you a clue
>> as to the bloat.  See http://hbase.apache.org/book.html#hfile_tool
>
> I will give this a go and report back. Any other debugging suggestions are more than welcome :)
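For anyone following along, the HFile tool from that link is invoked roughly as below; the region and file names are placeholders you'd read out of HDFS. The -m flag prints the file's metadata and -s a summary of key/value statistics, which is where per-cell bloat should show up:

    hbase org.apache.hadoop.hbase.io.hfile.HFile -m -s -f \
      hdfs:///hbase/T2.1/<region-encoded-name>/<family>/<hfile-name>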
>
> Thnx, Rob
>

Rob Verkuylen 2013-06-04, 19:58
Stack 2013-06-04, 23:07