Avro >> mail # user >> avro object reuse

RE: avro object reuse

We configure more than 100MB for MapReduce sorting. The memory we allocate for other work in the mapper is actually larger, but for this job we always get out-of-memory exceptions and the job cannot complete. We are trying to find out whether there is a way to avoid this problem.
Ey-Chih Chow

Date: Thu, 9 Jun 2011 15:42:10 -0700
Subject: Re: avro object reuse
The most likely candidate for creating many instances of BufferAccessor and ByteArrayByteSource is BinaryData.compare() and BinaryData.hashCode().  Each call will create one of each (hash) or two of each (compare).  These are only 32 bytes per instance and quickly become garbage that is easily cleaned up by the GC.  
The two classes below take only 32 bytes per instance and about 8MB per class in total. On the other hand, the byte[] instances appear to average about 24K each and use 100MB in total. Is this the size of your configured MapReduce sort MB?
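
The per-instance sizes mentioned above can be checked by dividing #bytes by #instances for each row of the jmap histogram quoted further down in this thread. The following is a minimal sketch of that arithmetic; the class and method names are made up for illustration and are not part of Avro or jmap.

```java
public class HistogramAverages {
    /** Average bytes per instance for one histogram row. */
    static double averageBytes(long instances, long totalBytes) {
        return (double) totalBytes / instances;
    }

    public static void main(String[] args) {
        // Values copied from the jmap histogram quoted in this thread.
        // BinaryDecoder$BufferAccessor: exactly 32 bytes per instance.
        System.out.println(averageBytes(272948, 8734336L));
        // BinaryDecoder$ByteArrayByteSource: exactly 32 bytes per instance.
        System.out.println(averageBytes(272945, 8734240L));
        // byte[]: a little under 24,000 bytes per instance on average.
        System.out.printf("%.0f%n", averageBytes(4199, 100241168L));
    }
}
```

The two decoder helper classes come out at exactly 32 bytes per instance, while the byte[] row averages a little under 24,000 bytes, which is consistent with the "about 24K" estimate.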
On 6/9/11 3:08 PM, "ey-chih chow" <[EMAIL PROTECTED]> wrote:

We did more monitoring.  At one point, we got the following histogram via jmap.  The question is why there are so many instances of BinaryDecoder$BufferAccessor and BinaryDecoder$ByteArrayByteSource.  How can we avoid this?  Thanks.

Object Histogram:

num       #instances    #bytes  Class description
1:              4199    100241168       byte[]
2:              272948  8734336 org.apache.avro.io.BinaryDecoder$BufferAccessor
3:              272945  8734240 org.apache.avro.io.BinaryDecoder$ByteArrayByteSource
4:              2093    5387976 int[]
5:              23762   2822864 * ConstMethodKlass
6:              23762   1904760 * MethodKlass
7:              39295   1688992 * SymbolKlass
8:              2127    1216976 * ConstantPoolKlass
9:              2127    882760  * InstanceKlassKlass
10:             1847    742936  * ConstantPoolCacheKlass
11:             9602    715608  char[]
12:             1072    299584  * MethodDataKlass
13:             9698    232752  java.lang.String
14:             2317    222432  java.lang.Class
15:             3288    204440  short[]
16:             3167    156664  * System ObjArray
17:             2401    57624   java.util.HashMap$Entry
18:             666     53280   java.lang.reflect.Method
19:             161     52808   * ObjArrayKlassKlass
20:             1808    43392   java.util.Hashtable$Entry
Subject: RE: avro object reuse
Date: Wed, 1 Jun 2011 15:14:03 -0700
We call toString() on Avro Utf8 objects a lot.  Will this cause Jackson calls?  Thanks.

Date: Wed, 1 Jun 2011 13:38:39 -0700
Subject: Re: avro object reuse

This is great info.
Jackson should only be used once when the file is opened, so this is confusing from that point of view.  Is something else using Jackson or initializing an Avro JsonDecoder frequently?  There are over 100000 Jackson DeserializationConfig objects.
Another place that parses the schema is in AvroSerialization.java.  Does the Hadoop getDeserializer() API method get called once per job, or per record?  If this is called more than once per map job, it might explain this.
In principle, Jackson is only used by a mapper during initialization.  The histogram below indicates that this may not be the case, or that something outside of Avro is causing a lot of Jackson JSON parsing.
Are you using something that converts the Avro data to JSON form?  For example, toString() on most Avro datum objects does a lot of work with Jackson.  However, the objects below are deserializer objects, not serializer objects, so that is not likely the issue.
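
To illustrate the "once per job, or per record" question raised above, here is a minimal sketch of caching an expensive parse so that it runs only once no matter how many records are processed. Everything in it is an assumption for illustration: expensiveParse is a stand-in for a JSON schema parse and is not an Avro, Hadoop, or Jackson API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ParseOnceSketch {
    // Counts how often the (hypothetical) expensive parse actually runs.
    static final AtomicInteger parseCount = new AtomicInteger();
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    // Stand-in for an expensive JSON schema parse; not a real library call.
    static Object expensiveParse(String schemaJson) {
        parseCount.incrementAndGet();
        return new Object();
    }

    // Returns the cached result, parsing only the first time a text is seen.
    static Object parsedSchema(String schemaJson) {
        return CACHE.computeIfAbsent(schemaJson, ParseOnceSketch::expensiveParse);
    }

    public static void main(String[] args) {
        String schemaJson = "{\"type\":\"string\"}";
        for (int record = 0; record < 100_000; record++) {
            parsedSchema(schemaJson); // a per-record call hits the cache
        }
        System.out.println("parses: " + parseCount.get()); // prints "parses: 1"
    }
}
```

If the parse ran on every call instead, a counter like parseCount would grow with the number of records, which is the kind of pattern that roughly 100,000 parser-related instances in a histogram would suggest.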
On 6/1/11 11:34 AM, "ey-chih chow" <[EMAIL PROTECTED]> wrote:

We ran jmap on one of our mappers and found the top usage as follows:
num   #instances   #bytes      Class description
--------------------------------------------------------------------------
1:    24405        291733256   byte[]
2:    6056         40228984    int[]
3:    388799       19966776    char[]
4:    101779       16284640    org.codehaus.jackson.impl.ReaderBasedParser
5:    369623       11827936    java.lang.String
6:    111059       8769424     java.util.HashMap$Entry[]
7:    204083       8163320     org.codehaus.jackson.impl.JsonReadContext
8:    211374       6763968     java.util.HashMap$Entry
9:    102551       5742856     org.codehaus.jackson.util.TextBuffer
10:   105854       5080992     java.nio.HeapByteBuffer
11:   105821       5079408     java.nio.HeapCharBuffer
12:   104578       5019744     java.util.HashMap
13:   102551       4922448     org.codehaus.jackson.io.IOContext
14:   101782       4885536     org.codehaus.jackson.map.DeserializationConfig
15:   101783       4071320     org.codehaus.jackson.sym.CharsToNameCanonicalizer
16:   101779       4071160     org.codehaus.jackson.map.deser.StdDeserializationContext
17:   101779       4071160     java.io.StringReader
18:   101754       4070160     java.util.HashMap$KeyIterator
It looks like Jackson eats up a lot of memory.  Our mapper reads in files in the Avro format.  Does Avro use Jackson a lot when reading Avro files?  Is there any way to improve this?  Thanks.
Ey-Chih Chow
Date: Tue, 31 May 2011 18:26:23 -0700
Subject: Re: avro object reuse

All of those instances are short-lived.  If you are running out of memory, it's not likely due to a lack of object reuse.  Not reusing objects tends to cause more CPU time in the garbage collector, but not out-of-memory conditions.  This can be hard to do on a cluster, but grabbing 'jmap -histo' output from a JVM that has larger-than-expected heap usage can often quickly identify the cause of memory consumption issues.
I'm not sure if AvroUtf8InputFormat can safely re-use its instances of Utf8 or not.
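
As background for why reusing an instance may or may not be safe, here is a minimal sketch of the general trade-off. It does not reproduce AvroUtf8InputFormat or Utf8: MutableText and next are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {
    // Hypothetical mutable holder, standing in for a reusable value object.
    static final class MutableText {
        private final StringBuilder chars = new StringBuilder();

        MutableText set(String s) {
            chars.setLength(0); // overwrite the previous contents in place
            chars.append(s);
            return this;
        }

        @Override
        public String toString() {
            return chars.toString();
        }
    }

    // Reuses the caller-supplied instance when one is given, else allocates.
    static MutableText next(MutableText reuse, String line) {
        MutableText target = (reuse != null) ? reuse : new MutableText();
        return target.set(line);
    }

    public static void main(String[] args) {
        String[] lines = {"a", "b", "c"};
        List<MutableText> retained = new ArrayList<>();
        MutableText shared = new MutableText();
        for (String line : lines) {
            retained.add(next(shared, line)); // the same object is added each time
        }
        System.out.println(retained); // prints [c, c, c]
    }
}
```

Reuse avoids one allocation per record, but any caller that holds on to a previously returned value sees it overwritten by the next record. Allocating a fresh object per record avoids that hazard at the cost of more short-lived garbage.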

On 5/31/11 5:40 PM, "ey-chih chow" <[EMAIL PROTECTED]> wrote:

I actually looked into the Avro code to find out how Avro does object reuse.  I looked at AvroUtf8InputFormat and have the following question.  Why does a new Utf8 object have to be created each time the method next(AvroWrapper<Utf8> key, NullWritable value) is called?  Will t