

Re: Using Arrays in Apache Avro
I had no hand in the design, but it is very elegant and I'll throw in my two cents.

Avro is an interchange format. The in-memory representation is entirely up to you and your implementation language of choice.

The provided Java implementation allows seamless mixing of three representations: Generic (everything is an Object with some conventions, e.g. strings must be some sort of CharSequence but are generally read as String or Utf8, and arrays are handled as java.util.List), Specific (which allows generated Java classes for record schemas and real Java enums for enum schemas), and Reflect (which lets you serialize/deserialize regular Java objects via reflection, and incidentally DOES support (de)serialization of native arrays).

Since at the Generic level the in-memory representation of any schema is an Object (that includes the primitive types, which must be boxed, and null, which we could argue about semantically), it would be hard to deal with unboxed primitive array elements anyway. At that point, I don't think there is any real benefit to using native arrays, and as mentioned, java.util.List provides a more flexible interface (note that when serializing, as opposed to deserializing, any java.util.Collection will do, though it is to your benefit to use one with a defined ordering). Note also that Avro supports object reuse during deserialization, which is more likely to be effective with a List implementation (since you can't change the size of an array).
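
To make the boxing and reuse points concrete, here is a minimal plain-Java sketch. It deliberately does not use the Avro API: `ListReuseSketch` and `readArray` are hypothetical names standing in for a decoder, and the `double[]` argument is only an assumed stand-in for decoded values. It shows that a List handed back in can be cleared and refilled with a different number of elements, which a fixed-size native array cannot do.

```java
import java.util.ArrayList;
import java.util.List;

public class ListReuseSketch {
    // Hypothetical stand-in for a decoder (not an Avro method):
    // refills `reuse` when one is supplied, otherwise allocates a new list.
    public static List<Object> readArray(double[] decoded, List<Object> reuse) {
        List<Object> out = (reuse != null) ? reuse : new ArrayList<>();
        out.clear();          // drop old elements, keep the same list object
        for (double d : decoded) {
            out.add(d);       // autoboxed to Double, since elements are Objects
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> first = readArray(new double[] {9.97, 5.56, 21.48}, null);
        // Reuse the same list for a shorter array; a double[3] could not shrink.
        List<Object> second = readArray(new double[] {1.0}, first);
        System.out.println(second == first);
        System.out.println(second);
    }
}
```

With a native array, the second call would have to allocate a new array whenever the element count changed, so the reuse would buy nothing.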

Were you really to care (as per my point about elegance above), you can implement your own in-memory representations (though you'd want to have a pretty good reason, and I'm not suggesting this is one of them). Indeed, this is a feature we use ourselves: for a certain application data type, the most natural in-memory representation is quite different from the most efficient serialized schema. Avro makes it easy for us to do this without "hacking" anything, though at the cost of implementing a relatively small amount of code, and in our case we only care about it in Java.

On Sep 24, 2013, at 2:20 PM, Mika Ristimaki <[EMAIL PROTECTED]> wrote:

>
> On Sep 24, 2013, at 9:46 PM, Raihan Jamal <[EMAIL PROTECTED]> wrote:
>
>> Thanks a lot, Mika. Yeah, it works now, but my second question is: does the Avro schema that I have made look good compared to the JSON value that we were using previously?
>> I thought we could use an array for that, so I designed it that way in Avro.
>>
>
> This is an application design question and not related to Avro. If you have a list of prices, an array is a good place to store them.
>
>> And also, why does the Avro array use the java.util.List datatype? Just curious to know about that as well.
>
> Someone who has actually designed Avro can answer this better, but I assume that List was chosen because it is much more convenient to use than Java arrays. You don't need to know the size beforehand, etc.
>
> -Mika
>
>>
>> Thanks for the help.
>>
>> Raihan Jamal
>>
>> On Tue, Sep 24, 2013 at 11:40 AM, Mika Ristimaki <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> An Avro array uses the java.util.List datatype, so you must do something like:
>>
>> List<Double> nums = new ArrayList<Double>();
>> nums.add(Double.valueOf(9.97));
>> .
>> .
>>
>> On Sep 24, 2013, at 9:02 PM, Raihan Jamal <[EMAIL PROTECTED]> wrote:
>>
>>> Earlier, I was using JSON in our project, so the data for one of our attributes looks like the following. Below is the attribute `e3` data in JSON format.
>>>
>>> {"lv":[{"v":{"prc":9.97}},{"v":{"prc":5.56}},{"v":{"prc":21.48}}]}
>>>
>>> Now I am planning to use Apache Avro as our data serialization format, so I decided to design an Avro schema for the above attribute data. I came up with the design below.
>>>  
>>>   {
>>>      "namespace": "com.avro.test.AvroExperiment",
>>>      "type": "record",
>>>      "name": "AVG_PRICE",
>>>      "doc": "AVG_PRICE data",
>>>      "fields": [
>>>          {"name": "prc", "type": {"type": "array", "items": "double"}}