-
Map having <string, Object>
Gaurav 2011-12-07, 13:16
Hi,

We have a requirement to send typed key-value pairs from a server to clients (in various languages). A value can be one of the primitive types or a map of the same (string, Object) type.

One option is to construct a record schema on the fly; the second option is to use unions to write the schema in a general way.

The problem with option 1 is that we have to construct the schema every time, depending on the keys, and then attach the entire schema string to a relatively small record.

With the second option, you don't need to write the schema on the wire, because the client already has it.

I have written one such sample schema:

{"type":"map","values":["int","long","float","double","string","boolean",{"type":"map","values":["int","long","float","double","string","boolean"]}]}

Do you think writing something of this sort makes sense, or is there a better approach?

Thanks,
Gaurav Nanda
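To make the size concern with option 1 concrete, here is a minimal Python sketch that builds a record schema on the fly from one message and compares its length with the message itself. It uses only the standard library, not the Avro library; the record name "Dynamic", the helper names, and the sample message are assumptions made for illustration.

```python
import json


def avro_type(value) -> str:
    """Map a Python value to an Avro primitive type name (illustrative policy)."""
    if isinstance(value, bool):  # bool must be tested before int
        return "boolean"
    if isinstance(value, int):
        return "int" if -2**31 <= value < 2**31 else "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def record_schema_for(record: dict, name: str = "Dynamic") -> str:
    """Build a one-off record schema (as JSON text) from the keys of one message."""
    fields = [{"name": key, "type": avro_type(value)} for key, value in record.items()]
    schema = {"type": "record", "name": name, "fields": fields}
    return json.dumps(schema, separators=(",", ":"))


# Hypothetical small message, as in "a relatively small record".
message = {"uid": 42, "ok": True}
schema_text = record_schema_for(message)
payload_text = json.dumps(message, separators=(",", ":"))

print("schema chars: ", len(schema_text))
print("payload chars:", len(payload_text))
```

For a message this small, the schema text is several times longer than the data it describes, which is the overhead described above.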
-
Re: Map having <string, Object>
Tatu Saloranta 2011-12-07, 15:47
For this kind of loose data, perhaps JSON would serve you better, unless you absolutely have to use Avro?
-+ Tatu +-
-
Re: Map having <string, Object>
Gaurav Nanda 2011-12-07, 17:10
I agree that in this case JSON would be equally helpful. But my application has one more type of message, where untagged data can provide a compact encoding. So, to maintain consistency, I preferred to send these kinds of messages using Avro as well.
Regarding the case where untagged data can provide a compact encoding: there too, my schema has to be generated dynamically (i.e. at runtime), so it has to be passed to the client. Would Avro be better than compressed JSON in that case?
Thanks, Gaurav Nanda
-
Re: Map having <string, Object>
Tatu Saloranta 2011-12-07, 17:27
It seems to me that the hassle of dynamically generating one-off schemas would make this a somewhat sub-optimal use case. Conversely, if you just define a generic schema that allows sending key-value pairs (and perhaps a type), there is no size benefit, because you add back everything the schema would otherwise take out of the payload. Another alternative is to use content-type or related metadata to allow the use of different low-level data formats.
Beyond compressed JSON (which can be very fast with LZF or Snappy), you could also consider one of the binary encodings for JSON.
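As a rough illustration of the compressed-JSON option, here is a minimal Python sketch. LZF and Snappy are not in the Python standard library, so zlib is swapped in as a stand-in; the function names and the sample messages are assumptions.

```python
import json
import zlib


def pack(obj) -> bytes:
    """Serialize to compact JSON text, then compress (zlib stands in for LZF/Snappy)."""
    text = json.dumps(obj, separators=(",", ":"))
    return zlib.compress(text.encode("utf-8"))


def unpack(blob: bytes):
    """Reverse of pack(): decompress, then parse the JSON text."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical batch of similar key-value messages.
messages = [{"uid": i, "name": "user", "score": 1.5, "active": True} for i in range(200)]

raw = json.dumps(messages, separators=(",", ":")).encode("utf-8")
blob = pack(messages)

print("raw JSON bytes:  ", len(raw))
print("compressed bytes:", len(blob))
```

The repeated key names compress well across a batch like this. A single tiny message gains much less, and can even grow slightly because of the compression header, so measuring on realistic payloads is worthwhile before choosing.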
-+ Tatu +-
-
Re: Map having <string, Object>
Scott Carey 2011-12-07, 18:01
The best practice is usually to use the flexible schema with the union value rather than transmit schemas each time. This restricts the possibilities to the set defined, and the type selected in the branch is available on the decoding side. In the case above, the number of variants is not large enough to make this approach unwieldy, and there may be benefits to knowing the type on the other side without inspecting the value.
You can construct an Avro schema that represents all possible data variants, effectively tagging the types of every field during serialization using unions. However, none of the Avro APIs are (yet) optimized for this use case; it would be somewhat clumsy to work with, and it is less space efficient. Other serialization systems are a better fit for completely open-ended data schemas.
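To show what "tagging the types of every field" looks like on the wire, here is a hand-rolled Python sketch that encodes a map against the union schema from the original post. It follows the Avro binary encoding rules (zigzag varints for int and long, length-prefixed UTF-8 strings, count-prefixed map blocks ended by a zero count, and a branch index written before each union value), but it is not the Avro library. All function names and the policy for picking a branch from a Python value are assumptions.

```python
import struct

# Union branches in the order used by the schema from this thread.
# The nested map is the last branch; its own values use only the primitives.
PRIMITIVES = ["int", "long", "float", "double", "string", "boolean"]
MAP_BRANCH = len(PRIMITIVES)


def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes get short codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def branch_for(value) -> str:
    """Pick a union branch for a Python value (illustrative policy)."""
    if isinstance(value, bool):  # bool must be tested before int
        return "boolean"
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return "int"
        if -2**63 <= value < 2**63:
            return "long"
        raise OverflowError("integer does not fit in an Avro long")
    if isinstance(value, float):
        return "double"  # Python floats are 64-bit, so "float" is never picked
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "map"
    raise TypeError(f"no union branch for {type(value).__name__}")


def encode_value(value, allow_map: bool = True) -> bytes:
    """Write the branch index (the type tag), then the value itself."""
    branch = branch_for(value)
    if branch == "map":
        if not allow_map:
            raise TypeError("the schema allows only one level of nesting")
        return zigzag_varint(MAP_BRANCH) + encode_map(value, allow_map=False)
    tag = zigzag_varint(PRIMITIVES.index(branch))
    if branch in ("int", "long"):
        return tag + zigzag_varint(value)
    if branch == "double":
        return tag + struct.pack("<d", value)  # 8 bytes, little-endian
    if branch == "string":
        return tag + encode_string(value)
    return tag + (b"\x01" if value else b"\x00")  # boolean


def encode_map(mapping: dict, allow_map: bool = True) -> bytes:
    out = bytearray()
    if mapping:
        out += zigzag_varint(len(mapping))  # a single block holding every entry
        for key, value in mapping.items():
            out += encode_string(key) + encode_value(value, allow_map)
    out += zigzag_varint(0)  # a zero count ends the map
    return bytes(out)


print(encode_map({"a": 1}).hex())
```

With only seven branches, the tag costs one extra byte per value; the keys themselves are still written as strings, which is why this layout saves less space than a fixed record schema would.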
One can look at Avro as a serialization system, but I see it more as a system for describing your data. It provides tools for describing and transforming data that exists in related forms (e.g. older or newer schema versions) to the form you want to see (e.g. current schema). Whether this data is serialized or an object graph is less important than that it conforms to a schema. A transformation between a serialized form and an object graph is one use case of many possibilities.
Think about your use case from that perspective. Ask whether this is data that gains benefit from describing it with an Avro Schema and then interpreting it as conforming to that schema. If it is completely open ended there may be little benefit and significant overhead.
You can also embed JSON or binary JSON in Avro data fairly easily using Jackson JSON.
-
Re: Map having <string, Object>
Doug Cutting 2011-12-07, 17:36
On 12/07/2011 05:16 AM, Gaurav wrote:
> One option is to construct record schema on the fly and second option is to
> use unions to write schema in a general way.
>
> Problems with 1 is that we have to construct schema everytime depending upon
> keys and then attach the entire string schema to a relatively small record.

You might instead write the Schema more efficiently in binary. It could be written as binary Json using the following:

http://avro.apache.org/docs/current/api/java/org/apache/avro/data/Json.html

Or there's an even more efficient schema-for-schemas approach in:

https://issues.apache.org/jira/browse/AVRO-251

(I don't know if that patch is still up to date. If you like I can update it. If someone finds it useful then I'll commit it.)

> But in second schema, u don't need to write schema on the wire as it is
> present with client also.
>
> I have written one such sample schema:
> {"type":"map","values":["int","long","float","double","string","boolean",{"type":"map","values":["int","long","float","double","string","boolean"]}]}
>
> Do you guys think writing something of this sort makes sense or is there any
> better approach to this?

A map like that is a totally reasonable approach when things vary a lot. If the schema is really different for each instance written then building a new schema each time might end up hurting performance. If there are actually only relatively few schemas that re-occur then they might be cached and reused.

If some fields are always present then you might put those in a record and have a field in the record with a map like that for other stuff. This is a common approach. Every record might have a date and uid or somesuch, but other aspects may vary.

Doug
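The record-plus-map layout can be sketched as follows, in plain Python with the schema written as a dict that serializes to Avro schema JSON. The record name "Event", the field name "extras", the choice of long for date and string for uid, and the helper split_fields are all assumptions for illustration; the thread only says that a date and uid might always be present.

```python
import json

# Union of primitive value types, as in the schema from the original post.
VALUE_UNION = ["int", "long", "float", "double", "string", "boolean"]

# Fixed fields live in the record; everything else goes into one map field.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",  # hypothetical record name
    "fields": [
        {"name": "date", "type": "long"},   # assumed: epoch milliseconds
        {"name": "uid", "type": "string"},  # assumed: string identifier
        {"name": "extras", "type": {"type": "map", "values": VALUE_UNION}},
    ],
}

FIXED_FIELDS = ("date", "uid")


def split_fields(message: dict) -> dict:
    """Reshape a flat message into the record layout described by EVENT_SCHEMA."""
    missing = [name for name in FIXED_FIELDS if name not in message]
    if missing:
        raise KeyError(f"missing required fields: {missing}")
    record = {name: message[name] for name in FIXED_FIELDS}
    record["extras"] = {k: v for k, v in message.items() if k not in FIXED_FIELDS}
    return record


flat = {"date": 1323279360000, "uid": "u1", "color": "red", "retries": 3}
print(json.dumps(EVENT_SCHEMA))
print(split_fields(flat))
```

Because the schema is fixed, both sides can hold it ahead of time and nothing schema-related needs to travel with each message; only the entries in the map carry per-value type tags.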