-
Does Avro GenericData.Record violate the .equals contract?
Andrew Kenworthy 2012-02-09, 15:02
Hallo,
I'm working with Avro as the serialization framework for my Hadoop map-reduce jobs, and am emitting GenericRecord/null as the K/V pairs from my mapper classes. Having looked at the code, I see that my reducer only treats the "key" objects (i.e. my records) as distinct if calling .equals() on them shows a difference. However, if the schema is the same (which it is for most of my mappers), then .equals() calls .compare(), which in turn depends on the ORDER attributes set on the fields. This means that if I have no sorting defined in my schema, all records are treated as being equal to one another.

Have I understood this correctly, and if so, is that not a violation of the equals contract? (For one thing, it would mean GenericRecord objects will often cause confusion when used with maps and other containers.)

Regards,
Andrew
-
Re: Does Avro GenericData.Record violate the .equals contract?
Doug Cutting 2012-02-09, 19:49
On 02/09/2012 07:02 AM, Andrew Kenworthy wrote:
> This means that if I have no sorting defined in my schema, that all
> records are treated as being equal to one another.
If you specify "order":"ignore" for all fields in a record, then, yes, all instances of that record would be equal. I cannot imagine a case where this would be useful, but I also don't see how this would violate the equals() contract.
The default for fields is to behave as if "order":"ascending" is specified. Records are equal if all of their fields that are not specified as "order":"ignore" are equal.
Doug
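The rule described above can be illustrated with a small self-contained model. This is only a sketch: `IgnoreAwareEquality`, `FieldDef` and `recordsEqual` are hypothetical names invented for the illustration, not Avro classes, and the model only assumes that fields marked "ignore" are skipped while all other fields must match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Simplified stand-in for the equality rule described above; these are NOT Avro's classes.
final class IgnoreAwareEquality {

    enum Order { ASCENDING, DESCENDING, IGNORE }

    // Hypothetical field descriptor: a name plus its sort order.
    static final class FieldDef {
        final String name;
        final Order order;

        FieldDef(String name, Order order) {
            this.name = name;
            this.order = order;
        }
    }

    // Two value lists are equal if every field not marked IGNORE has equal values.
    static boolean recordsEqual(List<FieldDef> fields, List<?> a, List<?> b) {
        for (int i = 0; i < fields.size(); i++) {
            if (fields.get(i).order == Order.IGNORE) {
                continue; // ignored fields never affect the result
            }
            if (!Objects.equals(a.get(i), b.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<FieldDef> ascending = Arrays.asList(new FieldDef("attribute1", Order.ASCENDING));
        List<FieldDef> ignored = Arrays.asList(new FieldDef("attribute1", Order.IGNORE));

        // With the default-like ordering, differing values make the records unequal.
        System.out.println(recordsEqual(ascending, Arrays.asList("1"), Arrays.asList("2")));
        // With every field ignored, any two records compare as equal.
        System.out.println(recordsEqual(ignored, Arrays.asList("1"), Arrays.asList("2")));
    }
}
```

Under this model, a schema with no explicit ordering behaves like the first case, so records with different field values are not treated as equal.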
-
Re: Does Avro GenericData.Record violate the .equals contract?
Andrew Kenworthy 2012-02-10, 12:26
Hallo Doug,
Thank you for your feedback. I agree that implicitly using Order.IGNORE to ignore differences in records makes sense, as those are the criteria used to define distinction when sorting. But it looks as though only the schema name is checked when deciding whether to examine each field or not. As the test below shows, this can result in a lack of symmetry when using equals if one is not careful. (The example is a "bad" one, as it's not a good idea to have two schemas with the same name and namespace yet with different contents, but it shows how one might inadvertently make a wrong assumption about equality.)
@Test
public void test() {
    Schema schema1 = Schema.createRecord("test_record", null, "my.namespace", false);
    List<Field> fields1 = new ArrayList<Field>();
    fields1.add(new Field("attribute1", Schema.create(Schema.Type.STRING), null, null, Order.IGNORE));
    schema1.setFields(fields1);

    Schema schema2 = Schema.createRecord("test_record", null, "my.namespace", false);
    List<Field> fields2 = new ArrayList<Field>();
    fields2.add(new Field("attribute1", Schema.create(Schema.Type.STRING), null, null, Order.ASCENDING));
    schema2.setFields(fields2);

    GenericRecord record1 = new GenericData.Record(schema1);
    record1.put("attribute1", "1");
    GenericRecord record2 = new GenericData.Record(schema2);
    record2.put("attribute1", "2");

    System.out.println(record1.equals(record2)); // returns TRUE
    System.out.println(record2.equals(record1)); // returns FALSE
}
Andrew
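The asymmetry reported in this message can be reproduced without Avro on the classpath using a reduced model. This is a sketch under stated assumptions: `AsymmetricEqualsDemo`, `Rec` and `nameOnlyEquals` are hypothetical names, the schema is collapsed to a full name plus the order of a single field, and the comparison assumes (as the message describes) that only the schema name is checked before the receiver's own field order is applied.

```java
import java.util.Objects;

// Reduced model of the behaviour described above; NOT Avro's implementation.
final class AsymmetricEqualsDemo {

    enum Order { ASCENDING, IGNORE }

    // Hypothetical one-field record: schema reduced to a full name and one field order.
    static final class Rec {
        final String schemaName;
        final Order order;
        final String value;

        Rec(String schemaName, Order order, String value) {
            this.schemaName = schemaName;
            this.order = order;
            this.value = value;
        }

        // Assumed behaviour: compare schema names only, then let the receiver's
        // own field order decide whether the field value matters.
        boolean nameOnlyEquals(Rec that) {
            if (!schemaName.equals(that.schemaName)) {
                return false;
            }
            return order == Order.IGNORE || Objects.equals(value, that.value);
        }
    }

    public static void main(String[] args) {
        Rec record1 = new Rec("my.namespace.test_record", Order.IGNORE, "1");
        Rec record2 = new Rec("my.namespace.test_record", Order.ASCENDING, "2");

        System.out.println(record1.nameOnlyEquals(record2)); // prints true
        System.out.println(record2.nameOnlyEquals(record1)); // prints false
    }
}
```

Because the two calls disagree, the comparison is not symmetric, which is the property of the equals() contract that the test in this message exercises.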
-
Re: Does Avro GenericData.Record violate the .equals contract?
Doug Cutting 2012-02-10, 17:57
This does look like a bug in GenericData.Record#equals(). It should return false when the schemas are not equal. It currently only checks the schema names as a performance optimization, but that optimization is not a good one. Can you please file a bug report in Jira?
Thanks,
Doug
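The suggested behaviour, returning false when the schemas are not equal, can be sketched in a reduced model as follows. This is not the actual Avro patch: `SchemaCheckedEquals`, `Rec` and `schemaCheckedEquals` are hypothetical names, and the model assumes that a full schema comparison includes each field's order as well as the schema name.

```java
import java.util.Objects;

// Reduced model of the suggested fix; NOT the actual Avro change.
final class SchemaCheckedEquals {

    enum Order { ASCENDING, IGNORE }

    // Hypothetical one-field record: schema reduced to a full name and one field order.
    static final class Rec {
        final String schemaName;
        final Order order;
        final String value;

        Rec(String schemaName, Order order, String value) {
            this.schemaName = schemaName;
            this.order = order;
            this.value = value;
        }

        // Assumption of this model: two schemas are equal only if both the
        // name and the field order match.
        boolean sameSchema(Rec that) {
            return schemaName.equals(that.schemaName) && order == that.order;
        }

        // Records with different schemas are never equal; otherwise the
        // shared field order decides whether the value is compared.
        boolean schemaCheckedEquals(Rec that) {
            if (!sameSchema(that)) {
                return false;
            }
            return order == Order.IGNORE || Objects.equals(value, that.value);
        }
    }

    public static void main(String[] args) {
        Rec record1 = new Rec("my.namespace.test_record", Order.IGNORE, "1");
        Rec record2 = new Rec("my.namespace.test_record", Order.ASCENDING, "2");

        // Both directions now agree because the schemas differ.
        System.out.println(record1.schemaCheckedEquals(record2)); // prints false
        System.out.println(record2.schemaCheckedEquals(record1)); // prints false
    }
}
```

Since both records are compared under the same schema once the schema check passes, the result no longer depends on which record is the receiver.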