Utf8StorageConverter Not Handling Empty Tuples Properly
In Pig 0.10.1, I ran into a case where passing an empty tuple to the caster
method *Utf8StorageConverter.consumeTuple()* does not produce a valid empty
tuple. The output is instead a tuple object containing a single empty
DataByteArray. I believe this warrants a discussion of the set-theoretic
form of the empty states in this caster's pre- and post-conditions. An empty
tuple is the empty set ∅, just as the empty bag is the empty set ∅
(https://en.wikipedia.org/wiki/Tuple#Tuples_as_nested_sets). For Pig, I
believe ∅ translates to TupleFactory.getInstance().newTuple() for tuples and
to BagFactory.getInstance().newDefaultBag() for bags, and not to null
objects.

Use Case:

String string_input = "()";
String string_schema = "t1:tuple()";
Tuple t1 = this.tuple_factory.newTuple();  // expected: empty tuple
Utf8StorageConverter caster = new Utf8StorageConverter();
LogicalSchema ls = Utils.parseSchema(string_schema);
ResourceSchema rs = new ResourceSchema(ls);
ResourceSchema.ResourceFieldSchema[] fields = rs.getFields();
Object result = CastUtils.convertToType(caster,
    string_input.getBytes("UTF-8"), fields[0], fields[0].getType());
// fails: consumeTuple() does not handle the empty tuple correctly
org.junit.Assert.assertEquals(t1, result);
The result object effectively has the form "t1:tuple(a:bytearray)", which is
incorrect; it should be "t1:tuple()". In other words, the result contains
one field of type DataByteArray whose size is 0.
After examining the code, I think a relatively easy fix would be to add a
conditional at lines 170-171, changing them to:

*src/org/apache/pig/builtin/Utf8StorageConverter.java:*
170: DataByteArray value = new DataByteArray(mOut.toByteArray());
171: if (value.size() > 0) { //  non-empty tuple condition
172:     t.append(value);
173: }
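
To make the effect of the guard concrete without a Pig build, here is a minimal, self-contained sketch. It is not Pig's code: the class EmptyTupleSketch and the method parseTuple are hypothetical stand-ins, a List<String> plays the role of the Tuple, each String plays the role of a DataByteArray field, and nested tuples, bags, and maps are ignored. It only assumes that the converter flushes a byte buffer into the tuple when it reaches the closing parenthesis.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a schemaless tuple parser (not Pig's code).
final class EmptyTupleSketch {

    /**
     * Parses "(f1,f2,...)" into a list of fields.
     * guardEmpty = false models the reported behavior: the buffer is always
     * appended at ')'. guardEmpty = true models the proposed fix: the buffer
     * is appended at ')' only when it is non-empty.
     */
    static List<String> parseTuple(String text, boolean guardEmpty) {
        ByteArrayInputStream in =
            new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        if (in.read() != '(') {
            throw new IllegalArgumentException("Expected '(' at start of tuple");
        }
        List<String> tuple = new ArrayList<>();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            int b = in.read();
            if (b == -1) {
                throw new IllegalArgumentException("Unexpected end of tuple");
            }
            if (b == ')') {
                // End of tuple: flush the pending field, skipping it if the
                // guard is on and nothing was buffered.
                if (!guardEmpty || out.size() > 0) {
                    tuple.add(new String(out.toByteArray(), StandardCharsets.UTF_8));
                }
                return tuple;
            }
            if (b == ',') {
                // Field separator: flush the pending field, start a new one.
                tuple.add(new String(out.toByteArray(), StandardCharsets.UTF_8));
                out.reset();
            } else {
                out.write(b);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTuple("()", false).size()); // prints 1
        System.out.println(parseTuple("()", true).size());  // prints 0
        System.out.println(parseTuple("(a,b)", true));      // prints [a, b]
    }
}
```

In this model the guard also drops a trailing empty field, so "(a,)" parses to one field instead of two. Whether that matters for the real converter would need to be checked against the actual code.
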
With this fix in place, the following unit tests pass:
//  empty tuple test
String string_input = "()";
String string_schema = "t1:tuple()";
Tuple t1 = this.tuple_factory.newTuple();  // expected: empty tuple
Utf8StorageConverter caster = new Utf8StorageConverter();
LogicalSchema ls = Utils.parseSchema(string_schema);
ResourceSchema rs = new ResourceSchema(ls);
ResourceSchema.ResourceFieldSchema[] fields = rs.getFields();
Object result = CastUtils.convertToType(caster,
    string_input.getBytes("UTF-8"), fields[0], fields[0].getType());
// with the code-block fix, succeeds: result is an empty tuple
org.junit.Assert.assertEquals(t1, result);

//  for reference, the same approach with empty DataBags
String string_input = "{}";
String string_schema = "b1:bag{}";
DataBag b1 = this.bag_factory.newDefaultBag();  // expected: empty bag
Utf8StorageConverter caster = new Utf8StorageConverter();
LogicalSchema ls = Utils.parseSchema(string_schema);
ResourceSchema rs = new ResourceSchema(ls);
ResourceSchema.ResourceFieldSchema[] fields = rs.getFields();
Object result = CastUtils.convertToType(caster,
    string_input.getBytes("UTF-8"), fields[0], fields[0].getType());
// succeeds: result is an empty bag, no modifications required
org.junit.Assert.assertEquals(b1, result);
Thoughts on this discussion point?

-Dan