How to perform a logical diff on two PigStorage files
Hi all,

I'm trying to build a non-regression testing tool to verify that the
files produced by two Pig scripts are equal.

The files are in PigStorage format. The first field is a key and the
remaining fields are opaque data (primitive or complex types).

Example:
1 43 {(10), (12), (14)} {(55), (90)} 0 60

I want to check that each key is present either in both files or in
neither, and that for each key the lines are equal. By equal I mean
logical equality, not string or byte equality. For example, the
following two lines should be equal:
1 43 {(10), (12), (14)} {(55), (90)} 0 60
1 43 {(12), (10), (14)} {(90), (55)} 0 60

My issue is that, since this tool needs to operate on many different
files, it should not rely on a predefined schema. I experimented with
the following idea:

------
f1 = LOAD '$FILE1' USING PigStorage();
f2 = LOAD '$FILE2' USING PigStorage();

g_f1 = GROUP f1 BY $0;
g_f2 = GROUP f2 BY $0;

joined = JOIN
g_f1  by group full outer,
g_f2  by group;

cmp = FILTER joined by
g_f1::group is null
or  g_f2::group is null
or  SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;

dump cmp;
------

Unfortunately, since no schema is specified at load time, g_f1::f1 and
g_f2::f2 are instances of DataByteArray. This means that the DIFF
function does not behave as wanted: a byte-by-byte comparison is
performed rather than a logical comparison. For example,
"1       {(2),(1)}" and "1       {(1),(2)}" are considered different
since their byte representations are not the same.

Do you know if such a tool already exists, or how to write one?

I currently foresee three options:

   1- Specify the schema. This could be done using scripting and a
      file-to-schema mapping, with the schema inserted through a variable.
      However, the schema of each file has to be described manually, which
      is a cumbersome process.
   2- Use PigStorageSchema instead of PigStorage. I believe this would
      solve the issue, but being stuck with 0.8.1 I'm wondering whether
      PigStorageSchema is reasonably robust and side-effect free enough to
      be used in production scripts.
   3- Write a custom DIFF UDF taking two DataByteArray arguments. This
      option avoids modifying production scripts, but I don't know how
      much effort is required to write such a UDF. Parsing the
      DataByteArray to rebuild a set/list/string structure seems quite
      easy. Do you think some part of the Pig code, like
      Utf8StorageConverter, can be reused, or should I simply write my own
      parser?
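
To make option 3 more concrete, here is a minimal sketch of the parsing and comparison part in plain Java, with no Pig dependency. It is not a Pig UDF and does not use Utf8StorageConverter. The class name LogicalDiff and its methods are hypothetical names chosen for illustration. The sketch assumes that fields are tab-separated, that bags ({...}) are unordered while tuples ((...)) are ordered, that maps ([...]) can be compared as opaque text, and that scalar values never contain commas, braces, or parentheses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Pig: compares two PigStorage lines logically.
class LogicalDiff {

    // Split on commas that are not nested inside (), {} or [].
    static List<String> splitTopLevel(String s) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(' || c == '{' || c == '[') {
                depth++;
            } else if (c == ')' || c == '}' || c == ']') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        // Add the trailing element, unless the input was empty (e.g. an empty bag).
        if (start < s.length() || !parts.isEmpty()) {
            parts.add(s.substring(start));
        }
        return parts;
    }

    // Rewrite a field into a canonical text form: whitespace is trimmed and
    // bag elements are sorted, so two bags with the same content in a
    // different order yield the same string. Tuple order is preserved.
    static String canonicalize(String field) {
        String f = field.trim();
        if (f.length() >= 2) {
            char open = f.charAt(0);
            char close = f.charAt(f.length() - 1);
            boolean bag = open == '{' && close == '}';
            boolean tuple = open == '(' && close == ')';
            if (bag || tuple) {
                List<String> items = new ArrayList<>();
                for (String item : splitTopLevel(f.substring(1, f.length() - 1))) {
                    items.add(canonicalize(item));
                }
                if (bag) {
                    Collections.sort(items); // assumption: bags are unordered
                }
                return open + String.join(",", items) + close;
            }
        }
        return f; // scalars and maps are compared as trimmed text
    }

    // Two lines are logically equal when they have the same number of
    // tab-separated fields and each pair of fields has the same canonical form.
    static boolean logicallyEqual(String line1, String line2) {
        String[] a = line1.split("\t", -1);
        String[] b = line2.split("\t", -1);
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (!canonicalize(a[i]).equals(canonicalize(b[i]))) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, the two example lines above compare as equal once their fields are tab-separated, while "(2,1)" and "(1,2)" stay different because tuple order is kept. Since everything is compared as text, values such as 1 and 1.0 would still be reported as different; handling that would require type information.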
Thanks !

- Clément
Ruslan Al-Fakikh 2012-11-30, 20:49
Bill Graham 2012-11-30, 23:14