How to perform a logical diff on two PigStorage files
Hi all,

I'm trying to build a non-regression testing tool to verify that the files
produced by two Pig scripts are equal.

The files are in PigStorage format. The first field is a key and the other
fields are opaque data (primitive or complex types).

1 43 {(10), (12), (14)} {(55), (90)} 0 60

I want to check that each key is present in both files or in neither, and
that for each key the lines are equal. By equal I mean logical equality,
not string or byte equality. For example, the following two lines
should be considered equal:
1 43 {(10), (12), (14)} {(55), (90)} 0 60
1 43 {(12), (10), (14)} {(90), (55)} 0 60
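
To make the requirement concrete, here is a minimal sketch in plain Python (outside Pig) of the check itself. The names `group_by_key`, `logical_diff` and `canon` are hypothetical, and the sketch assumes tab-separated fields (the PigStorage default delimiter). The `canon` argument is where a logical, order-insensitive representation of a row would be plugged in; by default rows are compared field by field as strings.

```python
from collections import Counter, defaultdict

def group_by_key(lines, canon=tuple):
    """Map each key (first field) to a multiset of its canonicalised rows."""
    groups = defaultdict(Counter)
    for line in lines:
        key, *rest = line.rstrip("\n").split("\t")
        groups[key][canon(rest)] += 1
    return groups

def logical_diff(lines1, lines2, canon=tuple):
    """Return the keys missing from one side or whose rows differ."""
    g1 = group_by_key(lines1, canon)
    g2 = group_by_key(lines2, canon)
    return sorted(k for k in g1.keys() | g2.keys() if g1.get(k) != g2.get(k))

file1 = ["1\t43\t0", "2\t7\t1"]
file2 = ["2\t7\t1", "1\t43\t0", "3\t5\t5"]
print(logical_diff(file1, file2))  # ['3']: key 3 exists only in file2
```

Line order within a file does not matter here, because rows are grouped per key into a multiset before being compared.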
My issue is that, since this tool needs to operate on a lot of different
files, it should not rely on a predefined schema. I experimented with
the following idea:

f1 = LOAD '$FILE1' USING PigStorage();
f2 = LOAD '$FILE2' USING PigStorage();

g_f1 = GROUP f1 BY $0;
g_f2 = GROUP f2 BY $0;

joined = JOIN
g_f1  by group full outer,
g_f2  by group;

cmp = FILTER joined by
g_f1::group is null
or  g_f2::group is null
or  SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;

dump cmp;

Unfortunately, since no schema is specified at load time, the fields in
g_f1::f1 and g_f2::f2 are instances of DataByteArray. This means that the
DIFF function does not behave as wanted: a byte-by-byte comparison is
performed rather than a logical comparison. For example, "1       {(2),(1)}"
and the same line with the bag elements in a different order are reported
as different, since their byte representations are not the same.
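
The difference between the two kinds of comparison can be illustrated with a small sketch in plain Python (not Pig). The helper `flat_bag` is hypothetical and only handles a flat bag of simple tuples, with no nesting; it is meant to show the idea, not to reproduce what DIFF does internally.

```python
import re
from collections import Counter

def flat_bag(text):
    """Multiset of the tuples in a flat bag such as '{(10), (12)}'."""
    return Counter(re.findall(r"\(([^()]*)\)", text))

left = "{(10), (12), (14)}"
right = "{(12), (10), (14)}"

# Byte-wise comparison: the two serialised bags differ.
print(left.encode() == right.encode())    # False
# Order-insensitive (multiset) comparison: the two bags are equal.
print(flat_bag(left) == flat_bag(right))  # True
```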

Do you know whether such a tool already exists, or how to write one?

I currently foresee three options:

   1- Specify the schema. This could be done using scripting and a
      mapping. The schema would be inserted using a variable. However,
      the schema of each file has to be described manually, which is
      cumbersome.
   2- Use PigStorageSchema instead of PigStorage. I believe this would solve
      the issue, but being stuck with 0.8.1, I'm wondering whether it
      is reasonably robust and side-effect free enough to be used in production.
   3- Write a custom DIFF UDF taking two DataByteArray arguments. This option
      avoids modifying production scripts, but I don't know how much effort is
      needed to write such a UDF. Parsing the DataByteArray to rebuild a
      set/list/string structure seems quite easy. Do you think some part of the
      Pig code, like Utf8StorageConverter, can be reused, or should I
      simply write my own parser?
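
For option 3, the parsing I have in mind could look roughly like the sketch below. It is plain Python rather than an actual Pig UDF, and it does not use Utf8StorageConverter or any other Pig class; all names are hypothetical. It assumes tab-separated fields, bags written as {...} and tuples as (...), and it ignores spaces around values. Anything it cannot parse completely (maps, or atoms containing commas) is kept as the raw string, so those fields fall back to plain string comparison.

```python
def _skip_spaces(text, pos):
    while pos < len(text) and text[pos] == " ":
        pos += 1
    return pos

def _parse(text, pos):
    """Parse one value at text[pos]; return (canonical_value, next_pos)."""
    pos = _skip_spaces(text, pos)
    if pos < len(text) and text[pos] in "{(":
        opener = text[pos]
        closer = "}" if opener == "{" else ")"
        items = []
        pos = _skip_spaces(text, pos + 1)
        if pos < len(text) and text[pos] == closer:
            pos += 1  # empty bag or tuple
        else:
            while True:
                item, pos = _parse(text, pos)
                items.append(item)
                pos = _skip_spaces(text, pos)
                if pos >= len(text):
                    raise ValueError("unbalanced %r in %r" % (opener, text))
                if text[pos] == closer:
                    pos += 1
                    break
                if text[pos] != ",":
                    raise ValueError("unexpected %r in %r" % (text[pos], text))
                pos += 1
        if opener == "{":
            # A bag is unordered: sort its items so element order is irrelevant.
            return ("bag", tuple(sorted(items, key=repr))), pos
        # A tuple is ordered: keep its items as they appear.
        return ("tuple", tuple(items)), pos
    # Atom: everything up to the next structural delimiter.
    end = pos
    while end < len(text) and text[end] not in ",(){}":
        end += 1
    return text[pos:end].strip(), end

def canonical(field):
    """Canonical, hashable form of one field; unparseable text is kept verbatim."""
    try:
        value, pos = _parse(field, 0)
    except ValueError:
        return field
    return value if _skip_spaces(field, pos) == len(field) else field

def logically_equal(line1, line2, sep="\t"):
    """Compare two lines field by field using their canonical forms."""
    fields1 = [canonical(f) for f in line1.split(sep)]
    fields2 = [canonical(f) for f in line2.split(sep)]
    return fields1 == fields2

a = "1\t43\t{(10), (12), (14)}\t{(55), (90)}\t0\t60"
b = "1\t43\t{(12), (10), (14)}\t{(90), (55)}\t0\t60"
print(a == b)                 # False: the strings differ
print(logically_equal(a, b))  # True: the bags hold the same elements
```

Bags are compared as multisets, so duplicates still count, while tuples keep their field order. The same logic would have to be ported to a Java or Jython UDF to run inside a Pig script.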
Thanks!

- Clément