Pig, mail # user - How to perform a logical diff on two PigStorage files


Clément MATHIEU 2012-11-30, 13:48
Hi all,

I'm trying to build a non-regression testing tool to verify that the
files produced by two Pig scripts are equal.

The files are in PigStorage format. The first field is a key and the
remaining fields are opaque data (primitive or complex types).

Example:
1 43 {(10), (12), (14)} {(55), (90)} 0 60

I want to check that each key is present in both files or in neither,
and that for each key the lines are equal. By equal I mean logically
equal, not string or byte equal. For example, the following two lines
should be considered equal:
1 43 {(10), (12), (14)} {(55), (90)} 0 60
1 43 {(12), (10), (14)} {(90), (55)} 0 60

My issue is that, since this tool needs to operate on a lot of
different files, it should not rely on a predefined schema. I
experimented with the following idea:

------
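-- Load both files without a schema; every field is a bytearray.
f1 = LOAD '$FILE1' USING PigStorage();
f2 = LOAD '$FILE2' USING PigStorage();

-- Group the lines of each file by key (first field).
g_f1 = GROUP f1 BY $0;
g_f2 = GROUP f2 BY $0;

-- Full outer join, so keys present in only one file are kept.
joined = JOIN
    g_f1 BY group FULL OUTER,
    g_f2 BY group;

-- Keep keys missing from one side, or whose lines differ.
cmp = FILTER joined BY
       g_f1::group IS NULL
    OR g_f2::group IS NULL
    OR SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;

DUMP cmp;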
------

Unfortunately, since no schema is specified at load time, the data in
g_f1::f1 and g_f2::f2 is held as DataByteArray instances. This means
that the DIFF function does not behave as wanted: a byte-by-byte
comparison is performed rather than a logical comparison. For example,
"1       {(2),(1)}" and "1       {(1),(2)}" are considered different
since their byte representations are not the same.

Do you know whether such a tool already exists, or how to write one?

I currently foresee three options:

   1- Specify the schema. This could be done with scripting and a
      file-to-schema mapping, with the schema inserted through a
      variable. However, the schema of each file has to be described
      manually, which is a cumbersome process.
   2- Use PigStorageSchema instead of PigStorage. I believe this would
      solve the issue, but being stuck with 0.8.1 I'm wondering whether
      PigStorageSchema is robust and side-effect free enough to be used
      in production scripts.
   3- Write a custom DIFF UDF taking two DataByteArray. This option
      avoids modifying production scripts, but I don't know how much
      effort is required to write such a UDF. Parsing the DataByteArray
      to rebuild a set/list/string structure seems quite easy. Do you
      think some part of the Pig code, like Utf8StorageConverter, can be
      reused, or should I simply write my own parser?
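
To make option 3 more concrete, here is a rough sketch of the "own
parser" idea in plain Java, with no Pig dependency. Everything in it is
an assumption for illustration: the class name LogicalDiffSketch and its
methods are hypothetical, fields are assumed to be tab-separated, bags
are written {...}, tuples (...) and maps [k#v,...], and atoms are
assumed not to contain brackets, commas or '#'. It rewrites each line
into a canonical text form in which bag elements and map entries are
sorted, so two lines are logically equal when their canonical forms
match. Atoms are compared as trimmed text, so "1" and "1.0" would still
differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Pig: canonicalizes PigStorage text so
// that bag element order and map entry order do not affect equality.
final class LogicalDiffSketch {

    // Canonical form of one field: bags and maps sorted, tuples kept in order.
    static String canonical(String field) {
        String s = field.trim();
        if (s.length() >= 2) {
            char open = s.charAt(0);
            char close = s.charAt(s.length() - 1);
            boolean bag = open == '{' && close == '}';
            boolean tuple = open == '(' && close == ')';
            boolean map = open == '[' && close == ']';
            if (bag || tuple || map) {
                List<String> parts = new ArrayList<>();
                for (String p : splitTopLevel(s.substring(1, s.length() - 1), ',')) {
                    parts.add(map ? canonicalEntry(p) : canonical(p));
                }
                if (!tuple) {
                    Collections.sort(parts); // bags and maps are unordered
                }
                return open + String.join(",", parts) + close;
            }
        }
        return s; // atom: compared as trimmed text
    }

    // Canonical form of a map entry "key#value".
    static String canonicalEntry(String entry) {
        int hash = entry.indexOf('#');
        if (hash < 0) {
            return canonical(entry);
        }
        return entry.substring(0, hash).trim() + "#" + canonical(entry.substring(hash + 1));
    }

    // Splits on sep, ignoring separators nested inside (), {} or [].
    static List<String> splitTopLevel(String s, char sep) {
        List<String> out = new ArrayList<>();
        if (s.trim().isEmpty()) {
            return out; // empty bag, tuple or map
        }
        int depth = 0;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(' || c == '{' || c == '[') {
                depth++;
            } else if (c == ')' || c == '}' || c == ']') {
                depth--;
            } else if (c == sep && depth == 0) {
                out.add(s.substring(start, i));
                start = i + 1;
            }
        }
        out.add(s.substring(start));
        return out;
    }

    // Canonical form of a whole line (assumed tab-separated fields).
    static String canonicalLine(String line) {
        List<String> fields = new ArrayList<>();
        for (String f : line.split("\t", -1)) {
            fields.add(canonical(f));
        }
        return String.join("\t", fields);
    }

    static boolean logicallyEqual(String a, String b) {
        return canonicalLine(a).equals(canonicalLine(b));
    }
}
```

With this sketch, logicallyEqual applied to the two example lines above
(with tabs between fields) returns true, while tuples that differ only
in field order stay different. Sorting keeps duplicates, so a bag
containing (1) twice is not equal to a bag containing it once. Wrapping
this in a Pig UDF, or replacing the hand-written parsing with
Utf8StorageConverter, is left open here.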
Thanks !

- Clément