How to perform a logical diff on two PigStorage files
Clément MATHIEU 2012-11-30, 13:48
Hi all,
I'm trying to build a non-regression testing tool to verify that the files produced by two Pig scripts are equal.
The files are in PigStorage format. The first field is a key and remaining fields are opaque data (primitive or complex types).
Example: 1 43 {(10), (12), (14)} {(55), (90)} 0 60
I want to check that each key is present in both files or in neither, and that for each key the lines are equal. By equal I mean logical equality, not string or byte equality. For example, the following two lines should be considered equal:
1 43 {(10), (12), (14)} {(55), (90)} 0 60
1 43 {(12), (10), (14)} {(90), (55)} 0 60
My issue is that since this tool needs to operate on a lot of different files, it should not rely on a predefined schema. I experimented with the following idea:
------
f1 = LOAD '$FILE1' USING PigStorage();
f2 = LOAD '$FILE2' USING PigStorage();

g_f1 = GROUP f1 BY $0;
g_f2 = GROUP f2 BY $0;

joined = JOIN
    g_f1 by group full outer,
    g_f2 by group;

cmp = FILTER joined by
    g_f1::group is null
    or g_f2::group is null
    or SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;

dump cmp;
------
Unfortunately, since no schema is specified at load time, g_f1::f1 and g_f2::f2 are instances of DataByteArray. This means that the DIFF function does not behave as wanted: a byte-for-byte comparison is performed rather than a logical one. For example, "1 {(2),(1)}" and "1 {(1),(2)}" are reported as different since their byte representations are not the same.
Do you know whether such a tool already exists, or how to write one?
I currently foresee three options:
1- Specify the schema. This could be done using scripting and a file-to-schema mapping, with the schema inserted through a variable. However, the schema of each file has to be described manually, which is a cumbersome process.
2- Use PigStorageSchema instead of PigStorage. I believe this would solve the issue, but being stuck with 0.8.1 I'm wondering whether PigStorageSchema is robust enough and free enough of side effects to be used in production scripts.
3- Write a custom DIFF UDF taking two DataByteArray values. This option avoids modifying production scripts, but I don't know how much effort is required to write such a UDF. Parsing the DataByteArray to rebuild a set/list/string structure seems quite easy. Do you think some part of the Pig code, like Utf8StorageConverter, can be reused, or should I simply write my own parser?
Thanks!
- Clément
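To give a rough idea of what the hand-written parser in option 3 might involve, here is a minimal standalone sketch in Python rather than a Pig UDF. It assumes tab-separated fields, bags written as {...} and tuples as (...), and it treats every other value (including maps) as an opaque string, so atoms containing brackets or commas would confuse it. The names split_top_level, parse_value, parse_line and logically_equal are made up for illustration and are not part of Pig.

```python
from collections import Counter


def split_top_level(text, sep):
    """Split text on sep, ignoring separators nested inside (), {} or []."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch in "({[":
            depth += 1
        elif ch in ")}]":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def parse_value(text):
    """Turn one field into a hashable structure (assumed syntax, no schema)."""
    text = text.strip()
    if text.startswith("{") and text.endswith("}"):
        inner = text[1:-1].strip()
        items = [parse_value(p) for p in split_top_level(inner, ",")] if inner else []
        # A bag is unordered, so store it as a multiset: (item, count) pairs.
        return ("bag", frozenset(Counter(items).items()))
    if text.startswith("(") and text.endswith(")"):
        # A tuple is ordered, so the position of each member matters.
        members = split_top_level(text[1:-1], ",")
        return ("tuple", tuple(parse_value(p) for p in members))
    # Without a schema, atoms can only be compared as strings.
    return ("atom", text)


def parse_line(line, sep="\t"):
    """Parse one PigStorage line into a tuple of parsed fields."""
    return tuple(parse_value(f) for f in split_top_level(line.rstrip("\n"), sep))


def logically_equal(line1, line2, sep="\t"):
    """True when both lines hold the same values, ignoring order inside bags."""
    return parse_line(line1, sep) == parse_line(line2, sep)
```

With this sketch, the two example lines from the message compare as equal, while two tuples with swapped members do not. Since atoms are compared as strings, "1" and "1.0" would still count as different.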
Re: How to perform a logical diff on two PigStorage files
Ruslan Al-Fakikh 2012-11-30, 20:49
Hi,
As for point 1: it will always be cumbersome to work on such files. I would recommend using Avro, where the schema is included in the file. You could also try to sort the contents or apply some transformation to force the files to look the same, then just diff the files outside of Pig. That's just an idea; I'm not sure whether it will work for you.
Thanks
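The "transform, then diff outside of Pig" idea could look roughly like the sketch below. It rewrites each line into a canonical text form (bag members sorted, whitespace around members dropped), sorts the lines, and hands the result to Python's standard difflib. The bag and tuple syntax, the tab delimiter and the helper names (_split, canonical, canonical_file, diff_outputs) are assumptions made for illustration. Everything is held in memory, so this only suits files small enough to pull off the cluster.

```python
import difflib


def _split(text, sep):
    """Split on sep, but only where we are not nested inside (), {} or []."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch in "({[":
            depth += 1
        elif ch in ")}]":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def canonical(value):
    """Rewrite one field so that logically equal fields become identical text."""
    value = value.strip()
    if len(value) >= 2 and value[0] == "{" and value[-1] == "}":
        inner = value[1:-1].strip()
        # Bags are unordered: sort the canonical form of their members.
        items = sorted(canonical(v) for v in _split(inner, ",")) if inner else []
        return "{" + ",".join(items) + "}"
    if len(value) >= 2 and value[0] == "(" and value[-1] == ")":
        # Tuples are ordered: canonicalize members but keep their positions.
        return "(" + ",".join(canonical(v) for v in _split(value[1:-1], ",")) + ")"
    return value


def canonical_file(lines, sep="\t"):
    """Canonicalize every line, then sort so that record order does not matter."""
    rows = (sep.join(canonical(f) for f in _split(line.rstrip("\n"), sep))
            for line in lines)
    return sorted(rows)


def diff_outputs(lines1, lines2, sep="\t"):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(canonical_file(lines1, sep),
                                     canonical_file(lines2, sep),
                                     "file1", "file2", lineterm=""))
```

Calling diff_outputs(open(path1), open(path2)) on two part files would then return an empty list when the outputs are logically the same. Because the key is the first field, sorting whole lines also keeps records with the same key next to each other in the diff.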
Re: How to perform a logical diff on two PigStorage files
Bill Graham 2012-11-30, 23:14
I've done this in two passes. First I do an intersection test and determine the outer misses by join key on each side, similar to what you've done. I then store the left_only and right_only side for further inspection.
Then I take the intersection relation, which contains a left and a right tuple, and I pass that through a UDF. This is similar to your #3 proposal, only the UDF takes two tuples. It traverses them in parallel before outputting a string representation of a bitmask of which tuple fields matched or missed. Group on the bitmasks to generate counts and you get a report of all the different combos of field misses. All without a known schema.
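The second pass might be sketched as follows, in plain Python instead of a Pig UDF. This is not the actual UDF, only an illustration of the idea, and several things are assumed: fields are already parsed into Python values, a Python list stands in for a bag and is compared as a multiset at the top level only, and "1" marks a matching field while "0" marks a miss. The names field_matches, match_bitmask and bitmask_report are hypothetical.

```python
from collections import Counter


def field_matches(left, right):
    """Compare two fields; lists stand in for bags and ignore member order."""
    if isinstance(left, list) and isinstance(right, list):
        # Only the top level is order-insensitive in this sketch.
        return Counter(map(repr, left)) == Counter(map(repr, right))
    return left == right


def match_bitmask(left, right):
    """Walk both tuples in parallel: '1' where a field matches, '0' where not."""
    width = max(len(left), len(right))
    bits = []
    for i in range(width):
        both_present = i < len(left) and i < len(right)
        if both_present and field_matches(left[i], right[i]):
            bits.append("1")
        else:
            # A field missing on one side counts as a miss.
            bits.append("0")
    return "".join(bits)


def bitmask_report(pairs):
    """Group (left, right) pairs by bitmask and count each combination."""
    return Counter(match_bitmask(left, right) for left, right in pairs)
```

For pairs joined on the key, a report such as {"101": 2, "111": 1} would mean two keys differ only in their second field and one key matches completely.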