Pig, mail # user - How to perform a logical diff on two PigStorage files


Clément MATHIEU 2012-11-30, 13:48
Re: How to perform a logical diff on two PigStorage files
Ruslan Al-Fakikh 2012-11-30, 20:49
Hi,

As for point 1: it will always be cumbersome to work on such files. I would
recommend using Avro, where the schema is included in the file.
You could also try to sort the contents or apply some transformation to force
the files to look the same, then just diff the files outside of Pig. That's
just an idea; I'm not sure whether it will work for you.
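
To illustrate the "force the files to look the same" idea, here is a minimal
sketch in plain Python, run outside of Pig. It rewrites each PigStorage line
so that the elements of every bag are sorted, which makes two logically equal
lines textually equal. The function names are made up for this example, and
it assumes that fields are tab-separated and that atomic values never contain
commas, braces or parentheses. Maps are left untouched.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators nested inside (), {} or []."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch in "({[":
            depth += 1
        elif ch in ")}]":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def canonical(value):
    """Return a canonical string for one field (hypothetical helper).

    Bags ({...}) get their elements sorted; tuples ((...)) keep their order.
    Sorting is done on strings, which is enough to make the order stable.
    """
    value = value.strip()
    if value.startswith("{") and value.endswith("}"):
        inner = value[1:-1]
        items = [canonical(v) for v in split_top_level(inner)] if inner.strip() else []
        return "{" + ",".join(sorted(items)) + "}"
    if value.startswith("(") and value.endswith(")"):
        items = [canonical(v) for v in split_top_level(value[1:-1])]
        return "(" + ",".join(items) + ")"
    return value


def canonical_line(line):
    """Canonicalize every tab-separated field of one PigStorage line."""
    return "\t".join(canonical(f) for f in line.rstrip("\n").split("\t"))


a = "1\t43\t{(10), (12), (14)}\t{(55), (90)}\t0\t60"
b = "1\t43\t{(12), (10), (14)}\t{(90), (55)}\t0\t60"
print(canonical_line(a) == canonical_line(b))  # True
```

After rewriting both files this way, sorting their lines and running a
standard diff on the results should report only logical differences.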

Thanks
On Fri, Nov 30, 2012 at 5:48 PM, Clément MATHIEU <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I'm trying to build a non-regression testing tool to verify that the files
> produced by two Pig scripts are equal.
>
> The files are in PigStorage format. The first field is a key and the
> remaining fields are opaque data (primitive or complex types).
>
> Example:
>         1       43      {(10), (12), (14)}      {(55), (90)}    0       60
>
> I want to check that each key is present in both files or in neither, and
> that for each key the lines are equal. By equal I mean logical equality,
> not string or byte equality. For example, the following two lines should be
> equal:
>         1       43      {(10), (12), (14)}      {(55), (90)}    0       60
>         1       43      {(12), (10), (14)}      {(90), (55)}    0       60
>
>
> My issue is that since this tool needs to operate on a lot of different
> files, it should not rely on a predefined schema. I experimented with
> the following idea:
>
> ------
>         f1 = LOAD '$FILE1' USING PigStorage();
>         f2 = LOAD '$FILE2' USING PigStorage();
>
>         g_f1 = GROUP f1 BY $0;
>         g_f2 = GROUP f2 BY $0;
>
>         joined = JOIN
>                 g_f1  by group full outer,
>                 g_f2  by group;
>
>         cmp = FILTER joined by
>                 g_f1::group is null
>                 or  g_f2::group is null
>                 or  SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;
>
>         dump cmp;
> ------
>
> Unfortunately, since no schema is specified at load time, g_f1::f1 and
> g_f2::f2 are instances of DataByteArray. This means that the DIFF function
> does not behave as wanted: a byte-by-byte comparison is performed rather
> than a logical comparison. For example, "1       {(2),(1)}" and
> "1       {(1),(2)}" are different since their byte representations are not
> the same.
>
> Do you know whether such a tool already exists, or how to write it?
>
> I currently foresee three options:
>
>   1- Specify the schema. It could be done using scripting and a
>      file-to-schema mapping. The schema would be inserted using a
>      variable. However, the schema of each file has to be described
>      manually. This is a cumbersome process.
>   2- Use PigStorageSchema instead of PigStorage. I believe this would solve
>      the issue, but being stuck with 0.8.1 I'm wondering whether
>      PigStorageSchema is reasonably robust and side-effect free enough to
>      be used in production scripts.
>   3- Write a custom DIFF UDF taking two DataByteArray. This option avoids
>      modifying production scripts, but I don't know how much effort is
>      required to write such a UDF. Parsing the DataByteArray to rebuild a
>      set/list/string structure seems quite easy. Do you think some part of
>      the Pig code, like Utf8StorageConverter, can be reused, or should I
>      simply write my own parser?
>
>
> Thanks !
>
> - Clément
>
>
>
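
For reference, the "write my own parser" route from option 3 could look
roughly like the sketch below. It is plain Python rather than a Pig UDF, and
it does not reuse any Pig class. Every name in it is hypothetical. It
rebuilds bags as multisets and tuples as ordered sequences, then reports the
keys that are missing from one file or whose lines differ logically. It
assumes tab-separated fields, the key in the first field, and atomic values
nested inside bags or tuples that contain no commas, braces or parentheses.

```python
from collections import Counter, defaultdict


def skip_spaces(text, pos):
    while pos < len(text) and text[pos] == " ":
        pos += 1
    return pos


def parse_value(text, pos=0):
    """Parse one value starting at text[pos]; return (value, next_pos)."""
    pos = skip_spaces(text, pos)
    if pos < len(text) and text[pos] in ("{", "("):
        opener = text[pos]
        closer = "}" if opener == "{" else ")"
        pos += 1
        items = []
        while True:
            pos = skip_spaces(text, pos)
            if pos >= len(text):
                raise ValueError("unterminated %s in %r" % (opener, text))
            if text[pos] == closer:
                pos += 1
                break
            item, pos = parse_value(text, pos)
            items.append(item)
            pos = skip_spaces(text, pos)
            if pos < len(text) and text[pos] == ",":
                pos += 1
            elif pos >= len(text) or text[pos] != closer:
                raise ValueError("unexpected input at %d in %r" % (pos, text))
        if opener == "{":
            # A bag is an unordered multiset: count elements, ignore order.
            return ("bag", frozenset(Counter(items).items())), pos
        return ("tuple", tuple(items)), pos
    # Atomic value: read up to the next delimiter.
    start = pos
    while pos < len(text) and text[pos] not in ",)}":
        pos += 1
    return text[start:pos].strip(), pos


def parse_field(field):
    """Parse a top-level field; fall back to the raw text if it is not a
    well-formed bag or tuple."""
    stripped = field.strip()
    if stripped[:1] in ("{", "("):
        try:
            value, end = parse_value(stripped)
            if end == len(stripped):
                return value
        except ValueError:
            pass
    return field


def index_by_key(lines):
    """Map each key (first field) to a multiset of its parsed lines."""
    rows = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        rows[fields[0]][tuple(parse_field(f) for f in fields[1:])] += 1
    return rows


def logical_diff(lines1, lines2):
    """Return the sorted keys that are missing on one side or differ."""
    a, b = index_by_key(lines1), index_by_key(lines2)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))
```

Calling logical_diff(open(file1), open(file2)) on two local copies of the
outputs returns an empty list when the files are logically equal. Everything
is held in memory, so this only suits files that fit on a single machine.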
Bill Graham 2012-11-30, 23:14