Re: How to perform a logical diff on two PigStorage files
I've done this in two passes. First I do an intersection test and determine
the outer misses by join key on each side, similar to what you've done. I
then store the left_only and right_only sides for further inspection.
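
As a rough illustration of that first pass, here is a plain Python sketch of
the logic (not Pig code). It assumes each row is a tuple whose first field is
the join key and that keys are unique within a file; the function name
split_by_key is made up for this example.

```python
def split_by_key(left_rows, right_rows):
    """Partition two keyed datasets into left-only, right-only and matched pairs.

    Assumption: row[0] is the join key and is unique within each input.
    """
    left = {row[0]: row for row in left_rows}
    right = {row[0]: row for row in right_rows}

    # Keys seen on one side only are the "outer misses".
    left_only = [left[k] for k in sorted(left.keys() - right.keys())]
    right_only = [right[k] for k in sorted(right.keys() - left.keys())]

    # Keys seen on both sides yield (left_tuple, right_tuple) pairs
    # that go on to the second pass.
    both = [(left[k], right[k]) for k in sorted(left.keys() & right.keys())]
    return left_only, right_only, both
```

In Pig itself the same split would typically come out of a full outer join
followed by filters on which side is null.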

Then I take the intersection relation, which contains a left and a right
tuple for each key, and pass it through a UDF. This is similar to your #3
proposal, except that the UDF takes two tuples. It traverses them in parallel
and outputs a string representation of a bitmask showing which tuple fields
matched or missed. Group on the bitmasks to generate counts and you get a
report of all the different combinations of field misses. All without a known
schema.
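
A minimal Python sketch of the bitmask idea, just to show the shape of it
(the real thing would be a Pig UDF). The names field_bitmask and miss_report
are made up here, fields are compared with plain ==, and a position missing
on one side is counted as a miss; a logical comparator for bags could be
swapped in at the comparison.

```python
from collections import Counter
from itertools import zip_longest

# Sentinel marking a field position that exists on one side only.
_MISSING = object()


def field_bitmask(left, right):
    """Return one character per field position: '1' for a match, '0' for a miss."""
    return "".join(
        "1" if l is not _MISSING and r is not _MISSING and l == r else "0"
        for l, r in zip_longest(left, right, fillvalue=_MISSING)
    )


def miss_report(pairs):
    """Count how many (left, right) pairs produced each bitmask."""
    return Counter(field_bitmask(l, r) for l, r in pairs)
```

A mask made only of 1s means the two tuples agree on every field, so the
interesting rows in the report are the masks that contain a 0.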

On Fri, Nov 30, 2012 at 12:49 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:

> Hi,
>
> As for point 1: it will always be cumbersome to work on such files. I would
> recommend using Avro, where the schema is included in the file.
> You could also try sorting the contents or applying some transformation to
> force the files to look the same, then just diff the files outside of Pig.
> That's just an idea; I'm not sure whether it'll work for you.
>
> Thanks
>
>
> On Fri, Nov 30, 2012 at 5:48 PM, Clément MATHIEU <[EMAIL PROTECTED]> wrote:
>
> > Hi all,
> >
> > I'm trying to build a non-regression testing tool to verify that the files
> > produced by two Pig scripts are equal.
> >
> > The files are in PigStorage format. The first field is a key and the
> > remaining fields are opaque data (primitive or complex types).
> >
> > Example:
> >         1       43      {(10), (12), (14)}      {(55), (90)}    0       60
> >
> > I want to check that each key is present in both files or in neither, and
> > that for each key the lines are equal. By equal I mean logical equality,
> > not string or byte equality. For example, the following two lines should
> > be equal:
> >         1       43      {(10), (12), (14)}      {(55), (90)}    0       60
> >         1       43      {(12), (10), (14)}      {(90), (55)}    0       60
> >
> >
> > My issue is that since this tool needs to operate on a lot of different
> > files, it should not rely on a predefined schema. I experimented with
> > the following idea:
> >
> > ------
> >         f1 = LOAD '$FILE1' USING PigStorage();
> >         f2 = LOAD '$FILE2' USING PigStorage();
> >
> >         g_f1 = GROUP f1 BY $0;
> >         g_f2 = GROUP f2 BY $0;
> >
> >         joined = JOIN
> >                 g_f1  by group full outer,
> >                 g_f2  by group;
> >
> >         cmp = FILTER joined by
> >                 g_f1::group is null
> >                 or  g_f2::group is null
> >                 or  SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;
> >
> >         dump cmp;
> > ------
> >
> > Unfortunately, since no schema is specified at load time, g_f1::f1 and
> > g_f2::f2 are instances of DataByteArray. This means that the DIFF function
> > does not behave as wanted: a byte-by-byte comparison is performed rather
> > than a logical comparison. For example, "1       {(2),(1)}" and
> > "1       {(1),(2)}" are different since their byte representations are
> > not the same.
> >
> > Do you know if such a tool already exists, or how to write one?
> >
> > I currently foresee three options:
> >
> >   1- Specify the schema. It could be done using scripting and a
> >      file-to-schema mapping. The schema would be inserted using a
> >      variable. However, the schema of each file has to be described
> >      manually, which is a cumbersome process.
> >   2- Use PigStorageSchema instead of PigStorage. I believe this would
> >      solve the issue, but being stuck with 0.8.1 I'm wondering whether
> >      PigStorageSchema is reasonably robust and side-effect free enough to
> >      be used in production scripts.
> >   3- Write a custom DIFF UDF taking two DataByteArray. This option avoids
> >      modifying production scripts, but I don't know how much effort is
> >      required to write such a UDF. Parsing the DataByteArray to rebuild a
> >      set/list/string structure seems quite easy. Do you think some part

*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*