Pig >> mail # user >> How to perform a logical diff on two PigStorage files


Clément MATHIEU 2012-11-30, 13:48
Ruslan Al-Fakikh 2012-11-30, 20:49
Re: How to perform a logical diff on two PigStorage files
I've done this in two passes. First I do an intersection test and determine
the outer misses by join key on each side, similar to what you've done. I
then store the left_only and right_only sides for further inspection.
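
For illustration only, the first pass can be sketched in plain Python rather
than Pig. The function name split_by_key is hypothetical, and the sketch
assumes each row is a tuple whose first field is the key and that keys are
unique on each side. It is not the actual script described above.

```python
def split_by_key(left_rows, right_rows):
    """Split two row collections into (both, left_only, right_only).

    Assumption: row[0] is the join key and is unique within each side.
    """
    left = {row[0]: row for row in left_rows}
    right = {row[0]: row for row in right_rows}

    # Keys present on both sides: keep the (left, right) pair for pass two.
    both = {k: (left[k], right[k]) for k in left.keys() & right.keys()}

    # Outer misses: keys present on one side only.
    left_only = {k: left[k] for k in left.keys() - right.keys()}
    right_only = {k: right[k] for k in right.keys() - left.keys()}
    return both, left_only, right_only
```

In Pig the same split would typically come from a full outer join followed by
null checks on each side's key.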

Then I take the intersection relation, which contains a left and a right
tuple, and I pass that through a UDF. This is similar to your #3 proposal,
only the UDF takes two tuples. It traverses them in parallel before
outputting a string representation of a bitmask showing which tuple fields
matched or missed. Group on the bitmasks to generate counts and you get a
report of all the different combos of field misses. All without a known
schema.
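
As a rough sketch of the second pass, here is the bitmask idea in plain Python
rather than as a Pig UDF. The names match_bitmask and mismatch_report are
hypothetical, fields are compared with plain equality, and a position that
exists on only one side is assumed to count as a miss. A real UDF would
receive Pig tuples and could plug in a bag-aware comparison instead.

```python
from collections import Counter


def match_bitmask(left, right):
    """Return one character per field position: '1' if the two fields at
    that position are equal, '0' otherwise.

    Assumption: a position missing on one side is reported as a miss.
    """
    width = max(len(left), len(right))
    bits = []
    for i in range(width):
        if i < len(left) and i < len(right) and left[i] == right[i]:
            bits.append("1")
        else:
            bits.append("0")
    return "".join(bits)


def mismatch_report(pairs):
    """Count how many (left, right) pairs produce each bitmask,
    mirroring the 'group on the bitmasks' step."""
    return Counter(match_bitmask(left, right) for left, right in pairs)
```

Feeding the (left, right) pairs from the intersection into mismatch_report
gives one count per distinct pattern of matching and missing fields, and no
schema is needed because fields are only compared position by position.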

On Fri, Nov 30, 2012 at 12:49 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:

> Hi,
>
> As for point 1: it will always be cumbersome to work on such files. I would
> recommend using Avro, where the schema is included in the file.
> Also you could try to sort the contents or apply some transformation to force
> the files to look the same. Then just diff the files outside of Pig. That's
> just an idea; I'm not sure whether it'll work for you.
>
> Thanks
>
>
> On Fri, Nov 30, 2012 at 5:48 PM, Clément MATHIEU <[EMAIL PROTECTED]> wrote:
>
> > Hi all,
> >
> > I'm trying to build a non-regression testing tool to verify that the files
> > produced by two Pig scripts are equal.
> >
> > The files are in PigStorage format. The first field is a key and the
> > remaining fields are opaque data (primitive or complex types).
> >
> > Example:
> >         1       43      {(10), (12), (14)}      {(55), (90)}    0       60
> >
> > I want to check that each key is present in both files or in neither, and
> > that for each key the lines are equal. By equal I mean logical equality,
> > not string or byte equality. For example, the following two lines should
> > be equal:
> >         1       43      {(10), (12), (14)}      {(55), (90)}    0       60
> >         1       43      {(12), (10), (14)}      {(90), (55)}    0       60
> >
> >
> > My issue is that since this tool needs to operate on a lot of different
> > files, it should not rely on a predefined schema. I experimented with
> > the following idea:
> >
> > ------
> >         f1 = LOAD '$FILE1' USING PigStorage();
> >         f2 = LOAD '$FILE2' USING PigStorage();
> >
> >         g_f1 = GROUP f1 BY $0;
> >         g_f2 = GROUP f2 BY $0;
> >
> >         joined = JOIN
> >                 g_f1  by group full outer,
> >                 g_f2  by group;
> >
> >         cmp = FILTER joined by
> >                 g_f1::group is null
> >                 or  g_f2::group is null
> >                 or  SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;
> >
> >         dump cmp;
> > ------
> >
> > Unfortunately, since no schema is specified at load time, g_f1::f1 and
> > g_f2::f2 are instances of DataByteArray. This means that the DIFF function
> > does not behave as wanted: a byte-by-byte comparison is performed rather
> > than a logical comparison. For example, "1       {(2),(1)}" and
> > "1       {(1),(2)}" are different since their byte representations are
> > not the same.
> >
> > Do you know if such a tool already exists, or how to write it?
> >
> > I currently foresee three options:
> >
> >   1- Specify the schema. It could be done using scripting and a
> >      file-to-schema mapping. The schema would be inserted using a variable.
> >      However, the schema of each file has to be described manually. This is
> >      a cumbersome process.
> >   2- Use PigStorageSchema instead of PigStorage. I believe this would solve
> >      the issue, but being stuck with 0.8.1 I'm wondering if PigStorageSchema
> >      is reasonably robust and side-effect free to be used in production
> >      scripts.
> >   3- Write a custom DIFF UDF taking two DataByteArray. This option avoids
> >      modifying production scripts, but I don't know how much effort is
> >      required to write such a UDF. Parsing the DataByteArray to rebuild a
> >      set/list/string structure seems quite easy. Do you think some part

*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*