Pig, mail # user - String Representation of DataBag and its Schema


Re: String Representation of DataBag and its Schema
Dan DeCapria, CivicScienc... 2013-03-19, 17:20
I'll give it an honest try, and any additional input from the community is
greatly appreciated!

I've been on this idea for a few days now. I even implemented my own UDF
parser by converting the input to a char[] and pushing/popping on a
Stack of Node Objects to generate the nested inner complex DataTypes as a
Node tree. This worked well from a Node-linking standpoint, with a DFS
traversal on the Node tree to rebuild the DataBag Object. But it has
its caveats, as I have to create a UDF to generate the input for another
UDF, and it assumes the fields are free of the delimiter characters "{(})#,",
which isn't the case (e.g., a serialized JSON chararray for a field). So I was
hoping for a more OTS solution using existing classes and methods, given the
String and its Schema a priori.
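
For anyone following along, the stack-based idea above can be sketched roughly as follows. This is a minimal illustration and not my actual UDF: the class name BagStringParser and the Node type are hypothetical, it uses only the JDK (no Pig classes), it handles only bags and tuples (no maps), and it drops empty fields. It also reproduces the caveat described above, since any field containing a delimiter character is misread as nested structure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not part of Pig): parses bag/tuple text into a Node tree. */
class BagStringParser {

    /** One tree node: a bag ('{'), a tuple ('('), or a leaf field ('f'). */
    static final class Node {
        final char kind;
        final String value;                        // leaf text; null for containers
        final List<Node> children = new ArrayList<>();

        Node(char kind, String value) {
            this.kind = kind;
            this.value = value;
        }

        /** Depth-first traversal that rebuilds the textual form. */
        @Override
        public String toString() {
            if (kind == 'f') return value;
            StringBuilder sb = new StringBuilder().append(kind);
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i));
            }
            return sb.append(kind == '{' ? '}' : ')').toString();
        }
    }

    /** Scans the input once, pushing on open delimiters and popping on close. */
    static Node parse(String input) {
        Deque<Node> stack = new ArrayDeque<>();
        StringBuilder field = new StringBuilder();
        Node root = null;
        for (char c : input.toCharArray()) {
            switch (c) {
                case '{': case '(': {
                    Node container = new Node(c, null);
                    if (!stack.isEmpty()) stack.peek().children.add(container);
                    stack.push(container);
                    break;
                }
                case '}': case ')': {
                    flush(field, stack);
                    if (stack.isEmpty()) {
                        throw new IllegalArgumentException("Unmatched '" + c + "'");
                    }
                    Node closed = stack.pop();
                    char expectedOpen = (c == '}') ? '{' : '(';
                    if (closed.kind != expectedOpen) {
                        throw new IllegalArgumentException("Mismatched '" + c + "'");
                    }
                    root = closed;                 // last container closed is the outermost
                    break;
                }
                case ',':
                    flush(field, stack);
                    break;
                default:
                    field.append(c);               // ordinary field character
            }
        }
        if (field.length() > 0 || !stack.isEmpty() || root == null) {
            throw new IllegalArgumentException("Unbalanced input: " + input);
        }
        return root;
    }

    /** Attaches the pending field text (if any) to the container on top of the stack. */
    private static void flush(StringBuilder field, Deque<Node> stack) {
        if (field.length() == 0) return;           // limitation: empty fields are dropped
        if (stack.isEmpty()) {
            throw new IllegalArgumentException("Field outside any bag or tuple: " + field);
        }
        stack.peek().children.add(new Node('f', field.toString()));
        field.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(parse("{(a,1),(b,2)}"));
        // The JSON-like field below is wrongly parsed as a nested bag:
        System.out.println(parse("{(x,{\"k\":1})}").children.get(0).children.get(1).kind);
    }
}
```

Converting the resulting Node tree into real Pig DataBag and Tuple Objects would still need the Schema to decide each leaf's type, which is the part I would rather get from existing classes.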

Thank you for your help, and I'll keep this post updated on my progress
towards a solution.

-Dan

On Tue, Mar 19, 2013 at 12:54 PM, Jonathan Coveney <[EMAIL PROTECTED]>wrote:

> Ack, hit enter. I'd look at the LoadFunc interface, the PigStorage class,
> and if you can't make it work without playing a little, let me know.
>
>
> 2013/3/19 Jonathan Coveney <[EMAIL PROTECTED]>
>
> > doing "new PigStorage()" is possible, but tricky. Maybe some of the other
> > contributors have an easier way of doing this, but in the short term, I'd
> > work on getting that to work. It's mainly just making sure you initialize
> > it properly.
> >
> >
> > 2013/3/19 Dan DeCapria, CivicScience <[EMAIL PROTECTED]>
> >
> >> This would work, but the goal would be to *not* invoke local interactive
> >> pig to execute a LOAD USING PigStorage() and pass the data into the UDF. I
> >> was hoping to keep this completely in the Java and JUnit testing universe.
> >>
> >> Looking over the PigStorage() doc
> >> <https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html>,
> >> would you know how to construct this process from a baseline PigStorage
> >> Object, such as:
> >>
> >> PigStorage pigstorage = new PigStorage();
> >>
> >> Any ideas?
> >>
> >> -Dan
> >>
> >> On Tue, Mar 19, 2013 at 12:08 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> >>
> >> > I definitely understand the benefits, I just wanted to understand your
> >> > workflow so I could weigh in with what I would do.
> >> >
> >> > In your case, if you're going to be making these by hand, then I would
> >> > mimic what PigStorage outputs, and then just load it in using PigStorage.
> >> >
> >> >
> >> > 2013/3/19 Dan DeCapria, CivicScience <[EMAIL PROTECTED]>
> >> >
> >> > > By hand; creating a new JUnit method to test a specific use case against a
> >> > > functional requirement in the UDF.
> >> > >
> >> > > The UDFs I am testing are part of a larger ETL testing initiative I have
> >> > > been undertaking. To ensure that the various states of legacy data are
> >> > > correctly extracted and transformed into a Pig context, I am creating
> >> > > specific JUnit tests for each UDF, containing specific use cases as
> >> > > testing methods.
> >> > >
> >> > > The motivation for using String inputs for the Data Objects and Schema
> >> > > Objects is to improve on the conventional approach: creating Java Objects
> >> > > and adding and appending nested Objects to create the desired complex type
> >> > > DataBag Object to pass to the UDF as use case input. The simpler process
> >> > > I'm looking for should improve scalability and rapid prototyping within
> >> > > the testing scripts. It will also make the process more approachable for
> >> > > another programmer to write additional unit tests.
> >> > >
> >> > > -Dan
> >> > >
> >> > > On Tue, Mar 19, 2013 at 11:43 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> >> > >
> >> > > > How are you planning on generating these cases? By hand? Or automated?
> >> > > >
> >> > > >
> >> > > > 2013/3/19 Dan DeCapria, CivicScience <
> [EMAIL PROTECTED]

Dan DeCapria
CivicScience, Inc.
Senior Informatics / DM / ML / BI Specialist