Pig >> mail # user >> String Representation of DataBag and its Schema


Dan DeCapria, CivicScienc... 2013-03-18, 20:18
Jonathan Coveney 2013-03-18, 22:31
Dan DeCapria, CivicScienc... 2013-03-19, 13:37
Dan DeCapria, CivicScienc... 2013-03-19, 15:16
Jonathan Coveney 2013-03-19, 15:27
Dan DeCapria, CivicScienc... 2013-03-19, 15:37
Jonathan Coveney 2013-03-19, 15:43
Dan DeCapria, CivicScienc... 2013-03-19, 15:52
Jonathan Coveney 2013-03-19, 16:08
Dan DeCapria, CivicScienc... 2013-03-19, 16:40
Jonathan Coveney 2013-03-19, 16:53
Jonathan Coveney 2013-03-19, 16:54

Re: String Representation of DataBag and its Schema
I'll give it an honest try, and any additional input from the community is
greatly appreciated!

I've been working on this idea for a few days now.  I even implemented my own
UDF parser, which converts the input to a char[] and pushes and pops Node
objects on a Stack to build the nested complex DataTypes as a Node tree. This
worked well from a Node-linking standpoint, with a DFS traversal of the Node
tree to rebuild the DataBag object. But it has its caveats: I have to create
a UDF just to generate the input for another UDF, and it assumes that field
values never contain the delimiter characters "{(})#,", which isn't the case
(e.g., a serialized JSON chararray in a field). So I was hoping for a more
off-the-shelf solution using existing classes and methods, given the String
and its Schema a priori.
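
For anyone following along, the stack-based approach can be sketched roughly
as below. This is not my actual UDF parser, just a minimal illustration in
plain Java with no Pig dependency. The names BagStringParser and Node are
hypothetical, nested Lists of Nodes stand in for Pig's DataBag and Tuple, and
maps ('#' and '[...]') and empty fields are deliberately not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: parses text such as {(1,abc),(2,{(x)})} into a Node tree. */
final class BagStringParser {

    /** Hypothetical node type: BAG and TUPLE hold children, ATOM holds raw text. */
    static final class Node {
        enum Kind { BAG, TUPLE, ATOM }

        final Kind kind;
        final String text;                            // non-null only for ATOM
        final List<Node> children = new ArrayList<>();

        Node(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        /** Depth-first traversal that rebuilds the textual form. */
        @Override
        public String toString() {
            if (kind == Kind.ATOM) {
                return text;
            }
            StringBuilder sb = new StringBuilder(kind == Kind.BAG ? "{" : "(");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(children.get(i));
            }
            return sb.append(kind == Kind.BAG ? '}' : ')').toString();
        }
    }

    static Node parse(String input) {
        Deque<Node> stack = new ArrayDeque<>();       // currently open bags/tuples
        StringBuilder atom = new StringBuilder();     // characters of the current field
        Node root = null;
        for (char c : input.toCharArray()) {
            switch (c) {
                case '{':
                case '(': {
                    Node opened = new Node(c == '{' ? Node.Kind.BAG : Node.Kind.TUPLE, null);
                    if (!stack.isEmpty()) {
                        stack.peek().children.add(opened);   // link to parent
                    }
                    stack.push(opened);
                    break;
                }
                case '}':
                case ')': {
                    flushAtom(atom, stack);
                    if (stack.isEmpty()) {
                        throw new IllegalArgumentException("Unbalanced input: " + input);
                    }
                    Node closed = stack.pop();
                    if ((c == '}') != (closed.kind == Node.Kind.BAG)) {
                        throw new IllegalArgumentException("Mismatched delimiter: " + input);
                    }
                    if (stack.isEmpty()) {
                        root = closed;                       // outermost node is complete
                    }
                    break;
                }
                case ',':
                    flushAtom(atom, stack);                  // field separator
                    break;
                default:
                    atom.append(c);
            }
        }
        if (root == null || !stack.isEmpty() || atom.length() > 0) {
            throw new IllegalArgumentException("Unbalanced input: " + input);
        }
        return root;
    }

    /** Attaches any pending field text to the innermost open node (empty fields are skipped). */
    private static void flushAtom(StringBuilder atom, Deque<Node> stack) {
        if (atom.length() == 0) {
            return;
        }
        if (stack.isEmpty()) {
            throw new IllegalArgumentException("Field outside of any bag or tuple");
        }
        stack.peek().children.add(new Node(Node.Kind.ATOM, atom.toString()));
        atom.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(parse("{(1,abc),(2,{(x),(y)})}"));
    }
}
```

The sketch also shows the caveat: given a tuple whose only field is the JSON
text {"k":1}, the braces are treated as bag delimiters, so the field comes
back as a nested BAG node rather than a single ATOM.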

Thank you for your help, and I'll keep this post updated on my progress
towards a solution.

-Dan

On Tue, Mar 19, 2013 at 12:54 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Ack, hit enter. I'd look at the LoadFunc interface and the PigStorage class,
> and if you can't make it work after playing with it a little, let me know.
>
>
> 2013/3/19 Jonathan Coveney <[EMAIL PROTECTED]>
>
> > Doing "new PigStorage()" is possible, but tricky. Maybe some of the other
> > contributors have an easier way of doing this, but in the short term, I'd
> > work on getting that to work. It's mainly a matter of making sure you
> > initialize it properly.
> >
> >
> > 2013/3/19 Dan DeCapria, CivicScience <[EMAIL PROTECTED]>
> >
> >> This would work, but the goal would be to *not* invoke local interactive
> >> pig to execute a LOAD USING PigStorage() and pass the data into the UDF.
> >> I was hoping to keep this completely in the Java and JUnit testing
> >> universe.
> >>
> >> Looking over the PigStorage() doc
> >> <https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html>,
> >> would you know how to construct this process from a baseline PigStorage
> >> Object, such as:
> >>
> >> PigStorage pigstorage = new PigStorage();
> >>
> >> Any ideas?
> >>
> >> -Dan
> >>
> >> On Tue, Mar 19, 2013 at 12:08 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> >>
> >> > I definitely understand the benefits, I just wanted to understand your
> >> > workflow so I could weigh in with what I would do.
> >> >
> >> > In your case, if you're going to be making these by hand, then I would
> >> > mimic what PigStorage outputs, and then just load it in using
> >> > PigStorage.
> >> >
> >> >
> >> > 2013/3/19 Dan DeCapria, CivicScience <[EMAIL PROTECTED]>
> >> >
> >> > > By hand; creating a new JUnit method to test a specific use case
> >> > > against a functional requirement in the UDF.
> >> > >
> >> > > The UDFs I am testing are part of a larger ETL testing initiative I
> >> > > have been undertaking.  To ensure that the various states of legacy
> >> > > data are correctly extracted and transformed into a Pig context, I am
> >> > > creating specific JUnit tests for each UDF, containing specific use
> >> > > cases as testing methods.
> >> > >
> >> > > The motivation for using String inputs for the Data Objects and
> >> > > Schema Objects is to improve on the conventional approach: creating
> >> > > Java Objects and adding and appending nested Objects to build the
> >> > > desired complex-type DataBag Object to pass to the UDF as use case
> >> > > input. The simpler process I'm looking for should improve scalability
> >> > > and rapid prototyping within the testing scripts.  It will also make
> >> > > the process more approachable for another programmer writing
> >> > > additional unit tests.
> >> > >
> >> > > -Dan
> >> > >
> >> > > On Tue, Mar 19, 2013 at 11:43 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> >> > >
> >> > > > How are you planning on generating these cases? By hand? Or
> >> > > > automated?
> >> > > >
> >> > > >
> >> > > > 2013/3/19 Dan DeCapria, CivicScience <
> [EMAIL PROTECTED]

Dan DeCapria
CivicScience, Inc.
Senior Informatics / DM / ML / BI Specialist
William Oberman 2013-03-21, 15:51
Dan DeCapria, CivicScienc... 2013-03-19, 15:43