Vincent Barat
2010-07-01, 16:35
Dmitriy Ryaboy
2010-07-01, 16:47
Dmitriy Ryaboy
2010-07-01, 16:48
Ashutosh Chauhan
2010-07-01, 17:48
Dmitriy Ryaboy
2010-07-01, 17:52
Ashutosh Chauhan
2010-07-01, 18:03
Dave Viner
2010-07-01, 19:53
Dmitriy Ryaboy
2010-07-01, 20:16
Dave Viner
2010-07-01, 20:55
Dmitriy Ryaboy
2010-07-01, 21:15
Mridul Muralidharan
2010-07-07, 20:38
-
Re: UDF and rdbms lookups
Vincent Barat 2010-07-01, 16:35
I do this in a static block of the UDF class, or by initializing static variables. Maybe there is a better way, but I don't know of one.

Dave Viner <[EMAIL PROTECTED]> wrote:
> In a custom UDF, what's the most appropriate way to initialize and connect
> to an old-fashioned rdbms?
>
> I wrote a simple UDF which opens/closes a connection on each exec(), but
> this feels a bit like overkill. Is there an "init()" method that is invoked
> in a UDF to help with one-time initialization (like a database connection or
> sql query preparation)?
>
> Thanks
> Dave Viner
-
Re: UDF and rdbms lookups
Dmitriy Ryaboy 2010-07-01, 16:47
The simplest thing you can do is to keep a database handle at the object level, set it to null, and initialize it in eval() if you see that it's null.
You can also init the connection in the constructor.
A static dbh will let you share it across tasks, if you persist the JVM.
Naturally you will want to throw in some code to handle dropped connections and all that.
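The lazy-initialization idea described above can be sketched roughly as follows. This is not code from the thread: `DbHandle`, `LazyLookup`, and `lookupOne` are hypothetical names, and the class does not extend Pig's EvalFunc so that it compiles without Pig on the classpath. In a real UDF the body of `lookupOne` would live in exec(Tuple), and `DbHandle` would be something like a JDBC connection.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a real database handle (e.g. a JDBC Connection).
interface DbHandle {
    boolean isValid();
    String lookup(String key);
}

// Not a real Pig class: only the lazy-initialization logic is shown.
class LazyLookup {
    private final Supplier<DbHandle> connector; // opens a new connection on demand
    private DbHandle dbh = null;                // object-level handle, initially null
    int connectionsOpened = 0;                  // counter kept for illustration only

    LazyLookup(Supplier<DbHandle> connector) {
        this.connector = connector;
    }

    // Plays the role of exec(): called once per input record.
    String lookupOne(String key) {
        // Connect on first use, or reconnect if the handle was dropped.
        if (dbh == null || !dbh.isValid()) {
            dbh = connector.get();
            connectionsOpened++;
        }
        return dbh.lookup(key);
    }
}
```

With this shape, the connection cost is paid once per UDF instance rather than once per record, and a dropped connection is replaced on the next call instead of failing the task.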
-
Re: UDF and rdbms lookups
Dmitriy Ryaboy 2010-07-01, 16:48
Also -- I hope your cluster is not too big. It's really easy to DDoS your database using Hadoop.
-
Re: UDF and rdbms lookups
Ashutosh Chauhan 2010-07-01, 17:48
There is an uncommitted Piggybank UDF which may help you: https://issues.apache.org/jira/browse/PIG-1229
You can try the first patch (pig-1229.2.patch by Ankur) listed on the page. It does a different thing, writing rows from Pig into the DB, but you can borrow the DB connection part from it.

Note to self: I really want to get this patch committed before more people reinvent the wheel of making Pig talk to a DB.
-
Re: UDF and rdbms lookups
Dmitriy Ryaboy 2010-07-01, 17:52
Can you put a LOG.info and javadoc into this patch saying "watch out, DB connection bomb being deployed"? :)
-
Re: UDF and rdbms lookups
Ashutosh Chauhan 2010-07-01, 18:03
That will be a day of rejoicing when a multi-million Oracle deployment comes to a grinding halt because of a teeny-weeny 4-line Pig script. *wink* ;)

Ashutosh
-
Re: UDF and rdbms lookups
Dave Viner 2010-07-01, 19:53
@Dmitriy, you mentioned an eval() method... is that part of the UDF? Or do you mean exec()?

I think my confusion may be that I'm not clear on the actual steps taken when a UDF is invoked. Clearly, the key step is to invoke the exec(Tuple input) method. But it would appear that an object is instantiated first. Are there any parameters passed to the constructor? Or is there any way to influence those parameters?

Also, how many objects would be constructed? Is it one for each invocation of the UDF? Or one for each process managing the map/reduce?

@Ashutosh, this is a neat patch. Reading/writing to a DB would be super helpful from within Pig. But I don't have enough Pig experience to know how to translate a StoreFunc into an EvalFunc. In your code, the constructor sets up the variables and then prepareToWrite actually handles the connection to the database. Is there some similar call in an EvalFunc which is like a "prepareToExec"?

Thanks
Dave Viner
-
Re: UDF and rdbms lookups
Dmitriy Ryaboy 2010-07-01, 20:16
Yes, I mean exec().

The constructor will be called "at least 1 time". It will not be called once per tuple -- the UDF object is created when the data starts flowing, and is destroyed when it stops. So you can put things into the constructor.

By default, a no-argument constructor gets invoked. You can make Pig use a constructor that takes string arguments (strings only!) by "defining" a function, like so:

DEFINE MyFunction com.my.company.MyFunction('foo', 'bar');

[...]

foobar = FOREACH some_relation GENERATE MyFunction(some_field);

This will cause the relation foobar to get populated by the results of calling MyFunction.exec on some_field of every tuple in some_relation, with MyFunction having been instantiated using the arguments 'foo' and 'bar'.
The instantiation will happen a few times on the client side (your machine), while Pig tries to compile the program and send it to Hadoop, and one or more times per task in Hadoop (in practice, you can pretend it's just once per task).

-Dmitriy
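As a rough illustration of the strings-only constructor rule, the sketch below shows a class with both a no-argument constructor and a two-string constructor. `DbLookupSettings`, its fields, and the `jdbc:example:` URL are made-up placeholders, and the class is deliberately not a real EvalFunc subclass so that it compiles on its own.

```java
// Hypothetical settings holder; in a real UDF these constructors would sit on
// the class that extends org.apache.pig.EvalFunc, matching a script line like
//   DEFINE Lookup com.example.DbLookup('jdbc:example://dbhost/lookups', '5');
// (com.example.DbLookup and the URL are placeholders, not real names).
class DbLookupSettings {
    final String jdbcUrl;
    final int maxRetries;

    // Used when the script calls the function without DEFINE arguments.
    DbLookupSettings() {
        this("jdbc:example://localhost/lookups", "3");
    }

    // DEFINE passes only strings, so numeric settings are parsed here.
    DbLookupSettings(String jdbcUrl, String maxRetries) {
        this.jdbcUrl = jdbcUrl;
        this.maxRetries = Integer.parseInt(maxRetries);
    }
}
```

Because the constructor also runs a few times on the client while the script is compiled, one possible design is to store only settings here and open the actual connection later, on the first exec() call.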
-
Re: UDF and rdbms lookups
Dave Viner 2010-07-01, 20:55
Hi Dmitriy,

Thanks! This is very helpful!

Is there a method that gets called when the UDF object is being destroyed? Something that allows for cleanup?

Thanks again.
Dave Viner
-
Re: UDF and rdbms lookups
Dmitriy Ryaboy 2010-07-01, 21:15
Yep, it's the finish() method.
See javadocs: http://hadoop.apache.org/pig/javadoc/docs/api/org/apache/pig/EvalFunc.html
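A small sketch of cleanup in finish() might look like the following. `UdfSketch` is a hypothetical stand-in for the parts of EvalFunc involved, so the example compiles without Pig; note that the real exec() takes a Tuple and declares IOException, which this simplified version does not. `CleanupLookup` and the `Closeable` resource are also illustrative assumptions.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for org.apache.pig.EvalFunc (simplified signatures).
abstract class UdfSketch<T> {
    abstract T exec(String input);
    void finish() { }  // no-op by default, overridden for cleanup
}

class CleanupLookup extends UdfSketch<String> {
    private final Closeable resource; // e.g. a wrapper around a DB connection
    private boolean closed = false;

    CleanupLookup(Closeable resource) {
        this.resource = resource;
    }

    @Override
    String exec(String input) {
        if (closed) {
            throw new IllegalStateException("exec() called after finish()");
        }
        return input.trim(); // placeholder for the real lookup
    }

    @Override
    void finish() {
        // Release the resource exactly once, even if finish() is called again.
        if (!closed) {
            closed = true;
            try {
                resource.close();
            } catch (IOException e) {
                // Best effort: nothing useful can be done at teardown.
            }
        }
    }

    boolean isClosed() {
        return closed;
    }
}
```

Making finish() idempotent is a defensive choice in this sketch; the thread does not say how many times Pig calls it.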
-
Re: UDF and rdbms lookups
Mridul Muralidharan 2010-07-07, 20:38
You will need to look at the lifecycle of a UDF to better understand this.
Typically they are created (note: one or more creations!) during plan creation time (before job submission) and subsequently deserialized on the various mapper/reducer nodes to get executed (iirc).

So typically what I have in my code path is:

---- cut start ---

// default will be false
transient boolean initialized = false;

exec() {
    if (!initialized) doInit();
    ...
}

doInit() {
    // acquire resources (sockets, rdbms conn, etc.), initialize state
    // (create directory/files, copy from hdfs to local, etc.)
    initialized = true;
}

---- cut end ---

If I am not wrong, each UDF invocation in Pig results in a new UDF instance being created, so use with care (you can have M * N rdbms connections if there are M mappers and N invocations in a mapred job).

Regards,
Mridul
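One way to limit the M * N effect within a single JVM is the static handle mentioned earlier in the thread: every UDF instance in the same JVM shares one connection. The sketch below is only an assumption about how that could look. `SharedHandle` is a hypothetical helper, and a plain `Object` stands in for the database connection.

```java
import java.util.function.Supplier;

// Hypothetical holder: one handle per JVM, shared by every UDF instance in it.
final class SharedHandle {
    private static Object handle;   // stands in for a DB connection
    private static int opened = 0;  // counter kept for illustration only

    private SharedHandle() { }

    // synchronized so that concurrent callers cannot open two connections.
    static synchronized Object get(Supplier<Object> connector) {
        if (handle == null) {
            handle = connector.get();
            opened++;
        }
        return handle;
    }

    static synchronized int timesOpened() {
        return opened;
    }
}
```

This only reduces connections per JVM. Each mapper or reducer process would still open its own, so the total still grows with the number of tasks running at once.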