-
Best practice for automating jobs
Tom Brown 2013-01-10, 22:03
All,
I want to automate jobs against Hive (using an external table with ever-growing partitions), and I'm running into a few challenges:
Concurrency - If I run Hive as a thrift server, I can only safely run one job at a time. As such, it seems like my best bet will be to run it from the command line and set up a brand new instance for each job. That's quite a bit of a hassle to solve a seemingly common problem, so I want to know if there are any accepted patterns or best practices for this?
Partition management - New partitions will be added regularly. If I have to set up multiple instances of Hive for each (potentially) overlapping job, it will be difficult to keep track of the partitions that have been added. In the context of the preceding question, what is the best way to add metadata about new partitions?
Thanks in advance!
--Tom
-
Re: Best practice for automating jobs
Sean McNamara 2013-01-10, 22:11
> I want to know if there are any accepted patterns or best practices for
> this?

http://oozie.apache.org/

> New partitions will be added regularly

What type of partitions are you adding? Why frequently?

Sean
-
Re: Best practice for automating jobs
Dean Wampler 2013-01-10, 22:30
If you know make and bash, have a look at Stampede for scheduling work: https://github.com/ThinkBigAnalytics/stampede (Full disclosure: I wrote it)

On Thu, Jan 10, 2013 at 4:11 PM, Sean McNamara <[EMAIL PROTECTED]> wrote:
> > I want to know if there are any accepted patterns or best practices for
> > this?
>
> http://oozie.apache.org/

With both Stampede and Oozie, you can tell them to watch for certain data to show up, e.g., a _SUCCESS file marker in a directory getting new data files, and then start a Hive query, etc. You can also add your partition creation commands in the workflow, e.g., as soon as the data is present (or even before; Hive won't care if it doesn't exist yet).

> > New partitions will be added regularly

When you add a partition, that metadata goes into the metastore, so every hive instance sharing that metastore will see it. Of course, you should avoid scenarios where multiple processes attempt to create the same partition, although if they are using exactly the same command, then adding an IF NOT EXISTS clause will avoid error messages. Still, I wouldn't want to torture test the metastore...

> What type of partitions are you adding? Why frequently?

--
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330
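The marker-then-partition workflow described above can be sketched in Python. This is only an illustration under stated assumptions, not how Stampede or Oozie implement it: the function names, the table name, the partition column and the paths are all hypothetical, and the `exists` check is passed in by the caller because the right way to test for a file depends on where the data lives. The only Hive features relied on are the `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` statement and the `hive -e` command-line option.

```python
import time
from typing import Callable, List, Mapping


def add_partition_sql(table: str, spec: Mapping[str, str], location: str) -> str:
    """Build an ADD PARTITION statement that is safe to repeat."""
    # Values are rendered as quoted strings, e.g. dt='2013-01-10'.
    # No escaping is done here, so inputs are assumed to be trusted.
    parts = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({parts}) LOCATION '{location}'"
    )


def wait_for_marker(exists: Callable[[str], bool], data_dir: str,
                    attempts: int = 10, delay_s: float = 30.0) -> bool:
    """Poll until data_dir/_SUCCESS is visible or the attempts run out."""
    marker = data_dir.rstrip("/") + "/_SUCCESS"
    for attempt in range(attempts):
        if exists(marker):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


def hive_command(statement: str) -> List[str]:
    """Argument list for running one statement through the Hive CLI."""
    return ["hive", "-e", statement]
```

A scheduler script could call `wait_for_marker` with whatever existence check fits the storage layer, then hand the result of `hive_command(add_partition_sql(...))` to `subprocess.run`. Because the statement carries IF NOT EXISTS, re-running the same job does not produce an error for an already registered partition.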
-
Re: Best practice for automating jobs
Qiang Wang 2013-01-11, 01:31
I believe the HWI (Hive Web Interface) can give you a hand.

https://github.com/anjuke/hwi

You can use the HWI to submit and run queries concurrently.
Partition management can be achieved by creating crontabs using the HWI.

It's simple and easy to use. Hope it helps.

Regards,
Qiang
-
Re: Best practice for automating jobs
Tom Brown 2013-01-11, 02:55
How is concurrency achieved with this solution?
-
Re: Best practice for automating jobs
Qiang Wang 2013-01-11, 03:06
The HWI will create a CLI session for each query through the Hive libs, so several queries can run concurrently.
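The same idea of one session per query can be approximated from a script by starting a separate Hive CLI process for each query. The sketch below is not HWI's implementation (HWI works through the Hive libraries in-process); it is a hypothetical stand-in that assumes the `hive` executable is on the PATH and that the metastore accepts more than one client at a time. The names `run_hive` and `run_concurrently` are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_hive(query: str) -> int:
    """Run one query in its own Hive CLI process and return its exit code."""
    return subprocess.run(["hive", "-e", query]).returncode


def run_concurrently(queries: List[str],
                     runner: Callable[[str], int] = run_hive,
                     max_workers: int = 4) -> Dict[str, int]:
    """Run each query in a separate session and map query -> exit code."""
    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(runner, queries))
    return dict(zip(queries, codes))
```

The `runner` parameter exists so the fan-out logic can be exercised without a Hive installation, for example by passing a stub function in a test.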
-
Re: Best practice for automating jobs
Tom Brown 2013-01-11, 03:17
When I've tried to create concurrent CLI sessions, I thought the second one got an error about not being able to lock the metadata store.

Is that error a real thing, or have I been mistaken this whole time?

--Tom
-
Re: Best practice for automating jobs
Qiang Wang 2013-01-11, 03:22
Are you using the embedded metastore? Only one process can connect to that metastore at a time.
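One way to check which metastore mode a client is configured for is to inspect its hive-site.xml. The following is a rough heuristic, not an official check: it assumes that a non-empty `hive.metastore.uris` means a remote metastore service, and that a `javax.jdo.option.ConnectionURL` of the form `jdbc:derby:;databaseName=...` (which is also assumed here to be the default when the property is absent) means embedded Derby. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Assumed default when hive-site.xml does not set the connection URL.
DEFAULT_URL = "jdbc:derby:;databaseName=metastore_db;create=true"


def read_properties(hive_site_xml: str) -> Dict[str, str]:
    """Parse the <property> name/value pairs out of hive-site.xml text."""
    root = ET.fromstring(hive_site_xml)
    props: Dict[str, str] = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def looks_embedded(props: Dict[str, str]) -> bool:
    """Guess whether this client would open an embedded Derby metastore."""
    # A configured remote metastore service takes the database out of
    # the client process entirely.
    if props.get("hive.metastore.uris"):
        return False
    url = props.get("javax.jdo.option.ConnectionURL", DEFAULT_URL)
    # jdbc:derby://host:port/... is Derby in network-server mode, which
    # is a separate process rather than an embedded database.
    return url.startswith("jdbc:derby:") and not url.startswith("jdbc:derby://")
```

If `looks_embedded` returns True, a second CLI process started from the same directory is likely to hit the kind of lock error described in this thread.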
-
Re: Best practice for automating jobs
Alexander Alten-Lorenz 2013-01-11, 07:23
+1

This is the best solution to automate jobs.

cheers,
Alex

On Jan 10, 2013, at 11:11 PM, Sean McNamara <[EMAIL PROTECTED]> wrote:
>> I want to know if there are any accepted patterns or best practices for
>> this?
>
> http://oozie.apache.org/

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF
-
Re: Best practice for automating jobs
Manish Malhotra 2013-01-11, 18:56
When you are using the CLI library, it internally uses ZooKeeper (or whichever supported locking service is configured), so no extra effort is required for locking.

There is, though, a patch for a HiveServer ZooKeeper leak, HIVE-3723, which people are trying on 0.9 and 0.10.

Regards,
Manish
-
Re: Best practice for automating jobs
Tom Brown 2013-01-11, 22:58
Thank you all very much for your feedback and ideas!

--Tom