-
Input and output path
Mohit Anchlia 2012-09-10, 23:11
Our input path is something like YYYY/MM/DD/HH/input, and we would like to write to YYYY/MM/DD/HH/output. Is it possible to get the input path as a String and convert it to YYYY/MM/DD/HH/output so that I can use it in the "store into" clause?
-
Re: Input and output path
Ruslan Al-Fakikh 2012-09-10, 23:17
Mohit,

I guess you could use parameter substitution here:
http://wiki.apache.org/pig/ParameterSubstitution

Also, a note about your architecture: you could consider using Hive partitions to select the appropriate dates efficiently through the folder names. Since your tool is Pig, not Hive, you can use HCatalog as a layer.

Best Regards
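A minimal sketch of how parameter substitution could be driven from a small wrapper, assuming the Pig script refers to `$input` and `$output`. The function names and the script name `process.pig` are hypothetical; only the `pig -param name=value` form comes from Pig itself.

```python
import posixpath


def derive_output_path(input_path):
    """Swap a trailing 'input' path component for 'output'."""
    parent, leaf = posixpath.split(input_path.rstrip("/"))
    if leaf != "input":
        raise ValueError("expected a path ending in 'input': %r" % input_path)
    return posixpath.join(parent, "output")


def build_pig_command(script, input_path):
    """Build the argument list for a `pig -param ...` invocation.

    `script` is a hypothetical Pig script that uses $input and $output.
    """
    output_path = derive_output_path(input_path)
    return [
        "pig",
        "-param", "input=%s" % input_path,
        "-param", "output=%s" % output_path,
        script,
    ]
```

The returned list can be passed to `subprocess.call`, or joined with spaces to see the equivalent shell command.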
-
Re: Input and output path
Mohit Anchlia 2012-09-10, 23:29
On Mon, Sep 10, 2012 at 4:17 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
> I guess you could use parameters substitution here
> http://wiki.apache.org/pig/ParameterSubstitution

Thanks, this works.

> Also, a note about your architecture:

Are you suggesting a change to the path names, or is your suggestion to use HCatalog with Pig?
-
Re: Input and output path
Ruslan Al-Fakikh 2012-09-11, 15:12
Mohit,

I am suggesting setting up a whole Hive warehouse. This way your folders will look like

    /user/hive/warehouse/yourdataset/date=2012-09-11
    /user/hive/warehouse/yourdataset/date=2012-09-12
    ...

All the partitions' metadata will be kept in an RDBMS, so when you query them with Hive it will look like

    select * from yourdataset where date = '2012-09-11'

and it will be fast, because only the matching partition directories are read.

HCatalog is a layer that provides this Hive functionality to Pig and MapReduce, so in Pig you can FILTER by those dates.
http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html#Load+Examples

Best Regards
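To make the layout change concrete, here is a small sketch that maps an hourly stamp from the existing YYYY/MM/DD/HH scheme to the date-partitioned directory described above. The function name and its arguments are hypothetical, and the hour is dropped because the example partitions by date only.

```python
from datetime import datetime


def partition_dir(warehouse_root, dataset, yyyymmddhh):
    """Map a yyyymmddhh stamp to a Hive-style date=YYYY-MM-DD directory."""
    stamp = datetime.strptime(yyyymmddhh, "%Y%m%d%H")
    return "%s/%s/date=%s" % (
        warehouse_root.rstrip("/"),
        dataset,
        stamp.strftime("%Y-%m-%d"),
    )
```

In a real warehouse, Hive or HCatalog manages these directories when partitions are added; the sketch only shows how the two naming schemes correspond.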
-
Re: Input and output path
MiaoMiao 2012-09-13, 02:31
I wrote a Python script to do this:

    import sys
    from org.apache.pig.scripting import Pig

    yyyymmddhh = sys.argv[1]
    # getInputPath/getOutputPath are user-defined helpers
    inputPath = getInputPath(yyyymmddhh)    # yyyymmddhh to "YYYY/MM/DD/HH/input"
    outputPath = getOutputPath(yyyymmddhh)  # yyyymmddhh to "YYYY/MM/DD/HH/output"

    pigScript = '''
    some = load '$input' using PigStorage(',')
    as(
        id:INT,
        value:INT
    );
    final = ..... ;
    STORE final INTO '$output' using PigStorage(',');
    '''

    # Compile once, bind the two paths, and run a single job
    P = Pig.compile(pigScript)
    result = P.bind({'input': inputPath, 'output': outputPath}).runSingle()
    if result.isSuccessful():
        print 'Pig job succeeded'
    else:
        raise RuntimeError('Pig job failed')

Then you can run it with pig:

    pig -x local pig.py 2012091108
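The script calls `getInputPath` and `getOutputPath` without defining them. One possible implementation is sketched below; it is an assumption about what those helpers do, based only on the comments in the script. The private helper `_hour_dir` is hypothetical.

```python
from datetime import datetime


def _hour_dir(yyyymmddhh):
    """Parse a yyyymmddhh stamp and return it as 'YYYY/MM/DD/HH'."""
    # strptime raises ValueError for input it cannot parse, e.g. non-numeric text
    stamp = datetime.strptime(yyyymmddhh, "%Y%m%d%H")
    return stamp.strftime("%Y/%m/%d/%H")


def getInputPath(yyyymmddhh):
    """yyyymmddhh to 'YYYY/MM/DD/HH/input'."""
    return _hour_dir(yyyymmddhh) + "/input"


def getOutputPath(yyyymmddhh):
    """yyyymmddhh to 'YYYY/MM/DD/HH/output'."""
    return _hour_dir(yyyymmddhh) + "/output"
```

Going through `datetime` instead of slicing the string means an unparseable argument fails before any Pig job is launched.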
-
Re: Input and output path
Ruslan Al-Fakikh 2012-09-13, 21:04
MiaoMiao, Mohit,

If we are talking about embedding Pig into Python, I'd like to add that we can also embed Pig into Java using PigServer:
http://wiki.apache.org/pig/EmbeddedPig

MiaoMiao, what's the purpose of embedding here, given that we already have the parameter substitution feature? I guess Pig embedding is mostly suitable when we want to add IF/ELSE or LOOP functionality.

Thanks
-
Re: Input and output path
Aniket Mokashi 2012-09-15, 00:01
You can do something similar to this FAQ entry:
https://cwiki.apache.org/PIG/faq.html#FAQ-Q%253AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%253F

Get the input path from Pig and then substitute the values for date, hour, etc. You also have to override the getSchema method so that Pig gets to see these fields. Just beware of https://issues.apache.org/jira/browse/PIG-2462

Thanks,
Aniket
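The loader described above would be written in Java against Pig's loader API. The sketch below only illustrates, in Python, the path-parsing step such a loader would need. The regular expression assumes the YYYY/MM/DD/HH/input layout from the original question, and `fields_from_path` is a hypothetical name.

```python
import re

# Assumed layout: .../YYYY/MM/DD/HH/input[/file]
_PATH_RE = re.compile(
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})/input(?:/|$)"
)


def fields_from_path(path):
    """Extract year, month, day and hour (as strings) from an input path."""
    match = _PATH_RE.search(path)
    if match is None:
        raise ValueError("no YYYY/MM/DD/HH/input segment in %r" % path)
    return match.groupdict()
```

The extracted values could then be exposed as extra fields on each record, or used to build the matching output path.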
-
Re: Input and output path
MiaoMiao 2012-09-17, 05:11
Ah, sorry, I missed your earlier reply. I used Python because it's more flexible and can generate the Pig script from XML files that describe all the fields in my input and output files. These XML files can also be applied to Hive.
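As a rough illustration of generating Pig from a field description, the sketch below renders a LOAD statement from XML. The XML layout (`<fields>` containing `<field name=... type=.../>`) is invented for this example, since the thread does not show the real files, and `load_statement` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical field description; the real XML layout is not shown in the thread.
FIELDS_XML = """
<fields>
  <field name="id" type="INT"/>
  <field name="value" type="INT"/>
</fields>
"""


def load_statement(alias, fields_xml, delimiter=","):
    """Render a Pig LOAD statement whose schema comes from the XML."""
    root = ET.fromstring(fields_xml)
    # One "name:TYPE" entry per <field> element, in document order
    schema = ", ".join(
        "%s:%s" % (field.get("name"), field.get("type"))
        for field in root.findall("field")
    )
    return "%s = LOAD '$input' USING PigStorage('%s') AS (%s);" % (
        alias, delimiter, schema
    )
```

Calling `load_statement("some", FIELDS_XML)` yields a statement equivalent to the LOAD in the earlier script, with `$input` left for Pig to bind.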