Kim Vogt 2011-02-03, 23:52
Hey,
I have a bunch of files where the filename is significant. I'm loading the files by supplying the top level directory that contains the files. Is there a way to capture the filename of the file and append to the tuple of data that's in that file?
-Kim
-
Re: Use Filename in Tuple
Dmitriy Ryaboy 2011-02-04, 03:49
In Pig 0.6, you can hook into bindTo() and save the file name.
In Pig 0.8 you have to find your way to the underlying InputSplit via PigSplit.getWrappedSplit(), cast it to FileSplit, and call getPath() on it, I think. I haven't done this myself.
This will totally break if you have splitCombination turned on, of course, as Pig can silently move to a different file under you, so you'd have to turn that off.
D
-
Re: Use Filename in Tuple
Kim Vogt 2011-02-04, 04:08
Thanks Dmitriy!
I'm using Pig 0.8 and, as far as I know, no splitCombination. I accept this challenge and will keep you pig'ites updated.
-Kim
-
Re: Use Filename in Tuple
Dexin Wang 2011-02-04, 04:32
Similarly, is it possible to insert literal values into a tuple stream?
For example, when I invoke my Pig script, I already know what the data source is (say, it's filename_2011-02-03), so I can pass it to Pig using -param. I want to insert this known file name into the tuple stream. How can I do that?
Example, I have:
grunt> A = LOAD 'aa' AS (f1, f2);
grunt> DUMP A;
(aa,bb)
(cc,dd)
I want to do something like:
grunt> B = FOREACH A GENERATE f1, "filename-2011-02-03";
Thanks.
-
Re: Use Filename in Tuple
Kim Vogt 2011-02-04, 05:40
This should work:
grunt> B = FOREACH A GENERATE f1, 'filename-2011-02-03';
or
grunt> B = FOREACH A GENERATE f1, '$paramName';
-Kim
-
Re: Use Filename in Tuple
Dexin Wang 2011-02-04, 05:43
Wow, I almost got it right. Double quotes fail; single quotes work.
Thanks.
-
Re: Use Filename in Tuple
Kim Vogt 2011-02-04, 05:53
And to include the filename in the tuple with the data, I copied PigStorage (I'm loading CSV), added a private PigSplit field (mSplit), set it in prepareToRead(), and added this code just before getNext() returns the tuple:

    if (mSplit != null) {
        // The wrapped split is a FileSplit when loading from files.
        FileSplit fs = (FileSplit) mSplit.getWrappedSplit();
        Path p = fs.getPath();
        // Append the full path of the current file as an extra field.
        mProtoTuple.add(p.toString());
    }

And it works! Thanks again :-)
-Kim
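
A note on the snippet above: p.toString() appends the full path of the split's file, not only the file name. If just the last path component is wanted, Hadoop's Path.getName() returns it directly. The sketch below shows the same idea with plain strings so that it runs without Pig or Hadoop on the classpath. SplitPaths, fileName and appendFileName are hypothetical names used only for illustration, and the List<Object> merely stands in for the loader's mProtoTuple field list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Pig or Hadoop. */
final class SplitPaths {
    private SplitPaths() {}

    /**
     * Returns everything after the last '/' in a path string.
     * Assumes the string points at a file (no trailing slash),
     * which holds for the path of a file split.
     */
    static String fileName(String fullPath) {
        int slash = fullPath.lastIndexOf('/');
        return slash < 0 ? fullPath : fullPath.substring(slash + 1);
    }

    /** Appends the file name as one extra field, mirroring mProtoTuple.add(...). */
    static void appendFileName(List<Object> fields, String fullPath) {
        if (fullPath != null) {
            fields.add(fileName(fullPath));
        }
    }

    public static void main(String[] args) {
        // Stand-in for the fields parsed from one input line.
        List<Object> fields = new ArrayList<>();
        fields.add("aa");
        fields.add("bb");
        appendFileName(fields, "hdfs://namenode:8020/data/logs/filename_2011-02-03.csv");
        System.out.println(fields);
    }
}
```

Inside a real loader, the equivalent change would be to add fs.getPath().getName() instead of p.toString(). Keeping the full path is the safer choice when files in different subdirectories can share a name.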
-
Re: Use Filename in Tuple
Dmitriy Ryaboy 2011-02-04, 06:11
There's a CSV loader in the piggybank that does proper CSV escaping, if you are interested.
-
Re: Use Filename in Tuple
Kim Vogt 2011-02-04, 19:29
I switched to using the CSVLoader in piggybank, and appended the filepath to the current RecordReader instead.
-Kim