Pig, mail # user - Working with an unknown number of values


Re: Working with an unknown number of values
Jacob Perkins 2011-05-08, 02:55
Dmitriy,
 
   I see your point. It would definitely be nice to have a builtin for
returning a bag though. I'd actually be happy if
TOBAG(FLATTEN(STRSPLIT(X,','))) worked.

--jacob
@thedatachef

On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
> FWIW -- the reason STRSPLIT returns a Tuple is that the more common
> case is thought to be splitting a string of a known format and trying
> to get some part of it.
>
> so, "foreach address_book generate STRSPLIT(phone_number, '-') as
> (area_code, top_3, bottom_4);"
>
> RegexExtractAll (whatever it's called these days) should return a bag, iirc.
>
> D
>
> On Fri, May 6, 2011 at 2:59 PM, jacob <[EMAIL PROTECTED]> wrote:
> > On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> >> >
> >> > > #1) Let's say you are tracking messages and extracting the hash tags from
> >> > > the message and storing them as one field (#hash1#hash2#hash3). This means
> >> > > you might have a line that looks something like the following:
> >> > >       2343    2011-05-06T03:04:00.000Z    username    some+message+goes+here#with+#hash+#tags    #with#hash#tags    some other info
> >> > >
> >> > > How can I get the # of tweets per hash tag? Also, how can I get the # of
> >> > > tweets per user per hash tag?
> >> > > I know I can use the STRSPLIT function to split on '#'. That will give me a
> >> > > bag of hash tags. How can I then group by these such that each hash tag has
> >> > > a set of tweets?
> >> > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> >> > the hashtag itself.
> >> >
> >>
> >> If each message has an unknown number of hashtags, will a 'FLATTEN' give me
> >> an unknown # of fields? If so, how do I know which field to group by? I
> >> don't want to group by messages that have the exact hash tags. I want all
> >> messages that have one of the hash tags.
> >
> > Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> > NOT a bag. If you could get a bag then you could do the following (I'm
> > throwing out some fields for now):
> >
> > A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> > B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS hashtag;
> > C = GROUP B BY hashtag;
> >
> > Then C will contain a key (the hashtag) and a bag containing all the
> > tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself;
> > it should do the same as STRSPLIT but return a bag instead of a tuple.
> >
> > ie.
> >
> > #foobar tweet text,#foobar
> > this tweet has #two #hashtags,#two#hashtags
> > another #foobar tweet,#foobar
> >
> > will yield:
> >
> > #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet, #foobar)}
> > #two,      {(this tweet has #two #hashtags, #two)}
> > #hashtags, {(this tweet has #two #hashtags, #hashtags)}
> >
> >
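For what it's worth, the splitting UDF described above does not have to be Java. Below is a minimal sketch in Python, assuming a Pig version with Jython UDF support, where a Python list of tuples is handed back to Pig as a bag. The names `split_hashtags` and `group_by_hashtag` are made up for illustration and are not from the thread. The second function is not a UDF; it only emulates FLATTEN followed by GROUP in plain Python so the expected shape of C can be checked locally.

```python
# Hypothetical stand-in for 'MySplittingUDF'; a sketch, not a tested Pig deployment.
try:
    outputSchema  # assumed to be provided by Pig when the file is registered as a Jython UDF
except NameError:
    def outputSchema(schema):
        # No-op replacement so the file also runs under plain Python.
        def decorator(func):
            return func
        return decorator


@outputSchema("hashtags:bag{t:tuple(hashtag:chararray)}")
def split_hashtags(field):
    """Split '#two#hashtags' into one single-field tuple per hashtag."""
    if field is None:
        return []
    # The piece before the first '#' is empty and gets skipped.
    return [("#" + tag.strip(),) for tag in field.split("#") if tag.strip()]


def group_by_hashtag(rows):
    """Plain-Python emulation of FLATTEN + GROUP BY hashtag over (text, hashtags) rows."""
    groups = {}
    for text, field in rows:
        for (tag,) in split_hashtags(field):  # FLATTEN: one output row per hashtag
            groups.setdefault(tag, []).append((text, tag))  # GROUP: collect rows per key
    return groups


if __name__ == "__main__":
    sample = [
        ("#foobar tweet text", "#foobar"),
        ("this tweet has #two #hashtags", "#two#hashtags"),
        ("another #foobar tweet", "#foobar"),
    ]
    for tag, tweets in group_by_hashtag(sample).items():
        print(tag, tweets)
```

Because every tweet becomes one row per hashtag before the grouping, the number of hashtags per message never has to be known up front; the grouping key is always the single `hashtag` field.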
> >>
> >>
> >> > >     But now I want to end up something like the following:
> >> >
> >> >
> >> > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3    1983
> >> > >
> >> > > If I knew the directives ahead of time, I know I can do something like the
> >> > > following:
> >> > >
> >> > > D = GROUP C BY date;
> >> > >
> >> > > E = FOREACH D {
> >> > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> >> > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> >> > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> >> > >      GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
> >> > >               'DIRECTIVE2', COUNT(DIRECTIVE2.date),
> >> > >               'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> >> > > }
> >> > >
> >> > > But how do I do this w/o having to hardcode the filters? Am I thinking about
> >> > > this all wrong?
> >> > >
> >> > It's really a matter of how you structure your data ahead of time.
> >> > Imagine the data looking like this instead (call it X):
> >> >
> >> > 201101,directive1
> >> > 201101,directive1
> >> > 201101,directive2
> >> > 201101,directive2