Pig >> mail # user >> Working with an unknown number of values


Christian 2011-05-06, 21:14
Xiaomeng Wan 2011-05-06, 21:29
jacob 2011-05-06, 21:29
Christian 2011-05-06, 21:38
jacob 2011-05-06, 21:59
Christian 2011-05-06, 22:06
jacob 2011-05-06, 22:13
Dmitriy Ryaboy 2011-05-08, 01:41

Re: Working with an unknown number of values
Dmitriy,
 
   I see your point. It would definitely be nice to have a builtin for
returning a bag though. I'd actually be happy if
TOBAG(FLATTEN(STRSPLIT(X,','))) worked.
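
Until something like that exists, the bag-returning splitter from the quoted discussion below could probably be written as a Jython UDF instead of a Java one. What follows is only a rough sketch under assumptions: the names split_hashtags, flatten_and_group and hashtag_udfs.py are made up for illustration, the no-op outputSchema fallback is there only so the file also runs outside Pig, and flatten_and_group merely imitates FLATTEN followed by GROUP in plain Python to show the shape of the result.

```python
# Hypothetical file name: hashtag_udfs.py
# Assumed usage from Pig (the alias "udfs" is arbitrary):
#   REGISTER 'hashtag_udfs.py' USING jython AS udfs;
#   A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
#   B = FOREACH A GENERATE text, FLATTEN(udfs.split_hashtags(hashtags)) AS hashtag;
#   C = GROUP B BY hashtag;

try:
    # Assumption: Pig's Jython engine has already defined this decorator
    # by the time the script is loaded via REGISTER.
    outputSchema
except NameError:
    def outputSchema(schema):
        """No-op stand-in so the file also runs under plain Python."""
        def wrap(func):
            return func
        return wrap


@outputSchema("hashtags:bag{t:tuple(hashtag:chararray)}")
def split_hashtags(text):
    """Split '#a#b#c' into a list of 1-tuples.

    A Python list of tuples is what a Jython UDF returns for a Pig bag.
    The leading '#' is put back on each tag to match the example output
    in the thread.
    """
    if text is None:
        return []
    return [("#" + part,) for part in text.split("#") if part]


def flatten_and_group(rows):
    """Imitate FOREACH ... FLATTEN(...) followed by GROUP ... BY hashtag."""
    groups = {}
    for text, hashtags in rows:
        # FLATTEN: one output row per element of the bag.
        for (tag,) in split_hashtags(hashtags):
            # GROUP BY: collect the flattened rows under their hashtag.
            groups.setdefault(tag, []).append((text, tag))
    return groups


# The three sample rows from the quoted example.
rows = [
    ("#foobar tweet text", "#foobar"),
    ("this tweet has #two #hashtags", "#two#hashtags"),
    ("another #foobar tweet", "#foobar"),
]
grouped = flatten_and_group(rows)

for tag in sorted(grouped):
    print(tag, grouped[tag])
```

Because every tweet ends up as one row per hashtag after the FLATTEN, there is only ever a single hashtag field to group by, however many tags a message carries.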

--jacob
@thedatachef

On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
> FWIW -- the reason STRSPLIT returns a Tuple is that the more common
> case is thought to be splitting a string of a known format and trying
> to get some part of it.
>
> so, "foreach address_book generate STRSPLIT(phone_number, '-') as
> (area_code, top_3, bottom_4);"
>
> RegexExtractAll (whatever it's called these days) should return a bag, iirc.
>
> D
>
> On Fri, May 6, 2011 at 2:59 PM, jacob <[EMAIL PROTECTED]> wrote:
> > On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> >> >
> >> > > #1) Let's say you are tracking messages and extracting the hash tags from
> >> > > the message and storing them as one field (#hash1#hash2#hash3). This means
> >> > > you might have a line that looks something like the following:
> >> > >       2343    2011-05-06T03:04:00.000Z    username    some+message+goes+here#with+#hash+#tags    #with#hash#tags    some other info
> >> > >
> >> > > How can I get the # of tweets per hash tag? Also, how can I get the # of
> >> > > tweets per user per hash tag?
> >> > > I know I can use the STRSPLIT function to split on '#'. That will give me a
> >> > > bag of hash tags. How can I then group by these such that each hash tag has
> >> > > a set of tweets?
> >> > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
> >> > the hashtag itself.
> >> >
> >>
> >> If each message has an unknown number of hashtags, will a 'FLATTEN' give me
> >> an unknown # of fields? If so, how do I know which field to group by? I
> >> don't want to group by messages that have the exact hash tags. I want all
> >> messages that have one of the hash tags.
> >
> > Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> > NOT a bag. If you could get a bag then you could do the following (I'm
> > throwing out some fields for now):
> >
> > A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> > B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS hashtag;
> > C = GROUP B BY hashtag;
> >
> > Then C will contain a key (the hashtag) and a bag containing all the
> > tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
> > so that it does the same as STRSPLIT but returns a bag instead.
> >
> > ie.
> >
> > #foobar tweet text,#foobar
> > this tweet has #two #hashtags,#two#hashtags
> > another #foobar tweet,#foobar
> >
> > will yield:
> >
> > #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet, #foobar)}
> > #two,      {(this tweet has #two #hashtags, #two)}
> > #hashtags, {(this tweet has #two #hashtags, #hashtags)}
> >
> >
> >>
> >>
> >> > >     But now I want to end up with something like the following:
> >> >
> >> >
> >> > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3    1983
> >> > >
> >> > > If I knew the directives ahead of time, I know I can do something like the
> >> > > following:
> >> > >
> >> > > D = GROUP C BY date;
> >> > >
> >> > > E = FOREACH D {
> >> > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
> >> > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
> >> > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
> >> > >      GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
> >> > >               COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
> >> > > }
> >> > >
> >> > > But how do I do this w/o having to hardcode the filters? Am I thinking about
> >> > > this all wrong?
> >> > >
> >> > It's really a matter of how you structure your data ahead of time.
> >> > Imagine the data looking like this instead (call it X):
> >> >
> >> > 201101,directive1
> >> > 201101,directive1
> >> > 201101,directive2
> >> > 201101,directive2
Alan Gates 2011-05-10, 21:27