trying to count all tuples
William Oberman 2011-06-03, 19:53
Howdy,
I'm coming from Cassandra, and I'm actually trying to count all columns in a column family. I believe that is similar to counting the number of tuples in a bag, in the lingo of the Pig manual. It was harder than I expected, but I think this works:

rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
counts = FOREACH rows GENERATE COUNT(columns);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;

My question is: am I right that it works? I started with 3 keys having a total of 5 columns and got (5). Then I added a new key/column, and another column on an existing key and got (7). So, it seems like it's working. But, was there a better way to write it?
Thanks!
will
Re: trying to count all tuples
Dmitriy Ryaboy 2011-06-03, 20:06
I am not sure what you mean by "count all columns". The code you have counts all *cells*. So:

id1: col1, col2
id2: col1, col2, col3

has 3 columns in a conventional sense, but your code will return 5. Is that what you want? If so, your code seems correct.
D
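
For anyone who wants the other reading (the number of distinct column names rather than the number of cells), here is a minimal sketch. It reuses the load statement and the placeholder keyspace and column family from the original post; the aliases names, distinct_names, grouped and column_count are made up for illustration. It has not been run against Cassandra, and it does not deal with null bags. For the two-row example above it should come to 3 rather than 5.

```pig
-- Same load as in the original post; MyKeyspace/MyColumnFamily are placeholders.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});

-- Project only the column names and emit one record per cell.
names = FOREACH rows GENERATE FLATTEN(columns.name) AS name;

-- Collapse names that repeat across rows.
distinct_names = DISTINCT names;

-- Group everything into a single bag and count the distinct names.
grouped = GROUP distinct_names ALL;
column_count = FOREACH grouped GENERATE COUNT(distinct_names);
DUMP column_count;
```

Note that COUNT skips tuples whose first field is null, so a null column name would not be counted here.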
Re: trying to count all tuples
William Oberman 2011-06-03, 20:09
That is exactly what I wanted, thanks for the confirm!
-- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) [EMAIL PROTECTED]
Re: trying to count all tuples
William Oberman 2011-06-07, 20:33
I tried this same script on data closer to production, and I'm getting errors. I'm 50% sure it's this:
https://issues.apache.org/jira/browse/PIG-1283

One of my rows in Cassandra has no columns (maybe?), which maybe causes a null bag, which causes COUNT to blow up (at least, that's my theory). As a workaround, can I have COUNT ignore/skip rows with null columns? I'll start digging through the docs as well.

will
Re: trying to count all tuples
William Oberman 2011-06-07, 20:58
I think FILTER will do the trick? E.g.

rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
filter_rows = FILTER rows BY columns is not null;
counts = FOREACH filter_rows GENERATE COUNT(columns);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;
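
A possible alternative to filtering, sketched here as an assumption rather than something anyone in the thread tested: keep every row and map a null bag to zero with a conditional expression (bincond). The field alias n is made up for illustration, and whether this avoids the error from PIG-1283 on a given Pig version is not confirmed anywhere in this thread.

```pig
-- Same load as above; MyKeyspace/MyColumnFamily are placeholders.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});

-- Treat a null bag as zero cells instead of dropping the row.
-- Both branches are longs, since COUNT returns a long.
counts = FOREACH rows GENERATE ((columns IS NULL) ? 0L : COUNT(columns)) AS n;

-- Sum the per-row counts into one total.
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM(counts.n);
DUMP sum_of_bag;
```

The total should match the FILTER version, since rows with a null bag contribute nothing either way; the difference is only that the rows stay in the relation.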
Re: trying to count all tuples
William Oberman 2011-06-08, 20:56
Just in case this ends up as someone else's answer someday, here is the working query on real data:

rows = LOAD 'cassandra://civicscience/observations' USING CassandraStorage();
filter_rows = FILTER rows BY $1 is not null;
counts = FOREACH filter_rows GENERATE COUNT($1);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;

For some reason typing the bag was causing me problems.
Re: trying to count all tuples
Dmitriy Ryaboy 2011-06-08, 21:31
Thanks for following through, William!

D