Pig, mail # user - trying to count all tuples


Re: trying to count all tuples
William Oberman 2011-06-08, 20:56
Just in case this ends up as someone else's answer someday, here is the
working query on real data:
rows = LOAD 'cassandra://civicscience/observations' USING CassandraStorage();
filter_rows = FILTER rows BY $1 is not null;
counts = FOREACH filter_rows GENERATE COUNT($1);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
dump sum_of_bag;

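For some reason, declaring the bag's schema (the AS clause on LOAD) was
causing me problems, so this version drops it and refers to the bag by
position ($1).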
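
To make the counting logic easier to check without a cluster, here is a minimal sketch in plain Python that models what the query above computes. It is not Pig and not how Pig executes the script. The function name `count_cells` and the sample rows are hypothetical, and a row with no columns is assumed to show up as a null (None) bag, as guessed earlier in the thread. Pig's COUNT also skips tuples whose first field is null, which this model ignores.

```python
# Plain-Python model of the Pig script; an illustration, not Pig itself.
# Each row is (key, bag), where the bag is a list of (name, value) tuples,
# or None when the row has no columns (the assumed null-bag case).

def count_cells(rows):
    # FILTER rows BY $1 is not null
    filter_rows = [row for row in rows if row[1] is not None]
    # FOREACH filter_rows GENERATE COUNT($1)
    counts = [len(bag) for _key, bag in filter_rows]
    # GROUP counts ALL, then SUM over the single resulting group
    return sum(counts)

# Hypothetical sample data, mirroring the id1/id2 example from the thread
# plus one row with a null bag.
sample_rows = [
    ("id1", [("col1", "a"), ("col2", "b")]),
    ("id2", [("col1", "c"), ("col2", "d"), ("col3", "e")]),
    ("id3", None),
]

total_cells = count_cells(sample_rows)
# Distinct column names, i.e. "columns in a conventional sense".
distinct_columns = len(
    {name for _key, bag in sample_rows if bag for name, _value in bag}
)
print(total_cells, distinct_columns)
```

The model counts cells (one per name/value pair across all rows), which is a different number from the count of distinct column names, and the row with the null bag contributes nothing instead of raising an error.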

On Tue, Jun 7, 2011 at 4:58 PM, William Oberman <[EMAIL PROTECTED]> wrote:

> I think FILTER will do the trick?  E.g.
>
>
> rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING
> CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
> filter_rows = FILTER rows BY columns is not null;
> counts = FOREACH filter_rows GENERATE COUNT(columns);
>
> counts_in_bag = GROUP counts ALL;
> sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
> dump sum_of_bag;
>
>
> On Tue, Jun 7, 2011 at 4:33 PM, William Oberman <[EMAIL PROTECTED]> wrote:
>
>> I tried this same script on closer to production data, and I'm getting
>> errors.  I'm 50% sure it's this:
>> https://issues.apache.org/jira/browse/PIG-1283
>>
>> One of my rows in cassandra has no columns (maybe?), which maybe causes a
>> null bag, which causes COUNT to blow up (at least, that's my theory).  As a
>> workaround, can I have COUNT ignore/skip rows with null columns?  I'll start
>> digging through the docs as well.
>>
>> will
>>
>>
>> On Fri, Jun 3, 2011 at 4:09 PM, William Oberman <[EMAIL PROTECTED]> wrote:
>>
>>> That is exactly what I wanted, thanks for confirming!
>>>
>>>
>>> On Fri, Jun 3, 2011 at 4:06 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>>>
>>>> I am not sure what you mean by "count all columns". The code you have
>>>> counts all *cells*.
>>>> So:
>>>> id1: col1, col2
>>>> id2: col1, col2, col3
>>>>
>>>> has 3 columns in a conventional sense, but your code will return 5. Is
>>>> that what you want? If so, your code seems correct.
>>>>
>>>> D
>>>>
>>>> On Fri, Jun 3, 2011 at 12:53 PM, William Oberman
>>>> <[EMAIL PROTECTED]> wrote:
>>>> > Howdy,
>>>> >
>>>> > I'm coming from Cassandra, and I'm trying to count all columns in a
>>>> > column family.  I believe that is similar to counting the number of
>>>> > tuples in a bag, in the lingo of the Pig manual.  It was harder than I
>>>> > expected, but I think this works:
>>>> > rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
>>>> > AS (key, columns: bag {T: tuple(name, value)});
>>>> > counts = FOREACH rows GENERATE COUNT(columns);
>>>> > counts_in_bag = GROUP counts ALL;
>>>> > sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
>>>> > dump sum_of_bag;
>>>> >
>>>> > My question is: am I right that it works?  I started with 3 keys having a
>>>> > total of 5 columns and got (5).  Then I added a new key/column, and another
>>>> > column on an existing key, and got (7).  So it seems like it's working.
>>>> > But was there a better way to write it?
>>>> >
>>>> > Thanks!
>>>> >
>>>> > will
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Will Oberman
>>> Civic Science, Inc.
>>> 3030 Penn Avenue., First Floor
>>> Pittsburgh, PA 15201
>>> (M) 412-480-7835
>>> (E) [EMAIL PROTECTED]
>>>
>>
>>
>>
>>
>
>
>
>

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) [EMAIL PROTECTED]