Pig >> mail # user >> trying to count all tuples


William Oberman 2011-06-03, 19:53
Dmitriy Ryaboy 2011-06-03, 20:06
William Oberman 2011-06-03, 20:09
Re: trying to count all tuples
I tried this same script on closer to production data, and I'm getting
errors.  I'm 50% sure it's this:
https://issues.apache.org/jira/browse/PIG-1283

One of my rows in Cassandra may have no columns, which might produce a
null bag, which in turn makes COUNT blow up (at least, that's my theory).  As a
workaround, can I have COUNT ignore/skip rows with null columns?  I'll start
digging through the docs as well.
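
A minimal sketch of that kind of workaround, assuming the failure really does come from rows whose columns bag is null (that is only the theory above, not something confirmed in this thread). It reuses the script quoted below and adds a single FILTER step. The alias non_null_rows is made up for illustration, and the script has not been run against a real cluster.

```pig
-- Same load as in the script quoted below.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
    AS (key, columns: bag {T: tuple(name, value)});

-- Assumption: the error is triggered by a null bag, so drop those rows
-- before COUNT ever sees them. non_null_rows is a hypothetical alias.
non_null_rows = FILTER rows BY columns IS NOT NULL;

-- Unchanged from here on: count cells per row, then sum over all rows.
counts = FOREACH non_null_rows GENERATE COUNT(columns);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
DUMP sum_of_bag;
```

Dropping the rows does not change the total, since a row with no columns would contribute nothing to the sum. If the rows need to stay in the relation, a conditional such as (columns IS NULL ? 0L : COUNT(columns)) inside the FOREACH might serve the same purpose.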

will

On Fri, Jun 3, 2011 at 4:09 PM, William Oberman <[EMAIL PROTECTED]> wrote:

> That is exactly what I wanted, thanks for the confirm!
>
>
> On Fri, Jun 3, 2011 at 4:06 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>
>> I am not sure what you mean by "count all columns". The code you have
>> counts all *cells*.
>> So:
>> id1: col1, col2
>> id2: col1, col2, col3
>>
>> has 3 columns in a conventional sense, but your code will return 5. Is
>> that what you want? If so, your code seems correct.
>>
>> D
>>
>> On Fri, Jun 3, 2011 at 12:53 PM, William Oberman
>> <[EMAIL PROTECTED]> wrote:
>> > Howdy,
>> >
>> > I'm coming from Cassandra, and I'm actually trying to count all columns in a
>> > column family.  I believe that is similar to counting the number of tuples in a
>> > bag, in the lingo of the Pig manual.  It was harder than I expected, but I
>> > think this works:
>> > rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
>> >     AS (key, columns: bag {T: tuple(name, value)});
>> > counts = FOREACH rows GENERATE COUNT(columns);
>> > counts_in_bag = GROUP counts ALL;
>> > sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
>> > dump sum_of_bag;
>> >
>> > My question is: am I right that it works?  I started with 3 keys having a
>> > total of 5 columns and got (5).  Then I added a new key/column, and another
>> > column on an existing key and got (7).  So, it seems like it's working.
>> > But, was there a better way to write it?
>> >
>> > Thanks!
>> >
>> > will
>> >
>>
>
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) [EMAIL PROTECTED]
>

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) [EMAIL PROTECTED]
William Oberman 2011-06-07, 20:58
William Oberman 2011-06-08, 20:56
Dmitriy Ryaboy 2011-06-08, 21:31