Pig >> mail # user >> JOIN not printing properly


Re: JOIN not printing properly
Have you taken a look at Pygmalion
(http://github.com/jeromatron/pygmalion)? It makes it MUCH easier to
work with tabular data from Cassandra, like you're trying to do.

For example:

what_cassandrastorage_should_really_produce = FOREACH rows GENERATE
    key AS key,
    FromCassandraBag('url,cache_hit', columns) AS (url:chararray, cache_hit:chararray);

DUMP what_cassandrastorage_should_really_produce;

(key1, http://www.google.com, hit)
(key2, http://www.google.com, miss)

Does that work for your use case?
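
If pulling in Pygmalion is not an option, here is a rough sketch of the same
pivot in plain Pig, untested against your data. It assumes the LOAD statement
and sample rows from your message; the alias G is just a name picked for
illustration. Both fields are selected inside one nested FOREACH, so the JOIN
goes away entirely:

```pig
-- Assumes CassandraStorage is registered and defined as in the original script.
rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});

-- One record per (key, name, value).
A = FOREACH rows GENERATE key, FLATTEN(columns) AS (name, value);

-- Collect all columns that share a row key into one bag.
B = GROUP A BY key;

-- Select both fields in a single nested FOREACH, so no JOIN is required.
-- Caveat: FLATTEN of an empty bag drops the row, so a key that is missing
-- either column will not appear in the output.
G = FOREACH B {
    u  = FILTER A BY name == 'url';
    u2 = FILTER A BY name == 'cache_hit';
    GENERATE group AS key,
             FLATTEN(u.value)  AS url,
             FLATTEN(u2.value) AS cache_hit;
};

DUMP G;
```

With one url and one cache_hit column per key, this should yield a single
(key, url, cache_hit) tuple per key, which you can then GROUP and COUNT.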

--jacob
@thedatachef
On Fri, 2011-11-04 at 08:51 -0400, AD wrote:
> Hello,
>
> I am pulling data from Cassandra into Pig, which means each row ends up as
> key, bag { (name,value),(name,value) }. The data is log files, so each column
> is a field from a server log file (like Apache). I have the following Pig
> script to combine 2 fields and count them, but the GENERATE after the JOIN is
> not printing the right field. Is there an easier way to solve this, and does
> anyone know why the join output is broken?
>
> rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage() AS (key,
> columns: bag {T: tuple(name, value)});
>
>  A = FOREACH rows GENERATE $0, FLATTEN($1); -- FLATTEN
> *(key1,url,http://www.google.com)*
> *(key1,cache_hit,hit)*
> *(key2,url,http://www.google.com)*
> *(key2,cache_hit,miss)*
>
>  B = GROUP A BY key; -- Combine url and cache_hit into one record
> *(key1,{(key1,url,http://www.google.com),(key1,cache_hit,hit)})*
> *(key2,{(key2,url,http://www.google.com),(key2,cache_hit,miss)})*
>
>  -- Create 2 lists and then JOIN them
>
>  C = FOREACH B {
>  u = FILTER A by name == 'url';
>  GENERATE FLATTEN(u.(key,value)) ;
>  }
> * (key1,http://www.google.com)*
> * (key2,http://www.google.com)*
>
>  D = FOREACH B {
>  u2 = FILTER A by name == 'cache_hit';
>  GENERATE FLATTEN(u2.(key,value));
>  }
>  *(key1,hit)*
> * (key2,miss)*
>
>  E = join C by key, D by key ;
> *(key1,http://www.google.com,key1,hit)*
> *(key2,http://www.google.com,key2,miss)*
>
> describe E ;
> E: {C::u::key: chararray,C::u::value: chararray,D::u2::key: chararray,D::u2::value: chararray}
>
> F = FOREACH E GENERATE C::u::value, D::u2::value ;
>
> *dump F ;*
> *(http://www.google.com,http://www.google.com)*  ?? Why not www.google.com, hit ????
> *(http://www.google.com,http://www.google.com)*
>
> Any help appreciated.
> AD