Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> prep for cassandra storage from pig


Copy link to this message
-
prep for cassandra storage from pig
I think I'm stuck on typing issues trying to store data in cassandra.  To
verify, cassandra wants (key, {tuples})

My pig script is fairly brief:
raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});
--colums == timeUUID -> JSON
rows = FOREACH raw GENERATE key, FLATTEN(columns);
alias_target_day = FOREACH rows {
    --I wrote a specialized parser that does exactly what I need
    observation_map = com.civicscience.pig.ParseObservation($2);
    GENERATE $0 as alias, observation_map#'_fqt' as target,
observation_map#'_day' as day;
};
grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
COUNT($1)) as day_count;

This gets me:
(targetA, (day1, count))
(targetA, (day2, count))
(targetB, (day1, count))
....

But, cassandra wants the 2nd item to be a bag.  So, I tried:
X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1,
COUNT($1))) as day_count;

But this results in:
(targetA, {((day1, count))})
(targetA, {((day2, count))})
(targetB, {((day1, count))})
It's hard to see, but the 2nd item now has a nested tuple as the first
value, which is still bad.

How to I get (key, {tuple})???  I wasn't sure where to post this (pig or
cassandra), so I'm posting to the pig list too.

will
+
William Oberman 2011-06-15, 19:07
+
Jeremy Hanna 2011-06-15, 19:04
+
William Oberman 2011-06-15, 19:08
+
William Oberman 2011-06-15, 19:10
+
Jeremy Hanna 2011-06-15, 19:25
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB