Pig >> mail # user >> prep for cassandra storage from pig


William Oberman 2011-06-15, 19:08
Re: prep for cassandra storage from pig
I'll do a reply all, to keep this more consistent (sorry!).

Rather than staying stuck, I wrote a custom function: TupleToBagOfTuple. I'm
curious if I could have avoided it with proper pig scripting though.
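
A possible plain-Pig route, sketched here as an untested assumption rather than a confirmed fix: keep the day and the count as separate top-level fields, group a second time by target, and project those two fields out of the grouped bag. The aliases counts, by_target, and result are hypothetical names; grouping is the relation from the quoted script below.

```pig
-- Keep day and count as flat fields instead of wrapping them in TOTUPLE.
counts = FOREACH grouping GENERATE group.$0 AS target, group.$1 AS day, COUNT($1) AS day_count;

-- Regroup so that each target becomes one row.
by_target = GROUP counts BY target;

-- Projecting two fields out of the bag should give (target, {(day, count), ...}).
result = FOREACH by_target GENERATE group AS target, counts.(day, day_count) AS columns;
```

Note that this produces one row per target holding every (day, count) pair, not one row per (target, day), which seems to match the (key, {tuples}) shape described below.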

On Wed, Jun 15, 2011 at 3:08 PM, William Oberman
<[EMAIL PROTECTED]> wrote:

> My problem is the column names are dynamic (a date), and pygmalion seems to
> want the column names to be fixed at "compile time" (the script).
>
>
> On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna <[EMAIL PROTECTED]> wrote:
>
>> Hi Will,
>>
>> That's partly why I like to use FromCassandraBag and ToCassandraBag from
>> pygmalion - it does the work for you to get it back into a form that
>> cassandra understands.
>>
>> Others may know better how to massage the data into that form using just
>> pig, but if all else fails, you could write a udf to do that.
>>
>> Jeremy
>>
>> On Jun 15, 2011, at 1:17 PM, William Oberman wrote:
>>
>> > I think I'm stuck on typing issues trying to store data in cassandra.
>> > To verify, cassandra wants (key, {tuples})
>> >
>> > My pig script is fairly brief:
>> > raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
>> > --columns == timeUUID -> JSON
>> > rows = FOREACH raw GENERATE key, FLATTEN(columns);
>> > alias_target_day = FOREACH rows {
>> >     --I wrote a specialized parser that does exactly what I need
>> >     observation_map = com.civicscience.pig.ParseObservation($2);
>> >     GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as day;
>> > };
>> > grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
>> > X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count;
>> >
>> > This gets me:
>> > (targetA, (day1, count))
>> > (targetA, (day2, count))
>> > (targetB, (day1, count))
>> > ....
>> >
>> > But, cassandra wants the 2nd item to be a bag.  So, I tried:
>> > X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count;
>> >
>> > But this results in:
>> > (targetA, {((day1, count))})
>> > (targetA, {((day2, count))})
>> > (targetB, {((day1, count))})
>> > It's hard to see, but the 2nd item now has a nested tuple as the first
>> > value, which is still bad.
>> >
>> > How do I get (key, {tuple})?  I wasn't sure where to post this (pig or
>> > cassandra), so I'm posting to the pig list too.
>> >
>> > will
>>
>>
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) [EMAIL PROTECTED]
>

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) [EMAIL PROTECTED]