Pig, mail # user - FOREACH nested block aliases and output schema field names


FOREACH nested block aliases and output schema field names
Andy Schlaikjer 2012-03-15, 22:09
Hey all,

I have the following FOREACH with nested block:

```
node_in = FOREACH (GROUP edge BY destination_id) {
  in_degree = COUNT(edge);
  in_edges = edge.(source_id, weight);
  in_edges_sorted = ORDER in_edges BY weight DESC;
  in_indices = SomeUDF(in_edges_sorted.source_id);
  in_weights = AnotherUDF(in_edges_sorted.weight);
  GENERATE group AS id, in_degree, in_indices, in_weights;
}
```

It seems that Pig doesn't reuse the aliases defined within the block
(e.g. "in_degree") when reporting the resulting schema for node_in.
This means fields $1 onward have no names and must be referenced by
position in subsequent statements (d'oh).
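
To make that concrete, here is a rough sketch of what downstream statements end up looking like. The alias node_in_named is hypothetical and only used for illustration; the field order is assumed to match the GENERATE above.

```
-- Inspect the schema Pig reports for node_in
DESCRIBE node_in;

-- Only "id" can be referenced by name; the rest by position.
-- node_in_named is a hypothetical alias for this sketch.
node_in_named = FOREACH node_in GENERATE
  id,
  $1 AS in_degree,
  $2 AS in_indices,
  $3 AS in_weights;
```

Renaming after the fact like this works, but it costs an extra statement and silently breaks if the order of the GENERATE fields changes.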

One simple, though verbose, work-around is:

```
node_in = FOREACH (GROUP edge BY destination_id) {
  ...
  GENERATE group AS id, in_degree AS in_degree,
    in_indices AS in_indices, in_weights AS in_weights;
}
```

So, what would it take to get Pig to reuse the nested block aliases as
field names in the output schema?

Cheers,
Andy