Re: Concatenate strings within a group
At least, I am not aware of a Pig command that can do this. You can start
by grouping on 'id' and then try flattening the 'text' field. But then
you run into the issue that you have lost the sort order ('seg_no'),
which is required to construct a meaningful sentence. Here I think you need
a UDF that takes both 'seg_no' and 'text' and does the work.
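
A rough sketch of what such a UDF might look like, written in Python because Pig can register Jython UDFs. The function name `concat_ordered`, the schema string, and the fallback decorator are illustrative assumptions, not something tested against a particular Pig release. It assumes the bag reaches the function as a sequence of (seg_no, text) tuples with a non-null seg_no.

```python
# Hypothetical Jython UDF for Pig: joins the 'text' values of one group
# in 'seg_no' order. All names here are illustrative.

try:
    outputSchema  # assumed to be supplied by Pig's Jython script engine
except NameError:
    # Fallback so the file also loads under a plain Python interpreter.
    def outputSchema(schema):
        return lambda func: func


@outputSchema("text:chararray")
def concat_ordered(bag):
    """bag: iterable of (seg_no, text) tuples that share one id."""
    if bag is None:
        return None
    # Sort inside the UDF, since the bag's order is not relied upon here.
    ordered = sorted(bag, key=lambda t: int(t[0]))
    # Skip null text values and join the rest with single spaces.
    return " ".join(str(t[1]) for t in ordered if t[1] is not None)
```

In a Pig script this would presumably be registered with something like `REGISTER 'concat.py' USING jython AS myudfs;` (the file name and alias are made up) and applied to the bag of ('seg_no', 'text') pairs produced by grouping on 'id'.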

I can think of some convoluted processing, like concatenating the
'seg_no' and 'text' fields into one, then grouping on 'id', and then
sorting on the new concatenated field within the group. But once
you've done that, you will have to split the combined field back apart. And
doing all this might not help either. The main thing here is that, as far
as I know, you cannot impose a sort order on a bag, or while flattening a
group into one row. I would be interested to know if this is possible in
native Pig.
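
To make the combine, sort, and split idea concrete, here is a plain-Python illustration (not Pig code) of the logic. The function name, the tab separator, and the padding width are assumptions chosen for the example. One detail it brings out: the 'seg_no' part has to be zero-padded, otherwise the combined strings sort as text and segment 10 lands before segment 2.

```python
# Plain-Python illustration of the composite-key idea; not Pig code.
from collections import defaultdict

SEP = "\t"  # assumed separator that never occurs in 'text'


def concat_via_composite_key(rows, width=10):
    """rows: iterable of (id, seg_no, text). Returns {id: joined text}."""
    groups = defaultdict(list)
    for doc_id, seg_no, text in rows:
        # Zero-pad seg_no so that string order matches numeric order.
        groups[doc_id].append("%0*d%s%s" % (width, int(seg_no), SEP, text))
    result = {}
    for doc_id, combined in groups.items():
        # Sort on the combined field, then split the text back out.
        parts = [c.split(SEP, 1)[1] for c in sorted(combined)]
        result[doc_id] = " ".join(parts)
    return result
```

This only shows that the approach can produce the right order; it says nothing about whether the equivalent steps can be expressed in native Pig.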

On Sat, Jul 13, 2013 at 9:45 PM, Karthik Natarajan <

> Hi,
> I'm new to Pig. I have a file that contains the contents of documents. The
> problem is that the contents are not in one line of the file. The file is
> actually an export of a database table. Below is an example of the table:
> id seg_no  text
> -- -----  -----
> 1  0      This is
> 1  1      a
> 1  2      test for
> 1  3      Hello
> 1  4      World!
> 2  0      Test
> 2  1      number
> 2  2      two.
> How do I get an output like this:
> id  text
> --  ----
> 1   This is a test for Hello World!
> 2   Test number two.
> I can do this in SQL, but I want to try it using Hadoop and Pig. I'm not
> sure how to concatenate values of a column within a group. I'm wondering if
> Pig's built-in functions can handle this or if I have to create a UDF. I'm
> thinking I need to create a UDF, but am not sure how to go about this. Any
> help/advice would be appreciated.
> Thanks.