Re: dev How can I add a row number per input file to the data

Probably these can help:
http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for

I've never tried this, but you could probably group by tagsource and then
apply RANK.
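
If a Python UDF is preferred, here is a rough sketch of the per-file idea (untested on a real cluster). Instead of one global counter, it keeps one counter per source file. It assumes the data is loaded with the -tagsource option of PigStorage, so that each record carries its file name and that name can be passed to the UDF as the first argument. The function name number_line, its two-argument signature, and the schema string are hypothetical. It also assumes each file is read in order by a single map task, which will not hold for large files that are split across several mappers.

```python
from collections import defaultdict

# Pig is assumed to make the outputSchema decorator available to Jython UDFs.
# This fallback only exists so the file can also be run outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# One running counter per source file, instead of a single global counter.
_line_counts = defaultdict(int)

@outputSchema("t:(lnum:int, f1:chararray)")
def number_line(source, line):
    """Return (line number within `source`, line).

    `source` is assumed to be the file name added by -tagsource.
    """
    _line_counts[source] += 1
    return _line_counts[source], line
```

Since the counters live in the memory of a single task, the numbering is only as reliable as the assumption that one task sees a whole file from its first line.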

On Fri, Aug 16, 2013 at 6:17 AM, Leo <[EMAIL PROTECTED]> wrote:

> Hi, I want to add a row/line number to the data I read from multiple CSVs.
> However, I want the running number to reflect the line number *per input file*,
> not overall.
> I am happy to write a Python UDF for this. So far I have in the UDF:
>     --- Python file udf.py ---
>     lineNum = 0
>     @outputSchema("lnum:int, f1:chararray")
>     def makeData(line):
>         global lineNum
>         lineNum += 1
>         return lineNum, line.tostring()
> which is called from Pig:
>     --- Pig file use-udf.pig ---
>     register 'udf.py' using jython as udfs;
>     data = load 'datadir' using TextLoader() as line;
>     udfified = foreach data generate udfs.makeData(line);
>     dump udfified;
> This approach works, *but* the running number increases over multiple
> files in the directory "datadir". That is *not* what I want! I need the row
> number starting with 1 for each file in datadir. Maybe I can reset the
> lineNum variable per input file?
> Any idea how to achieve this? Either with plain Pig or with Python UDFs?
> Many thanks, Leo