Pig, mail # user - dev How can I add a row number per input file to the data


Leo 2013-08-16, 02:17
Re: dev How can I add a row number per input file to the data
Ruslan Al-Fakikh 2013-08-21, 15:03
Hi!

These might help:
http://pig.apache.org/docs/r0.11.1/basic.html#rank
http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for the
-tagsource option)

I've never tried this, but you could probably group by the tagsource field
and then apply RANK.

Ruslan
On Fri, Aug 16, 2013 at 6:17 AM, Leo <[EMAIL PROTECTED]> wrote:

> Hi, I want to add a row/line number to the data I read from multiple CSVs.
> However I want the running number to reflect the line number *per input file*,
> not overall.
>
> I am happy to write a Python UDF for this. So far I have in the UDF:
>
>     --- Python file udf.py ---
>     lineNum = 0
>
>     @outputSchema("lnum:int, f1:chararray")
>     def makeData(line):
>         global lineNum
>         lineNum += 1
>         return lineNum, line.tostring()
>
> which is called from Pig:
>
>     --- Pig file use-udf.pig ---
>     register 'udf.py' using jython as udfs;
>
>     data = load 'datadir' using TextLoader() as line;
>     udfified = foreach data generate udfs.makeData(line);
>
>     dump udfified;
>
> This approach works, *but* the running number increases over multiple
> files in the directory "datadir". That is *not* what I want! I need the row
> number starting with 1 for each file in datadir. Maybe I can reset the
> lineNum variable per input file?
>
> Any idea how to achieve this? Either with plain Pig or with Python UDFs?
>
> Many thanks, Leo
>
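
For what it's worth, the "reset lineNum per input file" idea from the question could be combined with the -tagsource option of PigStorage, which prepends the input file name to each record. The following is only an untested sketch, not something posted in the thread. The function name number_per_file, the field names, and the schema string are made up for illustration. It also assumes that every file is read from start to finish by a single map task (for example, files that are not split); otherwise the counter would restart in the middle of a file.

```python
# --- Python file udf.py (sketch) ---

# Pig's Jython engine provides the outputSchema decorator on its own.
# This fallback is an assumption added so the file also runs as plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# Module-level state: the file seen on the previous call and its running count.
_last_source = None
_line_num = 0

@outputSchema("t:(source:chararray, lnum:int, f1:chararray)")
def number_per_file(source, line):
    """Return (source, line number within that source, line).

    The counter restarts at 1 whenever the source file name changes,
    so records of one file must arrive consecutively.
    """
    global _last_source, _line_num
    if source != _last_source:
        _last_source = source
        _line_num = 0
    _line_num += 1
    return source, _line_num, line
```

On the Pig side this would presumably be fed from a relation loaded with PigStorage and -tagsource, with both fields declared as chararray, and called as udfs.number_per_file(source, line). The exact load statement depends on the delimiter in the CSV files, so it is left out here.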
Pradeep Gollakota 2013-08-21, 15:32