

Swaroop Sharma 2013-06-27, 12:23
RE: Using symlink in Pig to perform a join with the help of a python UDF
Hi,
Re:
"Also, I know that I can join the two files on *Channel_Number *and thenfilter for records whose *date_time *is between *start_date*_time and *end_date_time*. However, this process takes a long time as my actual datacontains ~10 million rows in file1 with additional fields and file2contains ~ 50,000 records (using mapside join)."
I believe you could make the join run faster by first grouping the rows in each relation by channel number, then joining the two relations on channel number.
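Something like the following might do it (an untested sketch: it assumes *File 2* has been loaded as a relation file2 with all fields as chararrays, ToDate needs Pig 0.11 or later, and the output path is a placeholder):

    -- collect the rows of both relations per channel number
    cg = cogroup file1 by channel_number, file2 by channelnumber;
    -- flattening both bags pairs every event with every programme on its channel
    pairs = foreach cg generate flatten(file1), flatten(file2);
    -- compare parsed datetimes rather than raw strings
    matched = filter pairs by
        ToDate(date_time, 'M/d/yyyy H:mm') >= ToDate(start_date_time, 'M/d/yyyy H:mm')
        and ToDate(date_time, 'M/d/yyyy H:mm') <= ToDate(end_date_time, 'M/d/yyyy H:mm');
    store matched into '/path/to/output' using PigStorage(',');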
  Rodrick
> Date: Thu, 27 Jun 2013 17:53:16 +0530
> Subject: Using symlink in Pig to perform a join with the help of a python UDF
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
>
> Hi,
>
> I have two files whose content are as follows:
>
> *File 1:* Field Names: eventid,clientid,channel_number,date_time
>
> 114,00001,5003,4/15/2013 11:10
>
> 114,00001,5003,4/15/2013 1:22
>
> 100,00001,5003,4/15/2013 23:08
>
> 114,00002,5002,4/16/2013 8:55
> 100,00002,5002,4/16/2013 8:15
>
> *File 2:* Field Names: ChannelNumber,ProgramID,Start_Date_Time,End_Date_Time
>
> 5002,112311,4/16/2013 8:00,4/16/2013 8:30
>
> 5002,124313,4/16/2013 8:30,4/16/2013 9:00
>
> 5003,113214,4/15/2013 23:00,4/15/2013 23:30
>
> 5003,123213,4/15/2013 1:00,4/15/2013 2:00
>
> 5003,123343,4/15/2013 10:30,4/15/2013 11:30
>
>
> I want to check whether the *channel_number* from *File 1* matches the
> *ChannelNumber* from *File 2* and whether the *date_time* from *File 1*
> falls between the *Start_Date_Time* and *End_Date_Time* from *File 2*.
>
> *Required Output:*
>
> 114,00001,5003,4/15/2013 11:10,5003,123343,4/15/2013 10:30,4/15/2013 11:30
> 114,00001,5003,4/15/2013 1:22,5003,123213,4/15/2013 1:00,4/15/2013 2:00
> 100,00001,5003,4/15/2013 23:08,5003,113214,4/15/2013 23:00,4/15/2013 23:30
> 114,00002,5002,4/16/2013 8:55,5002,124313,4/16/2013 8:30,4/16/2013 9:00
> 100,00002,5002,4/16/2013 8:15,5002,112311,4/16/2013 8:00,4/16/2013 8:30
>
>
> I tried to write a UDF to perform this action and it is as follows:
> *Python UDF:*
>
> from datetime import datetime
>
> def parse_dt(s):
>     # timestamps in both files look like 4/15/2013 11:10
>     return datetime.strptime(s, '%m/%d/%Y %H:%M')
>
> def myudf(lst):
>     output = []
>     # read the symlinked EPG file once, one CSV record per line
>     f = open("epg")
>     epg = [line.strip().split(',') for line in f]
>     f.close()
>     for item in lst:
>         for tup in epg:
>             # channel numbers must match and date_time must fall in the slot
>             if item[2] == tup[0] and parse_dt(tup[2]) <= parse_dt(item[3]) <= parse_dt(tup[3]):
>                 output.append(tuple(item) + tuple(tup))
>     return output
>
> I have created a symbolic link in Pig that points to the second file. The
> procedure I used to create the symlink is as follows:
>
>     pig -Dmapred.cache.files=hdfs://path/to/file/filename#epg
> -Dmapred.create.symlink=yes script.pig
>
> The *script.pig* file contains the Pig script that is executed when the
> above command is run.
> The Pig script is as follows:
>
> file1 = load '/user/swaroop.sharma/sampledata_for_udf_test/activitydata/'
> using PigStorage(',') as (eventid: chararray, clientid: chararray,
> channel_number: chararray, date_time: chararray);
>
>     register udf1.py using jython as test;
>
>     grp_file1 = group file1 by clientid;
>
>     finalfile = foreach grp_file1 generate group, test.myudf(file1) as
> file1: {(eventid: chararray, clientid: chararray, channel_number:
> chararray, date_time: chararray, channelnumber: chararray, programid:
> chararray, start_date_time: chararray, end_date_time: chararray)};
>
>     store finalfile into '/path/where/file/will/be/stored/' using
> PigStorage(',');
>
> My Python UDF registers successfully, and I have tested it in plain
> Python on the same data by creating two lists. I am also able to load the
> file successfully.
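> The local test was along these lines (a rough sketch; it assumes a local
> copy of *File 2* saved as a file named epg in the working directory):
>
>     rows = [('114', '00001', '5003', '4/15/2013 11:10'),
>             ('100', '00002', '5002', '4/16/2013 8:15')]
>     for matched in myudf(rows):
>         print(matched)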
>
> However, *finalfile* is not being created: an error occurs when the
> script runs.
>
> I have been trying to debug this for the past two days with no success. I
> am new to Pig and am currently stuck.
>
> Also, I know that I can join the two files on *Channel_Number* and then
> filter for records whose *date_time* is between *start_date_time* and
> *end_date_time*. However, this process takes a long time as my actual data
> contains ~10 million rows in file1 with additional fields and file2
> contains ~50,000 records (using a map-side join).