Pig >> mail # user >> Using symlink in Pig to perform a join with the help of a python UDF


RE: Using symlink in Pig to perform a join with the help of a python UDF
Hi,
Re:
"Also, I know that I can join the two files on *Channel_Number* and then filter for records whose *date_time* is between *start_date_time* and *end_date_time*. However, this process takes a long time as my actual data contains ~10 million rows in file1 with additional fields and file2 contains ~50,000 records (using mapside join)."
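Before tuning the join itself, it may help to sanity-check the matching rule locally in plain Python (a hypothetical sketch using only the sample rows from your mail; note that these M/D/YYYY strings compare lexicographically, so "1:22" sorts after "11:10", and the sketch therefore parses them into real timestamps first):

```python
from datetime import datetime

def parse_dt(s):
    # Timestamps in the sample look like "4/15/2013 11:10" (no zero padding),
    # so parse them rather than comparing the raw strings.
    return datetime.strptime(s.strip(), "%m/%d/%Y %H:%M")

# Sample rows copied from the mail: (eventid, clientid, channel, date_time)
events = [
    ("114", "00001", "5003", "4/15/2013 11:10"),
    ("114", "00001", "5003", "4/15/2013 1:22"),
    ("100", "00001", "5003", "4/15/2013 23:08"),
    ("114", "00002", "5002", "4/16/2013 8:55"),
    ("100", "00002", "5002", "4/16/2013 8:15"),
]
# (channel, programid, start_date_time, end_date_time)
epg = [
    ("5002", "112311", "4/16/2013 8:00", "4/16/2013 8:30"),
    ("5002", "124313", "4/16/2013 8:30", "4/16/2013 9:00"),
    ("5003", "113214", "4/15/2013 23:00", "4/15/2013 23:30"),
    ("5003", "123213", "4/15/2013 1:00", "4/15/2013 2:00"),
    ("5003", "123343", "4/15/2013 10:30", "4/15/2013 11:30"),
]

matches = []
for ev in events:
    t = parse_dt(ev[3])
    for prog in epg:
        # join on channel number, keep events inside the programme window
        if ev[2] == prog[0] and parse_dt(prog[2]) <= t <= parse_dt(prog[3]):
            matches.append(ev + prog)

# Each of the five sample events should match exactly one programme.
```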
I believe you could make the join run faster by first grouping the rows in each relation by channel number, then joining the two relations on channel number.
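In Pig that idea might look like the following (a minimal sketch only; paths and aliases are placeholders, the schemas are taken from your mail, and since the chararray timestamps compare lexicographically, converting them with ToDate() before the range filter would be more reliable):

```pig
file1 = LOAD '/path/to/activitydata' USING PigStorage(',')
    AS (eventid:chararray, clientid:chararray,
        channel_number:chararray, date_time:chararray);
file2 = LOAD '/path/to/epg' USING PigStorage(',')
    AS (channelnumber:chararray, programid:chararray,
        start_date_time:chararray, end_date_time:chararray);

-- one row per channel in each relation, so the join itself stays small
grp1 = GROUP file1 BY channel_number;
grp2 = GROUP file2 BY channelnumber;
joined = JOIN grp1 BY group, grp2 BY group;

-- flatten the two bags back out (cross product per channel), then keep
-- only the events that fall inside a programme window
pairs  = FOREACH joined GENERATE FLATTEN(grp1::file1), FLATTEN(grp2::file2);
result = FILTER pairs BY date_time >= start_date_time
                    AND date_time <= end_date_time;
```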
  Rodrick
> Date: Thu, 27 Jun 2013 17:53:16 +0530
> Subject: Using symlink in Pig to perform a join with the help of a python UDF
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
>
> Hi,
>
> I have two files whose content are as follows:
>
> *File 1:* Field Names: eventid,clientid,channel_number,date_time
>
> 114,00001,5003,4/15/2013 11:10
>
> 114,00001,5003,4/15/2013 1:22
>
> 100,00001,5003,4/15/2013 23:08
>
> 114,00002,5002,4/16/2013 8:55
> 100,00002,5002,4/16/2013 8:15
>
> *File 2:* Field Names: ChannelNumber,ProgramID,Start_Date_Time,End_Date_Time
>
> 5002,112311,4/16/2013 8:00,4/16/2013 8:30
>
> 5002,124313,4/16/2013 8:30,4/16/2013 9:00
>
> 5003,113214,4/15/2013 23:00,4/15/2013 23:30
>
> 5003,123213,4/15/2013 1:00,4/15/2013 2:00
>
> 5003,123343,4/15/2013 10:30,4/15/2013 11:30
>
>
> I want to check whether *channel_number* from *File 1* matches
> *ChannelNumber* from *File 2*, and whether *date_time* from *File 1* falls
> between *Start_Date_Time* and *End_Date_Time* from *File 2*.
>
> *Required Output:*
>
> 114,00001,5003,4/15/2013 11:10,5003,123343,4/15/2013 10:30,4/15/2013 11:30
> 114,00001,5003,4/15/2013 1:22,5003,123213,4/15/2013 1:00,4/15/2013 2:00
> 100,00001,5003,4/15/2013 23:08,5003,113214,4/15/2013 23:00,4/15/2013 23:30
> 114,00002,5002,4/16/2013 8:55,5002,124313,4/16/2013 8:30,4/16/2013 9:00
> 100,00002,5002,4/16/2013 8:15,5002,112311,4/16/2013 8:00,4/16/2013 8:30
>
>
> I tried to write a UDF to perform this action and it is as follows:
> *Python UDF:*
>
> from datetime import datetime
>
> def parse_dt(s):
>     # timestamps look like "4/15/2013 11:10"; comparing them as raw
>     # strings is lexicographic ("1:22" sorts after "11:10"), so parse them
>     return datetime.strptime(s.strip(), "%m/%d/%Y %H:%M")
>
> def myudf(lst):
>     output = []
>     f = open("epg")  # the distributed-cache symlink; read it once, up front
>     epg = [line.strip().split(',') for line in f]  # plain CSV: split, not eval
>     f.close()  # close before returning; code after "return" never runs
>     for item in lst:
>         t = parse_dt(item[3])
>         for tup in epg:
>             # join on channel number, keep events inside the programme window
>             if item[2] == tup[0] and parse_dt(tup[2]) <= t <= parse_dt(tup[3]):
>                 output.append(tuple(list(item) + tup))  # item, not the whole lst
>     return output
>
> I have created a symbolic link in Pig that points to the second file. The
> command I used to create the symlink is as follows:
>
>     pig -Dmapred.cache.files=hdfs://path/to/file/filename#epg
>     -Dmapred.create.symlink=yes script.pig
>
> The *script.pig* file contains the Pig script that is executed by the
> command above. The Pig script is as follows:
>
> file1 = load '/user/swaroop.sharma/sampledata_for_udf_test/activitydata/'
>     using PigStorage(',')
>     as (eventid: chararray, clientid: chararray,
>         channel_number: chararray, date_time: chararray);
>
> register udf1.py using jython as test;
>
> grp_file1 = group file1 by clientid;
>
> finalfile = foreach grp_file1 generate group, test.myudf(file1) as
>     file1: {(eventid: chararray, clientid: chararray,
>              channel_number: chararray, date_time: chararray,
>              channelnumber: chararray, programid: chararray,
>              start_date_time: chararray, end_date_time: chararray)};
>
> store finalfile into '/path/where/file/will/be/stored/' using PigStorage(',');
>
> My Python UDF registers successfully. I have also tested the UDF standalone
> in Python on the same data by creating the two lists by hand, and I am able
> to load the file successfully.
>
> However, the *finalfile* relation is not being created; an error occurs
> while the script runs.
>
> I have been trying to debug this for the past two days with no success. I
> am new to Pig and currently at a dead end.
>
> Also, I know that I can join the two files on *Channel_Number* and then
> filter for records whose *date_time* is between *start_date_time* and
> *end_date_time*. However, this process takes a long time as my actual data
> contains ~10 million rows in file1 with additional fields and file2
> contains ~50,000 records (using mapside join).