Pig, mail # user - Q on loading data from a directory


RE: Q on loading data from a directory
Olga Natkovich 2009-05-21, 17:44
Ricky,

You are right - the loader does not contain sufficient information to
implement this feature. You will need to build a custom slicer for this.

You can see how to create a slicer in
http://hadoop.apache.org/pig/docs/r0.2.0/udf.html#Advanced+Topics

Olga
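
For readers who only want to see the shape of the result Ricky asks for below, here is a minimal plain-Java sketch that pairs every line with the name of the file it came from. It deliberately does not use Pig's slicer or loader interfaces; the class name FileTaggedLines and the method load are hypothetical names chosen for illustration, and a real slicer would have to follow the Pig documentation linked above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only: plain Java, not a Pig LoadFunc or Slicer.
public class FileTaggedLines {

    /** Returns one {fileName, line} pair per line of every regular file in dir matching glob. */
    public static List<String[]> load(Path dir, String glob) throws IOException {
        // Sort the matching files so the output order is deterministic.
        TreeSet<Path> files = new TreeSet<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        List<String[]> tuples = new ArrayList<>();
        for (Path file : files) {
            String name = file.getFileName().toString();
            for (String line : Files.readAllLines(file)) {
                tuples.add(new String[] { name, line });
            }
        }
        return tuples;
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway directory so the demo is self-contained.
        Path dir = Files.createTempDirectory("mydir");
        Files.write(dir.resolve("fileA.txt"), Arrays.asList("line1 of fileA", "line2 of fileA"));
        Files.write(dir.resolve("fileB.txt"), Arrays.asList("line1 of fileB", "line2 of fileB"));
        for (String[] t : load(dir, "*.txt")) {
            System.out.println("(" + t[0] + ", (" + t[1] + "))");
        }
    }
}
```

The point of the sketch is only that the file name has to be captured at the place where the input is split into files, which is why the loader alone is not enough.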

> -----Original Message-----
> From: Ricky Ho [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, May 21, 2009 9:41 AM
> To: [EMAIL PROTECTED]
> Subject: RE: Q on loading data from a directory
>
> What I am looking for is to have the filename appear in the tuple.
>
> Using Marshall's suggested script, I can load all files of a
> directory, but unfortunately the filename does not appear in
> the tuple.
>
> grunt> myData = LOAD 'mydir/*.txt' AS (id, data); dump myData;
> (line1 of fileA)
> (line2 of fileA)
> (line1 of fileB)
> (line2 of fileB)
>
> But the following is what I want ...
>
> grunt> myData = LOAD 'mydir/*.txt' USING MagicLoader() AS (id, data);
> grunt> dump myData;
> (fileA.txt, (line1 of fileA))
> (fileA.txt, (line2 of fileA))
> (fileB.txt, (line1 of fileB))
> (fileB.txt, (line2 of fileB))
>
> I am trying to implement a number of classic text-processing
> algorithms in Pig to showcase how easily this can be done in a
> higher-level language than Hadoop Java. Currently I am stuck on
> algorithms such as "TF/IDF" and "Inverted Index", which require
> the name of the file each tuple came from.
>
> I think "inverted index" is a very common Map/Reduce use
> case. I am quite surprised that Pig doesn't have a mechanism
> to determine the file a tuple comes from.
>
> Rgds,
> Ricky
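
To make the dependency on the file name concrete, here is a minimal plain-Java sketch of an inverted index built from (fileName, line) pairs like the ones in the desired output above. The class name InvertedIndexSketch, the method build, and the simple split-on-non-word-characters tokenizer are assumptions made for illustration; this is not Pig code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: an in-memory inverted index, not a Pig script.
public class InvertedIndexSketch {

    /** Maps each lower-cased term to the sorted set of file names that contain it. */
    public static Map<String, Set<String>> build(List<String[]> tuples) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String[] t : tuples) {
            String fileName = t[0];
            // Naive tokenizer (an assumption): split on runs of non-word characters.
            for (String term : t[1].toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue; // a leading separator produces an empty first token
                }
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(fileName);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> tuples = Arrays.asList(
            new String[] { "fileA.txt", "the quick fox" },
            new String[] { "fileB.txt", "the lazy dog" });
        System.out.println(build(tuples));
    }
}
```

Without the first field of each pair there is nothing to put on the right-hand side of the index, which is why the file name has to travel with the tuple.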
>
> ======================================================================
> Ricky,
>
> Sorry for misunderstanding your question; what you need is
> only the file name. I think you can implement it in a way
> similar to what I proposed in my last email.
>
> But I still believe a custom loader that loads only specified
> files is needed.
>
>
> -----Original Message-----
> From: zjffdu [mailto:[EMAIL PROTECTED]]
> Sent: 2009-05-21 21:40
> To: '[EMAIL PROTECTED]'; 'Olga Natkovich'
> Subject: RE: Unable to get individual filename
>
> I also think it would be useful to provide another kind of
> PigStorage that loads only some of the files in a folder, such
> as all the .txt files in the root folder.
>
> I think the best place to do this is here
> (POLoad.java, line 92):
>
>     public void setUp() throws IOException{
>         String filename = lFile.getFileName();
>         loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
>
>         // This is the place where I can control which files are loaded;
>         // the default is to load all of them.
>         is = FileLocalizer.open(filename, pc);
>
>         loader.bindTo(filename, new BufferedPositionedInputStream(is), 0, Long.MAX_VALUE);
>     }
>
>
> In my opinion, I can provide a different "is" depending on the
> FuncSpec the Pig script provides. This is what I'd like to
> change the code to:
>
>     public void setUp() throws IOException{
>         String filename = lFile.getFileName();
>         loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
>
>         if (lFile.getFuncSpec().getClassName().equals(ExtPigStorage.class.getName())) {
>             // Proposed: pass the constructor args (the extensions) to a new
>             // three-argument FileLocalizer.open() that is part of this proposal.
>             String[] ext = lFile.getFuncSpec().getCtorArgs();
>             is = FileLocalizer.open(filename, pc, ext);
>         } else {
>             is = FileLocalizer.open(filename, pc);
>         }
>         loader.bindTo(filename, new BufferedPositionedInputStream(is), 0, Long.MAX_VALUE);
>     }
>
>
>
> I can create a subclass of PigStorage called ExtPigStorage.
>
> All I need to do is provide a different kind of "is" that
> controls which files are loaded.
>
> Olga, what do you think about my proposal?
>
> If you feel it's OK, I can create a JIRA item and submit a patch.
>
>
> Thank you.
>
>
> Jeff Zhang
>
>
>
>
> ---