RE: Q on loading data from a directory
Ricky,

You are right - the loader does not contain sufficient information to
implement this feature. You will need to build a custom slicer for this.

You can see how to create a slicer in
http://hadoop.apache.org/pig/docs/r0.2.0/udf.html#Advanced+Topics

Olga
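
To make the request in the quoted message below concrete, here is a minimal sketch in plain Java of the record shape Ricky is asking for: every line paired with the name of the file it came from. It is not a Pig slicer or loader and uses no Pig classes; the class name FileTaggedLines, the nested Record type and the load method are hypothetical names chosen for illustration only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Pig: pairs each line with its source file name.
final class FileTaggedLines {

    // One output record, mirroring the (filename, (line)) shape from the thread.
    static final class Record {
        final String fileName;
        final String line;

        Record(String fileName, String line) {
            this.fileName = fileName;
            this.line = line;
        }

        @Override
        public String toString() {
            return "(" + fileName + ", (" + line + "))";
        }
    }

    // Reads every regular file in dir whose name matches the glob (e.g. "*.txt")
    // and returns one Record per line, tagged with the file name.
    static List<Record> load(Path dir, String glob) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        // Directory iteration order is unspecified, so sort for stable output.
        Collections.sort(files);

        List<Record> out = new ArrayList<>();
        for (Path file : files) {
            String name = file.getFileName().toString();
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                out.add(new Record(name, line));
            }
        }
        return out;
    }
}
```

A custom slicer would have to do the equivalent per slice: it knows which file it is reading, so it can put that name into each tuple it emits, which the loader alone cannot do.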

> -----Original Message-----
> From: Ricky Ho [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, May 21, 2009 9:41 AM
> To: [EMAIL PROTECTED]
> Subject: RE: Q on loading data from a directory
>
> What I am looking for is to have the "filename" appear in the tuple.
>
> Using Marshall's suggested script, I can load all files of a
> directory, but unfortunately the filename does not appear in
> the tuple.
>
> grunt> myData = LOAD 'mydir/*.txt' AS (id, data); dump myData;
> (line1 of fileA)
> (line2 of fileA)
> (line1 of fileB)
> (line2 of fileB)
>
> But the following is what I want ...
>
> grunt> myData = LOAD 'mydir/*.txt' USING MagicLoader() AS (id, data);
> grunt> dump myData;
> (fileA.txt, (line1 of fileA))
> (fileA.txt, (line2 of fileA))
> (fileB.txt, (line1 of fileB))
> (fileB.txt, (line2 of fileB))
>
> I am trying to implement a number of classical text
> processing algorithms using Pig to showcase how easily this can
> be done in a higher-level language than Hadoop Java.
> Currently I am stuck on algorithms such as "TF/IDF" and
> "Inverted Index", which require the name of the file.
>
> I think "inverted index" is a very common Map/Reduce use
> case.  I am quite surprised that Pig doesn't have a mechanism
> to determine the file a tuple came from.
>
> Rgds,
> Ricky
>
> ======================================================================
> Ricky,
>
> Sorry for misunderstanding your question; what you need is
> only the file name. I think you can implement that in a way
> similar to what I proposed in my last email.
>
> But I still believe a custom loader that loads only specified
> files is needed.
>
>
> -----Original Message-----
> From: zjffdu [mailto:[EMAIL PROTECTED]]
> Sent: 2009-05-21 21:40
> To: '[EMAIL PROTECTED]'; 'Olga Natkovich'
> Subject: RE: Unable to get individual filename
>
> I also think it would be useful to provide another kind of
> PigStorage that loads only some of the files in a folder, such
> as all the .txt files in the root folder.
>
> I think the best place to do this is here
> (POLoad.java, line 92):
>
>     public void setUp() throws IOException {
>         String filename = lFile.getFileName();
>         loader = (LoadFunc) PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
>
>         // This is the place where I can control which files get loaded;
>         // the default is to load all the files.
>         is = FileLocalizer.open(filename, pc);
>
>         loader.bindTo(filename, new BufferedPositionedInputStream(is), 0, Long.MAX_VALUE);
>     }
>
>
> In my opinion, I can provide a different "is" depending on the
> FuncSpec the Pig script provides. This is what I'd like to
> change the code snippet to:
>
>     public void setUp() throws IOException {
>         String filename = lFile.getFileName();
>         loader = (LoadFunc) PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
>
>         if (lFile.getFuncSpec().getClassName().equals(ExtPigStorage.class.getName())) {
>             // Proposed: pass the extensions given as constructor arguments to a
>             // new three-argument FileLocalizer.open (part of this proposal).
>             String[] ext = lFile.getFuncSpec().getCtorArgs();
>             is = FileLocalizer.open(filename, pc, ext);
>         } else {
>             is = FileLocalizer.open(filename, pc);
>         }
>         loader.bindTo(filename, new BufferedPositionedInputStream(is), 0, Long.MAX_VALUE);
>     }
>
>
>
> I can create a subclass of PigStorage called ExtPigStorage.
>
> All I need to do is provide a different kind of "is" that
> controls which files are loaded.
>
> Olga, what do you think about my proposal?
>
> If you feel it's OK, I can create a JIRA issue and submit a patch.
>
>
> Thank you.
>
>
> Jeff Zhang
>
>
>
>
> ---
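
The file selection that Jeff's proposal hangs on FileLocalizer.open can be sketched independently of Pig. The following is only an illustration of the extension check, assuming the constructor arguments are plain suffixes such as ".txt"; the class ExtensionFilter and its methods are hypothetical and do not exist in Pig.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: decides which files a loader should open.
final class ExtensionFilter {

    // Returns true if fileName ends with one of the given extensions.
    // With no extensions given, every file is accepted, which matches the
    // default behaviour described in the thread (load all the files).
    static boolean accepts(String fileName, String[] extensions) {
        if (extensions == null || extensions.length == 0) {
            return true;
        }
        for (String ext : extensions) {
            if (fileName.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the accepted file names, preserving their original order.
    static List<String> select(List<String> fileNames, String[] extensions) {
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            if (accepts(name, extensions)) {
                selected.add(name);
            }
        }
        return selected;
    }
}
```

Keeping the check in a separate helper means the default path (no extensions, all files) stays unchanged, which is the same split the if/else in the proposed setUp() makes.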