Pig >> mail # user >> Small files


Anastasis Andronidis 2013-09-27, 14:36
Anastasis Andronidis 2013-09-30, 06:37
Ruslan Al-Fakikh 2013-09-30, 20:22
Re: Small files
Hi Anastasis,

Have you tried mounting HDFS as a local directory via hdfs-fuse? That might
get you around your upload problem.
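
Something along these lines should do it (hadoop-fuse-dfs ships with the CDH
packages; the namenode host/port and the paths below are just placeholders
for your setup):

  # mount HDFS under /mnt/hdfs via FUSE
  sudo mkdir -p /mnt/hdfs
  sudo hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs

  # after that, the small file can be edited/replaced like any local file
  cp myfile.txt /mnt/hdfs/user/anastasis/myfile.txt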

Thanks,
TianYi ZHU
On 30 September 2013 16:37, Anastasis Andronidis
<[EMAIL PROTECTED]> wrote:

> Hello again,
>
> any comments on this?
>
> Thanks,
> Anastasis
>
> On 27 Sep 2013, at 5:36 p.m., Anastasis Andronidis <
> [EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > I am working on a very small project for my university and I have a
> small cluster with 2 worker nodes and 1 master node. I'm using Pig to do
> some calculations and I have a question regarding small files.
> >
> > I have a UDF that reads a small input file (around 200 KB) and correlates
> it with the data in HDFS. My first approach was to upload the small file to
> HDFS and then access it from my UDF via getCacheFiles().
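> >
> > In other words (the paths here are just examples):
> >
> >   hadoop fs -put myfile.txt /user/anastasis/myfile.txt
> >
> > and getCacheFiles() in the UDF returns something like
> > "/user/anastasis/myfile.txt#myfile.txt", so the file is shipped through
> > the distributed cache into each task's working directory.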
> >
> > Later though, I needed to change things in this small file, and that
> meant deleting the file on HDFS, re-uploading it and re-running Pig. In the
> end I need to change this small file frequently, and I wanted to bypass HDFS
> (because all that read + write + read in Pig again is very, very slow over
> multiple iterations of my script), so what I did was:
> >
> > === pig script ===
> > %declare MYFILE `cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'`
> >
> > .... MyUDF( line, '$MYFILE') .....
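> >
> > (After parameter substitution the call effectively becomes something like
> > MyUDF(line, 'row1|row2|row3|'), i.e. the rows of myfile.txt joined with
> > "|".)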
> >
> > In the beginning, it worked great. But later (when my file grew past
> 100 KB or so) Pig would get stuck and I had to kill it:
> >
> > 2013-09-27 16:14:47,722 [main] INFO  org.apache.pig.tools.parameters.PreprocessorContext - Executing command : cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'
> > ^C2013-09-27 16:15:28,102 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Error executing shell command: cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'. Command exit with exit code of 130
> >
> > (btw, is this a bug or something? Should it hang like that?)
> >
> > How can I manage small files in cases like this, so I don't need to
> re-upload everything to HDFS every time, and make my iterations faster?
> >
> > Thanks,
> > Anastasis
>
>