-
how to pass a hdfs file to a c++ process
Zhixuan Zhu 2011-08-22, 19:57
Hi All,
I'm using hadoop-0.20.2 to try out some simple tasks. I asked a question about FileInputFormat a few days ago and got some prompt replies from this forum, which helped a lot. Thanks again! Now I have another question. I'm trying to invoke a C++ process from my mapper for each HDFS file in the input directory to achieve some parallel processing. But how do I pass the file to the program? I would like to do something like the following in my mapper:
Process lChldProc = Runtime.getRuntime().exec("myprocess -file $filepath");
How do I pass the hdfs filesystem to an outside process like that? Is HadoopStreaming the direction I should go?
Thanks very much for any reply in advance.
Best, Grace
-
Re: how to pass a hdfs file to a c++ process
Robert Evans 2011-08-23, 14:48
Hadoop streaming is the simplest way to do this, if your program is set up to take stdin as its input and write to stdout for its output, and each record (a "file" in your case) is a single line of text.
It needs to work with the following shell pipeline:
hadoop fs -cat <input_file> | head -1 | ./myprocess > output.txt
Ideally, what is stored in output.txt are lines of text whose order can be rearranged without affecting the result. (This is not a requirement unless you want to use a reduce too, but streaming will still try to parse it that way.)
If not there are tricks you can play to make it work, but they are kind of ugly.
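The contract described above (read records on stdin, write lines on stdout, tolerate reordering) can be tried locally before touching a cluster. The following is only a rough sketch, not part of the thread: `simulate_streaming` is a hypothetical helper name, and using `sort` on whole lines is an approximation of the shuffle, which actually sorts by key.

```shell
#!/bin/sh
# Rough local stand-in for a streaming job: mapper | sort | reducer.
# simulate_streaming is a hypothetical helper, not a Hadoop command.
# Usage: simulate_streaming MAPPER [REDUCER] < input
simulate_streaming() {
    mapper=$1
    # With no reducer given, cat passes the sorted lines through unchanged.
    reducer=${2:-cat}
    # sort stands in for the shuffle, which may reorder the mapper's output
    # lines; LC_ALL=C keeps the ordering independent of the local locale.
    "$mapper" | LC_ALL=C sort | "$reducer"
}
```

For example, `hadoop fs -cat <input_file> | head -1 | simulate_streaming ./myprocess > output.txt` runs the same single-record check as the pipeline above, with the extra sort step showing whether the output still makes sense once its lines are reordered.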
--Bobby Evans
-
Re: how to pass a hdfs file to a c++ process
Arun C Murthy 2011-08-23, 14:51
On Aug 22, 2011, at 12:57 PM, Zhixuan Zhu wrote:
> I'm trying to invoke a C++ process from my mapper for each
> hdfs file in the input directory to achieve some parallel processing.
That seems weird - why aren't you using more maps and one file per-map?
> But how do I pass the file to the program?
In any case, libhdfs is one way to do HDFS operations from C/C++.
Arun
-
RE: how to pass a hdfs file to a c++ process
Zhixuan Zhu 2011-08-23, 14:59
I'll actually invoke one executable from each of my maps. Because this C++ program has been implemented and used in the past, I just want to integrate it into our Hadoop map/reduce job without having to re-implement the process in Java. So my map is going to be very simple: it just calls the process and passes the input files.
Thanks, Grace
-
Re: how to pass a hdfs file to a c++ process
Arun Murthy 2011-08-23, 15:35
That is a normal use case.
I'd encourage you to use Java MR (even pig/hive).
If you really want to use your legacy app, use streaming with a map command such as 'hadoop fs -cat <file> | mylegacyexe'.
Arun
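The map command suggested above can be wrapped in a small script. This is a minimal sketch, not something posted in the thread. It assumes the streaming job's input is a text file listing one HDFS path per line. The names `map_wrapper.sh`, `process_paths`, `CAT_CMD` and `LEGACY_EXE` are hypothetical.

```shell
#!/bin/sh
# Hypothetical streaming mapper wrapper (saved as map_wrapper.sh).
# Assumption: each line arriving on stdin is one HDFS path.

# Both commands can be overridden, e.g. CAT_CMD=cat to try the loop on local files.
CAT_CMD="${CAT_CMD:-hadoop fs -cat}"
LEGACY_EXE="${LEGACY_EXE:-./mylegacyexe}"

process_paths() {
    while IFS= read -r path; do
        # Skip blank lines.
        [ -n "$path" ] || continue
        # $CAT_CMD is left unquoted on purpose so "hadoop fs -cat" splits into words.
        # </dev/null keeps the cat command from consuming the list of paths on stdin.
        $CAT_CMD "$path" </dev/null | "$LEGACY_EXE"
    done
}

# Run the loop only when invoked under the script's own name, so the
# function can also be sourced and tried by hand.
case "${0##*/}" in
    map_wrapper.sh) process_paths ;;
esac
```

On a 0.20.2 installation, such a job might be launched along the lines of `hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar -input <list_of_paths> -output <out_dir> -mapper map_wrapper.sh -file map_wrapper.sh -file mylegacyexe`. The jar location and paths depend on the installation, and the executable has to be runnable on the task nodes. How the list of paths is split across maps depends on the input format and is not covered here.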
-
RE: how to pass a hdfs file to a c++ process
Zhixuan Zhu 2011-08-23, 15:51
Thank you very much!
'hadoop fs -cat <file> | mylegacyexe' is exactly the kind of method I came up with and was going to try it out. I'm glad to hear that it's actually an "official" alternative.
Thanks again. This is a great forum!
Grace