-
From a newbie: Questions and will MapReduce fit our needs
Per Steffensen 2011-08-26, 11:13
Hi
We are considering using MapReduce for a project. I am taking part in an investigation phase in which we are trying to find out whether we would benefit from the MapReduce framework.

A little about the project:
We will receive data from the "outside world" as files via FTP. They will be a mix of very small files (50 records/lines) and very big files (5 million+ records/lines). The FTP server will run in a DMZ, where we have no plans to use any Hadoop technology. For every file arriving over FTP we will add a message (just pointing to that file) to an MQ that also runs in the DMZ; how we do that is not relevant to my questions here. In the secure zone of our system we plan to run many machines (shards, if you like) that will, among other things, be consumers on the MQ in the DMZ. Their job will be, among other things, to "load" (store in a database, index, etc.) the files pointed to by the messages they receive from the MQ. For reasonably small files they will probably just load the entire file themselves. For very big files we would like more machines/shards than the single one that happens to receive the message to take part in loading that particular file.

Questions:

- In general, do you think MapReduce will be beneficial for us? Please remember that the files to be loaded do not live on an HDFS. Any explanation of why you would suggest that we use MapReduce is very welcome.

- From what I read, MapReduce sounds like a general framework that can split a "big job" into many smaller "sub-jobs" and have those executed concurrently (potentially on different machines), all in order to complete the big job. That could be used for many things other than working with files, but then again the examples and some of the descriptions make it sound as if it is all about jobs working with files. Is MapReduce only useful for jobs related to working with files, or is it general-purpose, so that it is useful for any split-a-big-job-into-many-smaller-jobs-and-run-them-in-parallel problem?

- I believe we will end up having an HDFS over the disks of the machines/shards in the secure zone. Is HDFS a must-have for MapReduce to work at all? For example, HDFS might be the way sub-jobs are distributed and/or persisted (so that they are not forgotten in case of a shard breakdown or similar).

- It sounds like overhead to me to copy the big file (it will have to be deleted after successful loading) from the FTP server disk in the DMZ to the HDFS in the secure zone, just to be able to use MapReduce to distribute the work of loading it. We might want each sub-job (of a big job that loads, say, a big file big.txt) to just point to big.txt together with from- and to-indexes into the file. Each sub-job would then only have to read the part of big.txt between its from-index and to-index and load that. Can we do something like that with MapReduce, or is it all based on operating on files in HDFS?

- Depending on the answer to the previous question, we might want to make the disk on the FTP server "join" the HDFS, so that it is visible but the data on it is not replicated (for redundancy) across the disks on the shards (the "real" part of the HDFS). Remember that the file has to be deleted as soon as it has been loaded. Is there such a concept of making an "external" disk visible from HDFS, so that MapReduce can work on files on that disk without those files automatically being copied to several other disks (on the shards)?

- As I understand it, each sub-job (the result of the split operation) will run in a new, dedicated JVM. It sounds like a big overhead to start a new JVM just to run a small job. Is it correct that each sub-job runs in its own new JVM that has to be started for that purpose only? If yes, it seems to me that the overhead is only worth it for fairly large sub-jobs. Do you agree? If yes, I find the "WordCount" example on http://hadoop.apache.org/common/docs/current/mapred_tutorial.html kind of stupid, because it seems that each sub-job handles only one single line, and that seems far too small to be worth moving to a remote machine and starting a new JVM for. Do you agree that it is stupid (yes, it is just an example, I know), or what did I miss?

- Finally, with respect to side effects. When handling the files we plan to load the records into some kind of database (maybe several instances of a database). It is important that each record is inserted into one database exactly once. As I understand it, MapReduce will run every sub-job in several instances concurrently on several different machines, to make sure that it finishes quickly even if one of the attempts fails. Is that true? If yes, isn't that a big problem for sub-jobs with side effects (like inserting into a database)? Or is there some built-in assumption that all side effects happen on HDFS, and that HDFS supports some kind of transaction handling that makes it easy for MapReduce to roll back the side effects of one of the "identical" sub-jobs if two of them both succeed? In general, is it built in that each sub-job runs in one single transaction, so that a sub-job cannot partly succeed and partly fail? (For example, if it has to load 10000 records into a database and succeeds with 9999 of them, it might be stupid to roll everything back in order to try it all over again.)

I know my English is not perfect, but I hope you at least get the essence of my questions. I hope you will try to answer all the questions, even though some of them might seem stupid t
-
RE: From a newbie: Questions and will MapReduce fit our needs
MONTMORY Alain 2011-08-26, 14:43
Hi,
I will try to answer your questions inline. I am not a Hadoop expert, but we are facing the same kind of problem in our project (dealing with files that are external to HDFS), and we use Hadoop.

-----Original message-----
From: Per Steffensen [mailto:[EMAIL PROTECTED]]
Sent: Friday 26 August 2011 13:13
To: [EMAIL PROTECTED]
Subject: From a newbie: Questions and will MapReduce fit our needs

> - In general, do you think MapReduce will be beneficial for us to use? Please remember that the files to be "loaded" do not live on an HDFS.

Response: Yes, because you could process the big file in parallel, and the parallelisation done by Hadoop is very effective. To process your file you need an InputFormat class that is able to read it. Here, two solutions (points 2 and 3 together form the second one):

1. You copy your file into the HDFS file system and use a FileInputFormat (for text-based files, Hadoop already provides some). The drawback is that the copy may take a long time (in our case that is unacceptable), and the copy is an extra cost in the overall processing.
2. You make your "BigFile" accessible from the Hadoop cluster nodes over NFS or another shared file system. The first job in your processing pipeline reads the file and splits it by record offset reference (Output1: records 0 to N, Output2: records N to M, and so on).
3. On each OutputX a map task is launched in parallel, which processes the file (still accessible through the shared FS) from record N to M according to the OutputX info.

> - Is MapReduce only useful for "jobs" related to "working with files", or is it more general-purpose, so that it is useful for any split-big-job-into-many-smaller-jobs-and-have-those-executed-in-parallel problem?

Response: Hadoop is not specialised only for files (although I think that is 99% of its use...). As I said before, your input is accessed through the InputFormat interface.

> - I believe we will end up having an HDFS over the disks on the machines/shards in the secure zone. Is HDFS a "must have" for MapReduce to work at all?

Response: Hadoop can work on other file systems (Amazon S3, for example) or with other kinds of input (such as a Cassandra NoSQL table), but I think a small HDFS is still needed to store the working space of running jobs. I think most usage relies on HDFS, which takes care of data locality: the JobTracker launches tasks on the node that holds the data on its local disk, to avoid network traffic...

> - We might want each "sub-job" to just point to big.txt together with from- and to-indexes into the file. Will we be able to do something like that using MapReduce, or is it all "based on operating on files on the HDFS"?

Response: I don't clearly understand everything you said, but it sounds to me not far from the solution we use and that I proposed in my previous response.

> - Is there such a concept/possibility of making an "external" disk visible from HDFS, to enable MapReduce to work on files on such disks without those files automatically being copied to several other disks (on the shards)?

Response: Hadoop jobs are (generally) Java jobs, so it is still possible to open files external to HDFS, provided they can be accessed (through NFS or another shared FS such as GlusterFS, GPFS, etc.).
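The "split by record offset reference" step described in points 2 and 3 can be sketched outside Hadoop. This is only an illustration under stated assumptions: records are newline-terminated lines, the file is reachable through an ordinary path on a shared mount, and the names `split_by_records` and `read_range` are hypothetical helpers, not Hadoop APIs. The first function plays the role of the first job (one scan that emits byte ranges aligned to record boundaries); the second is what each parallel map task would do with its range.

```python
import os
import tempfile


def split_by_records(path, records_per_split):
    """Scan the file once and return (start, end) byte ranges.

    Each range covers at most `records_per_split` whole lines, so no
    record is ever cut in two. (Hypothetical helper, not a Hadoop API.)
    """
    ranges = []
    start = 0    # byte offset where the current range begins
    count = 0    # records collected in the current range
    offset = 0   # byte offset just past the last line read
    with open(path, "rb") as f:
        for line in f:
            offset += len(line)
            count += 1
            if count == records_per_split:
                ranges.append((start, offset))
                start, count = offset, 0
    if count:    # last, shorter group of records
        ranges.append((start, offset))
    return ranges


def read_range(path, start, end):
    """What one map task would do: read only its own byte range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).decode("utf-8").splitlines()


if __name__ == "__main__":
    # Stand-in for the big file on the shared file system.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"".join(b"record-%d\n" % i for i in range(10)))
    try:
        for start, end in split_by_records(tmp.name, 4):
            print(start, end, read_range(tmp.name, start, end))
    finally:
        os.remove(tmp.name)
```

The cost of this variant is the one Alain's description implies: the splitting step has to read the whole file once before any parallel work starts.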
-
Re: From a newbie: Questions and will MapReduce fit our needs
Peyman Mohajerian 2011-08-26, 15:47
Hi,
You should definitely take a look at Apache Sqoop, as previously mentioned. If your file is large enough and you have several map tasks running and hitting your database concurrently, you will run into problems at the database level.

As for speculative execution (redundant attempts that Hadoop starts to deal with slow tasks), you have control over that in Hadoop. You can turn speculative execution off, or make sure that when one attempt has finished, the other attempt for the same input is shut down.

Good luck,
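On the insert-exactly-once concern: switching speculative execution off (in Hadoop 0.20 the job properties are `mapred.map.tasks.speculative.execution` and `mapred.reduce.tasks.speculative.execution`) does not stop the framework from re-running a task that failed part-way, so a loading step still benefits from tolerating repeats. Below is a minimal sketch of one way to do that, assuming each record can be identified by its source file and line number. SQLite is only a stand-in for whatever store is actually used, and the table layout and the names `load_records` and `count_records` are invented for illustration.

```python
import sqlite3


def load_records(conn, source_file, records):
    """Insert (line_no, payload) pairs; a repeated attempt is a no-op.

    The primary key (source_file, line_no) identifies a record, so a
    second attempt at the same slice cannot create duplicates.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " source_file TEXT NOT NULL,"
        " line_no INTEGER NOT NULL,"
        " payload TEXT NOT NULL,"
        " PRIMARY KEY (source_file, line_no))"
    )
    # One transaction per attempt: committed on success, rolled back on error.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO records (source_file, line_no, payload)"
            " VALUES (?, ?, ?)",
            [(source_file, n, p) for n, p in records],
        )


def count_records(conn):
    """Number of distinct records stored so far."""
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    batch = [(1, "first"), (2, "second"), (3, "third")]
    load_records(conn, "big.txt", batch)
    load_records(conn, "big.txt", batch)  # simulated duplicate attempt
    print(count_records(conn))
```

With a key like this, a retry after a partial failure only inserts the records that are still missing, which also addresses the "9999 of 10000" case without rolling everything back.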
-
Re: From a newbie: Questions and will MapReduce fit our needs
Per Steffensen 2011-08-29, 08:38
Hi
First of all, thanks for your great response. I have a few additional comments and questions that I hope you will have a look at. Thanks!

One additional question: Is Hadoop MapReduce production-ready at all? Is anyone using it in serious production? I ask mainly because of the version numbers (0.20 and 0.21), which do not make it sound like a production-ready tool.

Regards, Per Steffensen

MONTMORY Alain wrote:
> 1. you copy your file inside the HDFS file system and you use "FileInputFormat" (for text-based files some are already provided by Hadoop). The drawback is that the copy may be long... (in our case it is unacceptable) and this copy is an extra cost in the whole treatment

This is what I would like to avoid.

> 2. You make your "BigFile" accessible by NFS or another shared FS from the Hadoop cluster nodes. The first job in your treatment pipeline reads the file and splits it by record offset reference (Output1: records 0 to N, Output2: N to M, and so on...)
> 3. On each OutputX a map task is launched in parallel, which will treat the file (still accessible through the shared FS) from record N to M according to the OutputX info

Points 2 and 3 are closer to what I would like to aim at, except that if the split task has to split correctly with respect to where a new record starts, it needs to read the file. I would prefer that the split job just reads the size of the file and then splits it into X equally sized slices. The map tasks will then need to be a little intelligent, e.g. obeying a rule like "if my slice starts in the middle of a record, just assume that someone else is handling that record, and if my slice ends in the middle of a record, I will handle that record".

But we can just make our own splitter that looks at the file (through NFS or SSHFS, which we prefer) and does the split as we see fit (e.g. as explained above)?

Kind of what I thought.

With the way we plan to do it, none of the machines will have the data "locally", so the JobTracker can choose any machine; they are all equally "efficient" with respect to reading the file.

Yes, you kind of already answered it. It is not necessary to copy the file from the FTP server to the HDFS to be able to work with it in MapReduce. We can just, in the split phase, look at the non-HDFS file and split it in our own splitter. And our own mapper can just read that non-HDFS file according to the information from the splitter.

Yes.

Thanks. We need to keep that in mind.

Yes, I know, but anyway...

Yes, we have more or less established by now that a map job handling only one line is too small a sub-job (it does not take at least a minute). My question was more about whether the example is stupid. As I read the example, each map job handles only one line (there is no looping over many lines, there is only "String line = value.toString();" in the Map.map method), and that makes me think that the example is stupid (the map tasks are way too small). I guess it is the TextInputFormat that does the splitting, and that it splits into slices of only one line each. Is it correct that TextInputFormat splits into slices of only one line, so that each map task only has to deal with one line, and isn't that stupid (because each map task will be way too small)? Or what did I miss? E.g. is the Map.map method called many times for each slice/map task?

I will take a look at Apache Sqoop, but the part about loading data into the database is really not the hard part. And I am not sure Apache Sqoop will meet our requirements. It will not be enough for us just to insert into a normal relational database; we need to insert into in some database suppo
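The slicing rule proposed above ("if my slice starts in the middle of a record, someone else handles it; if my slice ends in the middle of a record, I handle it") can be sketched in a few lines. This is an illustration only: it assumes newline-terminated records and, to stay self-contained, works on bytes held in memory rather than seeking in a file over NFS or SSHFS; `equal_slices` and `records_in_slice` are hypothetical names. As far as I understand it, Hadoop's own line-oriented reader treats split boundaries in much the same way, and a map task calls the map method once per record in its split rather than being started once per line.

```python
def equal_slices(file_size, n):
    """Cut [0, file_size) into n contiguous byte ranges of near-equal size.

    Only the file size is needed, so the splitter never reads the data.
    """
    bounds = [file_size * i // n for i in range(n + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n)]


def records_in_slice(data, start, end):
    """Return the records owned by the slice [start, end).

    Ownership rule: a record belongs to the slice that contains its
    first byte, even if the record runs past the end of that slice.
    """
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        # The slice begins mid-record: the previous slice owns that
        # record, so skip ahead to the start of the next one.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl   # last record may lack "\n"
        records.append(data[pos:stop].decode("utf-8"))
        pos = stop + 1
    return records


if __name__ == "__main__":
    data = b"alpha\nbeta\ngamma\ndelta\nepsilon\n"
    for start, end in equal_slices(len(data), 3):
        # A map task would call its map function once per record here.
        print((start, end), records_in_slice(data, start, end))
```

Every record is produced by exactly one slice, whatever the number of slices, and the only extra I/O is each task reading a little past its own end to finish its last record.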
-
Re: From a newbie: Questions and will MapReduce fit our needs
Per Steffensen 2011-08-29, 08:48
Thanks for your reply.
Peyman Mohajerian wrote:
> You should definitely take a look at Apache Sqoop as previously mentioned, if your file is large enough and you have several map jobs running and hitting your database concurrently, you will experience issues at the db level.

I believe several map jobs will not hit the same database concurrently, at least not to any great degree, because I believe we will run one separate, isolated database on each machine. I guess it will be a Solr/Lucene database on each machine, because we need to do full-text searches on some of the data, and separate, isolated databases on each machine/shard are the way Solr/Lucene scales over many machines while keeping index sizes bounded. Only querying will involve all databases on all machines; inserting new data records will only involve the "local" database.

But then again, I am curious about what Apache Sqoop can do to help with the problem you mention. What can a framework do about the fact that many concurrent inserts into the same database will eventually make the database a bottleneck? That is an inherent problem, and I cannot see how any framework can help you with it. But please enlighten me.

> In terms of speculative jobs (redundant jobs) running to deal with slow jobs, you have control over that in Hadoop. You can turn off speculative jobs or make sure when one job is finished the other one for the same input file is shutdown.

Thanks, we will do that.
-
Re: From a newbie: Questions and will MapReduce fit our needs
Per Steffensen 2011-08-29, 09:04
Can you point me to a good place to read about Sqoop? I can only find http://incubator.apache.org/projects/sqoop.html and https://cwiki.apache.org/confluence/display/SQOOP. There is really not much to find there about what Sqoop can do, how to use it, etc.

Regards, Per Steffensen
-
Re: From a newbie: Questions and will MapReduce fit our needs
arvind@...) 2011-08-29, 15:24
On Mon, Aug 29, 2011 at 2:04 AM, Per Steffensen <[EMAIL PROTECTED]> wrote:
> Can you point me to a good place to read about Sqoop. I only find
> http://incubator.apache.org/projects/sqoop.html and
> https://cwiki.apache.org/confluence/display/SQOOP. There is really not much
> to find, about what Sqoop can do, how to use it etc.

Please see the Sqoop user guide: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html

Thanks,
Arvind
-
RE: From a newbie: Questions and will MapReduce fit our needs
MONTMORY Alain 2011-08-29, 18:12
Hi,
I have very little time (milestone on my project...), so I am responding very quickly to your questions, sorry! See in the text...

Good luck! Regards,
Alain

From: Per Steffensen [mailto:[EMAIL PROTECTED]]
Sent: Monday 29 August 2011 10:39
To: [EMAIL PROTECTED]
Subject: Re: From a newbie: Questions and will MapReduce fit our needs

Hi

First of all thanks for your great response. I have a few additional comments and questions that I hope you will have a look at. Thanks!

One additional question: Is Hadoop MapReduce at all production-ready? Is anyone using it in serious production? The main reason I ask is the version numbers (0.20 and 0.21), which do not make it sound like a production-ready tool.

Response: Don't be afraid of the 0.XX numbering, Hadoop is already used in production by many customers! Avoid version 0.21, take 0.20.xx. See http://wiki.apache.org/hadoop/PoweredBy

Regards, Per Steffensen

MONTMORY Alain wrote:

Hi,

I am going to try to respond to your questions in the text. I am not a Hadoop expert, but we are facing the same kind of problem (dealing with files which are external to HDFS) in our project and we use Hadoop.

-----Original Message-----
From: Per Steffensen [mailto:[EMAIL PROTECTED]]
Sent: Friday 26 August 2011 13:13
To: [EMAIL PROTECTED]
Subject: From a newbie: Questions and will MapReduce fit our needs

Hi

We are considering using MapReduce for a project. I am participating in an "investigation" phase where we try to find out whether we would benefit from using the MapReduce framework.

A little bit about the project:
We will be receiving data from the "outside world" in files via FTP. It will be a mix of very small files (50 records/lines) and very big files (5mio+ records/lines). The FTP server will be running in a DMZ where we have no plans of using any Hadoop technology. For every file arriving over FTP we will add a message (just pointing to that file) to an MQ also running in the DMZ - how we do that is not relevant for my questions here. In the secure zone of our system we plan to run many machines (shards if you like), among other things being consumers on the MQ in the DMZ. Their job will be, among other things, to "load" (storing in db, indexing etc.) the files pointed to by the messages they receive from the MQ. For reasonably small files they will probably just do the "loading" of the entire file themselves. For very big files we would like to have more machines/shards than the single machine/shard that happens to receive the corresponding message participating in "loading" that particular file.

Questions:

- In general, do you think MapReduce will be beneficial for us to use? Please remember that the files to be "loaded" do not live on an HDFS. Any descriptions of why you would suggest that we use MapReduce will be very welcome.

Response: Yes, because you could process the "big file" in parallel, and the parallelisation done by Hadoop is very effective. To process your file you need an InputFormat class which is able to read it. Here, two solutions:

1. You copy your file into the HDFS file system and you use "FileInputFormat" (for text-based files some are already provided by Hadoop). Drawback: the copy may take long (in our case it is unacceptable), and this copy is an extra cost in the whole processing.

This is what I would like to avoid.

2. You make your "BigFile" accessible by NFS or another shared FS from the Hadoop cluster nodes. The first job in your processing pipeline reads the file and splits it by record offset reference (Output1: record from 0 to N, Output2: N to M and so on...)

3. For each OutputX a Map task is launched in parallel, which will process the file (still accessible through the shared FS) from record N to M according to the OutputX info.

2. and 3. are more what I would like to aim at, except that if the split task needs to split correctly with respect to where a new record starts, it needs to read the file. I would prefer that the split job just reads the size of the file and then splits it into X equally sized slices. The map tasks will then need to be a little intelligent, e.g. obeying a rule like "if my slice starts in the middle of a record, just assume that someone else is handling that record, and if my slice ends in the middle of a record, I will handle that record".

Response: Yes, you can do that...

- Reading about MapReduce it sounds to be a general framework able to split a "big job" into many smaller "sub-jobs", and have those "sub-jobs" executed concurrently (potentially on other different machines), all-in-all to complete the "big job". This could be used for many other things than "working with files", but then again examples and some of the descriptions make it sound like it is all only about "jobs working with files". Is MapReduce only useful/concerned with "jobs" related to "working with files", or is it more general-purpose so that it is useful for any split-big-job-into-many-smaller-jobs-and-have-those-executed-in-parallel problem?

Response: Hadoop is not specialised only for files (though I think that is 99% of its use...). As I said before, your input is accessible through the InputFormat interface.

But we can just make our own splitter that looks at the file (through NFS or SSHFS (as we prefer)) and does the split as we see fit (e.g. as explained above)?

Response: Yes, you can do that... The RecordReader class is part of InputFormat; you should focus on this point...

- I believe we will end up having an HDFS over the disks on the machines/shards in the secure zone. Is HDFS a "must have" for MapReduce to work at all? E.g. HDFS might be the way sub-jobs are distributed and/or persisted (so that they will not be forgotten in case of a shard breakdown or something).

Response: Hadoop can work on other FS (Amazon S3 for example), or with other styles of input (like a NoSQL Cassandra table), but I think there is a need for either a small HDFS to
-
Re: From a newbie: Questions and will MapReduce fit our needs
Per Steffensen 2011-08-30, 06:41
Thanks for your response.
Ahhh, ok. I missed the RecordReader part of InputFormat/split. I guess/expect that the RecordReader is called on the "map" side (as opposed to the "split" side) of the flow, so that not all the information provided by the RecordReader to the Mapper has to be transported from the node doing the split to the node doing a concrete map.