

Re: From a newbie: Questions and will MapReduce fit our needs
Hi,

You should definitely take a look at Apache Sqoop, as previously mentioned.
If your file is large enough and you have several map tasks running and
hitting your database concurrently, you will experience issues at the db
level.
As for speculative (redundant) tasks launched to deal with slow tasks, you
have control over that in Hadoop. You can turn speculative execution off,
or let the framework shut down the duplicate attempt once one attempt for
the same input finishes.
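With the classic JobConf API that is just two setters - a minimal sketch
(the equivalent config properties are mapred.map.tasks.speculative.execution
and mapred.reduce.tasks.speculative.execution):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationConfig {
        // Disable speculative (redundant) attempts for map and reduce
        // tasks; with speculation on, Hadoop launches duplicates of slow
        // tasks and kills the slower attempt once one of them succeeds.
        public static void disableSpeculation(JobConf conf) {
            conf.setMapSpeculativeExecution(false);
            conf.setReduceSpeculativeExecution(false);
        }
    }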

Good Luck,

On Fri, Aug 26, 2011 at 7:43 AM, MONTMORY Alain <
[EMAIL PROTECTED]> wrote:

>  Hi,
>
> I will try to respond to your message inline. I am not a Hadoop expert,
> but we are facing the same kind of problem (dealing with files that are
> external to HDFS) in our project, and we use Hadoop.
>
>
>
> -----Original Message-----
> From: Per Steffensen [mailto:[EMAIL PROTECTED]]
> Sent: Friday, 26 August 2011 13:13
> To: [EMAIL PROTECTED]
> Subject: From a newbie: Questions and will MapReduce fit our needs
>
> Hi
>
> We are considering using MapReduce for a project. I am participating in
> an "investigation" phase where we try to determine whether we would
> benefit from using the MapReduce framework.
>
> A little bit about the project:
> We will be receiving data from the "outside world" in files via FTP. It
> will be a mix of very small files (50 records/lines) and very big files
> (5 million+ records/lines). The FTP server will be running in a DMZ where
> we have no plans of using any Hadoop technology. For every file arriving
> over FTP we will add a message (just pointing to that file) to an MQ also
> running in the DMZ - how we do that is not relevant for my questions here.
> In the secure zone of our system we plan to run many machines (shards if
> you like) which will, among other things, be consumers on the MQ in the
> DMZ. Their job will be, among other things, to "load" (storing in db,
> indexing etc.) the files pointed to by the messages they receive from the
> MQ. For reasonably small files they will probably just do the "loading"
> of the entire file themselves. For very big files we would like to have
> more machines/shards participating in "loading" that particular file than
> just the single machine/shard that happens to receive the corresponding
> message.
>
> Questions:
>
> - In general, do you think MapReduce will be beneficial for us to use?
> Please remember that the files to be "loaded" do not live on HDFS.
> Any description of why you would suggest that we use MapReduce will be
> very welcome.
>
> Response: Yes, because you could process the "big file" in parallel, and
> the parallelisation done by Hadoop is very effective. To process your
> file you need to have an InputFormat class which is able to read it.
> Here are two solutions; the second one consists of steps 2 and 3 below
> (sketches of both follow after the list):
>
>    1. You copy your file into the HDFS file system and use a
>    "FileInputFormat" (for text-based files some are already provided by
>    Hadoop). The inconvenience is that the copy may be long… (in our case
>    it is unacceptable) and this copy is an extra cost in the whole
>    processing.
>
>    2. You make your "BigFile" accessible by NFS or another shared FS
>    from the Hadoop cluster nodes. The first job in your processing
>    pipeline reads the file and splits it by record offset *reference*
>    (Output1: records from 0 to N, Output2: N to M, and so on…).
>
>    3. For each OutputX a map task is launched in parallel, which
>    processes the file (still accessible through the shared FS) from
>    record N to M according to the OutputX info.
>
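> To illustrate solution 1, a minimal driver using the old "mapred" API
> (class names and paths are invented for the example; your own mapper
> would do the actual "loading"):
>
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.mapred.FileInputFormat;
>     import org.apache.hadoop.mapred.FileOutputFormat;
>     import org.apache.hadoop.mapred.JobClient;
>     import org.apache.hadoop.mapred.JobConf;
>     import org.apache.hadoop.mapred.TextInputFormat;
>
>     public class LoadJob {
>         public static void main(String[] args) throws Exception {
>             JobConf conf = new JobConf(LoadJob.class);
>             conf.setJobName("load-bigfile");
>             // TextInputFormat splits the HDFS copy by blocks and hands
>             // each map task one chunk of lines.
>             conf.setInputFormat(TextInputFormat.class);
>             FileInputFormat.setInputPaths(conf, new Path(args[0]));
>             FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>             // conf.setMapperClass(...); // your "loading" mapper here
>             JobClient.runJob(conf);
>         }
>     }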
>
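> And a rough sketch of the offset-splitting idea from steps 2 and 3, in
> plain Java (all names are invented for the example; in the real pipeline
> the first job would write the ranges to the OutputX files, and each map
> task would call processRange with its own range):
>
>     import java.io.IOException;
>     import java.io.RandomAccessFile;
>
>     public class RangeLoader {
>         // Step 2: compute n byte ranges over the shared-FS file; each
>         // range becomes one "OutputX" for a map task to process.
>         public static long[][] splitByOffset(String path, int n)
>                 throws IOException {
>             try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
>                 long len = f.length(), chunk = len / n;
>                 long[][] ranges = new long[n][2];
>                 for (int i = 0; i < n; i++) {
>                     ranges[i][0] = i * chunk;
>                     ranges[i][1] = (i == n - 1) ? len : (i + 1) * chunk;
>                 }
>                 return ranges;
>             }
>         }
>
>         // Step 3: what each map task does with its range - seek, align
>         // to the next record boundary, and "load" records until a record
>         // starts at or past the end offset.
>         public static void processRange(String path, long start, long end)
>                 throws IOException {
>             try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
>                 f.seek(start);
>                 if (start != 0) f.readLine(); // skip the partial record
>                 String record;
>                 while (f.getFilePointer() < end
>                         && (record = f.readLine()) != null) {
>                     // store "record" in the db, index it, etc.
>                 }
>             }
>         }
>     }
>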
> - Reading about MapReduce it sounds like a general framework able to
> split a "big job" into many smaller "sub-jobs", and have those
> "sub-jobs" executed concurrently (potentially on different machines),
> all-in-all to complete the "big job". This could be used for many other
> things than "working with files", but then again the examples and some
> of the descriptions make it sound like it is all only about "jobs
> working with files". Is MapReduce only useful/concerned with "jobs"
Later replies in this thread:
  Per Steffensen 2011-08-29, 08:48
  Per Steffensen 2011-08-29, 09:04
  arvind@... 2011-08-29, 15:24
  Per Steffensen 2011-08-29, 08:38
  MONTMORY Alain 2011-08-29, 18:12
  Per Steffensen 2011-08-30, 06:41