|
|
-
Processing 10MB files in Hadoop
Cubic 2009-11-26, 12:02
Hi list.
I have small files containing data that has to be processed. A file can be small, even down to 10MB (but it can me also 100-600MB large) and contains at least 30000 records to be processed. Processing one record can take 30 seconds to 2 minutes. My cluster is about 10 nodes. Each node has 16 cores.
Anybody can give an idea about how to deal with these small files? It is not quite a common Hadoop task; I know. For example, how many map tasks should I set in this case?
-
Re: Processing 10MB files in Hadoop
Siddu 2009-11-26, 13:42
On Thu, Nov 26, 2009 at 5:32 PM, Cubic <[EMAIL PROTECTED]> wrote:
> Hi list. > > I have small files containing data that has to be processed. A file > can be small, even down to 10MB (but it can me also 100-600MB large) > and contains at least 30000 records to be processed. > Processing one record can take 30 seconds to 2 minutes. My cluster is > about 10 nodes. Each node has 16 cores. > Sorry for deviating from the question , but curious to know what does core here refer to ? > Anybody can give an idea about how to deal with these small files? It > is not quite a common Hadoop task; I know. For example, how many map > tasks should I set in this case? >
-- Regards, ~Sid~ I have never met a man so ignorant that i couldn't learn something from him
-
Re: Processing 10MB files in Hadoop
Jeff Zhang 2009-11-26, 13:57
The number of mapper is determined by your InputFormat.
In common case, if file is smaller than one block size (which is 64M by default), one mapper for this file. if file is larger than one block size, hadoop will split this large file, and the number of mapper for this file will be ceiling ( (size of file)/(size of block) )
Jeff Zhang
On Thu, Nov 26, 2009 at 5:42 AM, Siddu <[EMAIL PROTECTED]> wrote:
> On Thu, Nov 26, 2009 at 5:32 PM, Cubic <[EMAIL PROTECTED]> wrote: > > > Hi list. > > > > I have small files containing data that has to be processed. A file > > can be small, even down to 10MB (but it can me also 100-600MB large) > > and contains at least 30000 records to be processed. > > Processing one record can take 30 seconds to 2 minutes. My cluster is > > about 10 nodes. Each node has 16 cores. > > > Sorry for deviating from the question , but curious to know what does core > here refer to ? > > > > Anybody can give an idea about how to deal with these small files? It > > is not quite a common Hadoop task; I know. For example, how many map > > tasks should I set in this case? > > > > > > -- > Regards, > ~Sid~ > I have never met a man so ignorant that i couldn't learn something from him >
-
Re: Processing 10MB files in Hadoop
CubicDesign 2009-11-26, 15:36
> Sorry for deviating from the question , but curious to know what does core > here refer to ? > http://en.wikipedia.org/wiki/Multi-core
-
Re: Processing 10MB files in Hadoop
CubicDesign 2009-11-26, 15:39
> The number of mapper is determined by your InputFormat. > > In common case, if file is smaller than one block size (which is 64M by > default), one mapper for this file. if file is larger than one block size, > hadoop will split this large file, and the number of mapper for this file > will be ceiling ( (size of file)/(size of block) ) > > Hi
Do you mean, I should set the number of map tasks to 1 ???? I want to process this file not in a single node but over the entire cluster. I need a lot of processing power in order to finish the job in hours instead of days.
-
Re: Processing 10MB files in Hadoop
Jeff Zhang 2009-11-26, 15:41
Actually, you do not need to set the number of map task, the InputFormat will compute it for you according your input data set.
Jeff Zhang On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <[EMAIL PROTECTED]> wrote:
> > The number of mapper is determined by your InputFormat. >> >> In common case, if file is smaller than one block size (which is 64M by >> default), one mapper for this file. if file is larger than one block size, >> hadoop will split this large file, and the number of mapper for this file >> will be ceiling ( (size of file)/(size of block) ) >> >> >> > Hi > > Do you mean, I should set the number of map tasks to 1 ???? > I want to process this file not in a single node but over the entire > cluster. I need a lot of processing power in order to finish the job in > hours instead of days. >
-
Re: Processing 10MB files in Hadoop
CubicDesign 2009-11-26, 15:58
But the documentation DO recommend to set it: http://wiki.apache.org/hadoop/HowManyMapsAndReducesPS: I am using streaming Jeff Zhang wrote: > Actually, you do not need to set the number of map task, the InputFormat > will compute it for you according your input data set. > > Jeff Zhang > > > On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <[EMAIL PROTECTED]> wrote: > > >> The number of mapper is determined by your InputFormat. >> >>> In common case, if file is smaller than one block size (which is 64M by >>> default), one mapper for this file. if file is larger than one block size, >>> hadoop will split this large file, and the number of mapper for this file >>> will be ceiling ( (size of file)/(size of block) ) >>> >>> >>> >>> >> Hi >> >> Do you mean, I should set the number of map tasks to 1 ???? >> I want to process this file not in a single node but over the entire >> cluster. I need a lot of processing power in order to finish the job in >> hours instead of days. >> >> > >
-
Re: Processing 10MB files in Hadoop
Jeff Zhang 2009-11-26, 16:05
Quote from the wiki doc *The number of map tasks can also be increased manually using the JobConf< http://wiki.apache.org/hadoop/JobConf>'sconf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.* So the number of map task is determited by InputFormat. But you can manually set the number of reducer task to improve the performance, because the default number of reducer task is 1 Jeff Zhang On Thu, Nov 26, 2009 at 7:58 AM, CubicDesign <[EMAIL PROTECTED]> wrote: > But the documentation DO recommend to set it: > http://wiki.apache.org/hadoop/HowManyMapsAndReduces> > > > PS: I am using streaming > > > > > Jeff Zhang wrote: > >> Actually, you do not need to set the number of map task, the InputFormat >> will compute it for you according your input data set. >> >> Jeff Zhang >> >> >> On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <[EMAIL PROTECTED]> >> wrote: >> >> >> >>> The number of mapper is determined by your InputFormat. >>> >>> >>>> In common case, if file is smaller than one block size (which is 64M by >>>> default), one mapper for this file. if file is larger than one block >>>> size, >>>> hadoop will split this large file, and the number of mapper for this >>>> file >>>> will be ceiling ( (size of file)/(size of block) ) >>>> >>>> >>>> >>>> >>>> >>> Hi >>> >>> Do you mean, I should set the number of map tasks to 1 ???? >>> I want to process this file not in a single node but over the entire >>> cluster. I need a lot of processing power in order to finish the job in >>> hours instead of days. >>> >>> >>> >> >> >> >
-
Re: Processing 10MB files in Hadoop
Jason Venner 2009-11-26, 18:14
Are the record processing steps bound by a local machine resource - cpu, disk io or other? What I often do when I have lots of small files to handle is use the NlineInputFormat, as data locality for the input files is a much lessor issue than short task run times in that case, Each line of my input file would be one of the small files, and then I would set the number of files per split to be some reasonable number. If the individual record processing is not bound by local resources you may wish to try the MultithreadedMapRunner, which gives you a lot of flexibily about the number of map executions you run in parallel without needing to restart your cluster to change the tasks per tracker. On Thu, Nov 26, 2009 at 8:05 AM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > Quote from the wiki doc > > *The number of map tasks can also be increased manually using the > JobConf< http://wiki.apache.org/hadoop/JobConf>'s> conf.setNumMapTasks(int num). This can be used to increase the number of > map > tasks, but will not set the number below that which Hadoop determines via > splitting the input data.* > > So the number of map task is determited by InputFormat. > But you can manually set the number of reducer task to improve the > performance, because the default number of reducer task is 1 > > > Jeff Zhang > > On Thu, Nov 26, 2009 at 7:58 AM, CubicDesign <[EMAIL PROTECTED]> > wrote: > > > But the documentation DO recommend to set it: > > http://wiki.apache.org/hadoop/HowManyMapsAndReduces> > > > > > > > PS: I am using streaming > > > > > > > > > > Jeff Zhang wrote: > > > >> Actually, you do not need to set the number of map task, the InputFormat > >> will compute it for you according your input data set. > >> > >> Jeff Zhang > >> > >> > >> On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <[EMAIL PROTECTED]> > >> wrote: > >> > >> > >> > >>> The number of mapper is determined by your InputFormat. > >>> > >>> > >>>> In common case, if file is smaller than one block size (which is 64M > by > >>>> default), one mapper for this file. if file is larger than one block > >>>> size, > >>>> hadoop will split this large file, and the number of mapper for this > >>>> file > >>>> will be ceiling ( (size of file)/(size of block) ) > >>>> > >>>> > >>>> > >>>> > >>>> > >>> Hi > >>> > >>> Do you mean, I should set the number of map tasks to 1 ???? > >>> I want to process this file not in a single node but over the entire > >>> cluster. I need a lot of processing power in order to finish the job in > >>> hours instead of days. > >>> > >>> > >>> > >> > >> > >> > > > -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymallwww.prohadoopbook.com a community for Hadoop Professionals
-
Re: Processing 10MB files in Hadoop
Yongqiang He 2009-11-26, 19:35
Try CombineFileInputFormat.
Thanks Yongqiang On 11/26/09 4:02 AM, "Cubic" <[EMAIL PROTECTED]> wrote:
> i list. > > I have small files containing data that has to be processed. A file > can be small, even down to 10MB (but it can me also 100-600MB large) > and contains at least 30000 records to be processed. > Processing one record can take 30 seconds to 2 minutes. My cluster is > about 10 nodes. Each node has 16 cores. > > Anybody can give an idea about how to deal with these small files? It > is not quite a common Hadoop task; I know. For example, how many map > tasks should I set in this case? > >
-
Re: Processing 10MB files in Hadoop
CubicDesign 2009-11-26, 20:23
> Are the record processing steps bound by a local machine resource - cpu, > disk io or other? > Some disk I/O. Not so much compared with the CPU. Basically it is a CPU bound. This is why each machine has 16 cores. > What I often do when I have lots of small files to handle is use the > NlineInputFormat, Each file contains a complete/independent set of records. I cannot mix the data resulted from processing two different files. --------- Ok. I think I need to re-explain my problem :) While running jobs on these small files, the computation time was almost 5 times longer than expected. It looks like the job was affected by the number of map task that I have (100). I don't know which are the best parameters in my case (10MB files).
I have zero reduce tasks.
-
Re: Processing 10MB files in Hadoop
Aaron Kimball 2009-11-27, 19:02
More importantly: have you told Hadoop to use all your cores?
What is mapred.tasktracker.map.tasks.maximum set to? This defaults to 2. If you've got 16 cores/node, you should set this to at least 15--16 so that all your cores are being used. You may need to set this higher, like 20, to ensure that cores aren't being starved. Measure with ganglia or top to make sure your CPU utilization is up to where you're satisfied. (Note: this is a tasktracker setting, not a job setting. you'll need to set this on every node, then restart the mapreduce cluster to take effect.)
Of course, you need to have enough RAM to make sure that all these tasks can run concurrently without swapping. Swapping will destroy your performance. Then again, if you bought 16-way machines, presumably you didn't cheap out in that department :)
100 tasks is not an absurd number. For large data sets (e.g., TB scale), I have seen several tens of thousands of tasks.
In general, yes, running many tasks over small files is not a good fit for Hadoop, but 100 is not "many small files" -- you might see some sort of speed up by coalescing multiple files into a single task, but when you hear problems with processing many small files, folks are frequently referring to something like 10,000 files where each file is only a few MB, and the actual processing per record is extremely cheap. In cases like this, task startup times severely dominate actual computation time. If your individual records require around a minute each to process as you claimed earlier, you're nowhere near in danger of hitting that particular performance bottleneck.
- Aaron On Thu, Nov 26, 2009 at 12:23 PM, CubicDesign <[EMAIL PROTECTED]> wrote:
> > > Are the record processing steps bound by a local machine resource - cpu, >> disk io or other? >> >> > Some disk I/O. Not so much compared with the CPU. Basically it is a CPU > bound. This is why each machine has 16 cores. > > What I often do when I have lots of small files to handle is use the >> NlineInputFormat, >> > Each file contains a complete/independent set of records. I cannot mix the > data resulted from processing two different files. > > > --------- > Ok. I think I need to re-explain my problem :) > While running jobs on these small files, the computation time was almost 5 > times longer than expected. It looks like the job was affected by the number > of map task that I have (100). I don't know which are the best parameters in > my case (10MB files). > > I have zero reduce tasks. > > >
-
Re: Processing 10MB files in Hadoop
CubicDesign 2009-11-28, 00:07
Ok. I have set the number on maps to about 1760 (11 nodes * 16 cores/node * 10 as recommended by Hadoop documentation) and my job still takes several hours to run instead of one.
Can be the overhead added by Hadoop that big? I mean I have over 30000 small tasks (about one minute), each one starting its own JVM.
-
Re: Processing 10MB files in Hadoop
Patrick Angeles 2009-11-28, 00:17
What does the data look like?
You mention 30k records, is that for 10MB or for 600MB, or do you have a constant 30k records with vastly varying file sizes?
If the data is 10MB and you have 30k records, and it takes ~2 mins to process each record, I'd suggest using map to distribute the data across several reducers then do the actual processing on reduce.
On Fri, Nov 27, 2009 at 7:07 PM, CubicDesign <[EMAIL PROTECTED]> wrote:
> Ok. I have set the number on maps to about 1760 (11 nodes * 16 cores/node * > 10 as recommended by Hadoop documentation) and my job still takes several > hours to run instead of one. > > Can be the overhead added by Hadoop that big? I mean I have over 30000 > small tasks (about one minute), each one starting its own JVM. > > >
-
Re: Processing 10MB files in Hadoop
CubicDesign 2009-11-28, 00:24
30000 records in 10MB files. Files can vary and the number of records also can vary. > If the data is 10MB and you have 30k records, and it takes ~2 mins to > process each record, I'd suggest using map to distribute the data across > several reducers then do the actual processing on reduce. Hmmm... Good idea. Thanks. But is 'Reduce' optimized to do the heavy part of the computation?
-
Re: Processing 10MB files in Hadoop
CubicDesign 2009-11-28, 00:33
Aaron Kimball wrote: > (Note: this is a tasktracker setting, not a job setting. you'll need to set this on every > node, then restart the mapreduce cluster to take effect.) > Ok. And here is my mistake. I set this to 16 only on the main node not also on data nodes. Thanks a lot!!!!!! > Of course, you need to have enough RAM to make sure that all these tasks can > run concurrently without swapping. No problem!
> If your individual records require around a minute each to process as you claimed earlier, you're > nowhere near in danger of hitting that particular performance bottleneck. > > I was thinking that is I am under the recommended value of 64MB, Hadoop cannot properly calculate the number of tasks.
-
Re: Processing 10MB files in Hadoop
Aaron Kimball 2009-11-28, 02:21
By default you get at least one task per file; if any file is bigger than a block, then that file is broken up into N tasks where each is one block long. Not sure what you mean by "properly calculate" -- as long as you have more tasks than you have cores, then you'll definitely have work for every core to do; having more tasks with high granularity will also let nodes that get "small" tasks to complete many of them while other cores are stuck with the "heavier" tasks.
If you call setNumMapTasks() with a higher number of tasks than the InputFormat creates (via the algorithm above), then it should create additional tasks by dividing files up into smaller chunks (which may be sub-block-sized).
As for where you should run your computation.. I don't know that the "map" and "reduce" phases are really "optimized" for computation in any particular way. It's just a data motion thing. (At the end of the day, it's your code doing the processing on either side of the fence, which should dominate the execution time.) If you use an identity mapper with a pseudo-random key to spray the data into a bunch of reduce partitions, then you'll get a bunch of reducers each working on a hopefully-evenly-sized slice of the data. So the map tasks will quickly read from the original source data and forward the workload along to the reducers which do the actual heavy lifting. The cost of this approach is that you have to pay for the time taken to transfer the data from the mapper nodes to the reducer nodes and sort by key when it gets there. If you're only working with 600 MB of data, this is probably negligible. The advantages of doing your computation in the reducers is
1) You can directly control the number of reducer tasks and set this equal to the number of cores in your cluster. 2) You can tune your partitioning algorithm such that all reducers get roughly equal workload assignments, if there appears to be some sort of skew in the dataset.
The tradeoff is that you have to ship all the data to the reducers before computation starts, which sacrifices data locality and involves an "intermediate" data set of the same size as the input data set. If this is in the range of hundreds of GB or north, then this can be very time-consuming -- so it doesn't scale terribly well. Of course, by the time you've got several hundred GB of data to work with, your current workload imbalance issues should be moot anyway.
- Aaron On Fri, Nov 27, 2009 at 4:33 PM, CubicDesign <[EMAIL PROTECTED]> wrote:
> > > Aaron Kimball wrote: > >> (Note: this is a tasktracker setting, not a job setting. you'll need to >> set this on every >> node, then restart the mapreduce cluster to take effect.) >> >> > Ok. And here is my mistake. I set this to 16 only on the main node not also > on data nodes. Thanks a lot!!!!!! > > Of course, you need to have enough RAM to make sure that all these tasks >> can >> run concurrently without swapping. >> > No problem! > > > If your individual records require around a minute each to process as you >> claimed earlier, you're >> nowhere near in danger of hitting that particular performance bottleneck. >> >> >> > I was thinking that is I am under the recommended value of 64MB, Hadoop > cannot properly calculate the number of tasks. >
|
|