-
Task tracker timeout with filtered table scan
Bryan Keller 2012-05-31, 16:27
I have a large table that I am running a map reduce job on. The job scans for a particular column value in the table using a TableInputFormat with a filter on the scan. This value only matches a few rows, so most of the rows are filtered out.
The problem is that the TableInputFormat will not report status back to the task tracker until the regionserver sends back a row matching the filter. If there are only a few matching rows and the table is very large, it can take a while for a row to come back from the regionserver. This can result in a task tracker timeout. The problem is exacerbated by large region file sizes.
I can sort of work around this by increasing the mapred.task.timeout property, but that doesn't seem ideal. The other solution would be to drop the filter from the scan and filter out rows in the map reduce job instead, which would increase I/O. Any other solutions? It seems the TableInputFormat shouldn't have to wait for the regionserver before reporting status to the task tracker.
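To get a feel for when the timeout described above would kick in, here is a rough back-of-the-envelope sketch in plain Java. It assumes the worst case where no row in a region matches the filter, so the task stays silent for the whole region scan. The 600000 ms figure is the stock default for mapred.task.timeout in Hadoop 1.x; the class name, the region sizes, and the 10 MiB/s scan rate are hypothetical values chosen only for illustration, not measurements from this cluster.

```java
public class ScanSilenceEstimate {
    // Stock default of mapred.task.timeout in Hadoop 1.x: 600000 ms (10 minutes).
    public static final long DEFAULT_TASK_TIMEOUT_MS = 600_000L;

    // Milliseconds the task reports nothing while one region is scanned,
    // assuming no row in that region passes the filter.
    public static long silentScanMillis(long regionBytes, long bytesPerSecond) {
        return regionBytes * 1000L / bytesPerSecond;
    }

    // True if the silent period alone would exceed the given task timeout.
    public static boolean wouldTimeOut(long regionBytes, long bytesPerSecond, long timeoutMs) {
        return silentScanMillis(regionBytes, bytesPerSecond) > timeoutMs;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long scanRate = 10L * mib;            // hypothetical filtered scan rate
        long[] regionSizesGiB = {1L, 4L, 10L}; // hypothetical region sizes
        for (long gib : regionSizesGiB) {
            long regionBytes = gib * 1024L * mib;
            System.out.printf("%d GiB region: %d s silent, times out: %b%n",
                    gib,
                    silentScanMillis(regionBytes, scanRate) / 1000L,
                    wouldTimeOut(regionBytes, scanRate, DEFAULT_TASK_TIMEOUT_MS));
        }
    }
}
```

Under these assumed numbers only the largest region crosses the default timeout, which is consistent with the observation that larger region files make the problem worse.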
-
Re: Task tracker timeout with filtered table scan
Harsh J 2012-06-10, 19:10
Hey Bryan,
What value of scanner caching did you run this with? Could you try it with a low value, somewhere in the range 1-5?
-- Harsh J
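One possible reading of the scanner caching suggestion, sketched in plain Java: if the regionserver holds its response until it has collected as many matching rows as the scanner caching value asks for, then a lower caching value means each batch comes back sooner and the task gets a chance to report progress. That behaviour is an assumption of this sketch and is not stated in the thread. The model also ignores region boundaries and any result size cap, and the class name, match ratio, and scan rate are hypothetical.

```java
public class ScannerCachingEstimate {
    // Rows the regionserver is expected to read before it has `caching`
    // matches, if roughly one row in `rowsPerMatch` passes the filter.
    public static long rowsReadPerBatch(int caching, long rowsPerMatch) {
        return caching * rowsPerMatch;
    }

    // Seconds the client waits for one batch at the given scan rate.
    public static double secondsPerBatch(int caching, long rowsPerMatch, long rowsPerSecond) {
        return (double) rowsReadPerBatch(caching, rowsPerMatch) / rowsPerSecond;
    }

    public static void main(String[] args) {
        long rowsPerMatch = 1_000_000L; // hypothetical: one match per million rows
        long rowsPerSecond = 100_000L;  // hypothetical server-side scan rate
        for (int caching : new int[] {1, 5, 100}) {
            System.out.printf("caching=%d: about %.0f s until the first batch arrives%n",
                    caching, secondsPerBatch(caching, rowsPerMatch, rowsPerSecond));
        }
    }
}
```

With these made-up numbers, a caching value of 100 would leave the task silent for longer than the default 10 minute timeout, while values of 1 to 5 stay well under it. Real figures depend on the table and cluster.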