HBase >> mail # user >> Task tracker timeout with filtered table scan


Task tracker timeout with filtered table scan
I am running a MapReduce job on a large table. The job scans for a particular column value using a TableInputFormat with a filter on the scan. The value matches only a few rows, so most rows are filtered out.

The problem is that the TableInputFormat will not report status back to the task tracker until the regionserver sends back a row matching the filter. If there are only a few matching rows and the table is very large, it can take a long time for a row to come back from the regionserver, and the task tracker may time out the task in the meantime. The problem is worse with large region file sizes.

I can sort of work around this by increasing the mapred.task.timeout property, but that doesn't seem ideal. The other option would be to drop the filter and filter out rows in the MapReduce job itself, which would increase I/O. Are there any other solutions? It seems the TableInputFormat shouldn't have to wait for the regionserver to return a row before reporting status to the task tracker.
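
To make that last point concrete, here is a rough sketch of the behaviour I have in mind, written in plain Java with no Hadoop or HBase classes. Everything in it is hypothetical: `HeartbeatSketch`, `callWithHeartbeat`, and the `progress` callback are made-up names, and the slow `Callable` only stands in for a scanner call that blocks until the regionserver finds a matching row. In a real job the callback would presumably be whatever progress call the task exposes (for example Hadoop's `Progressable.progress()`), but I have not checked whether it can be wired into TableInputFormat's record reader.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: keep reporting progress while a blocking call is in flight.
public class HeartbeatSketch {

    // Runs slowCall on the calling thread and invokes progress every periodMillis
    // from a background daemon thread until slowCall returns or throws.
    public static <T> T callWithHeartbeat(Callable<T> slowCall,
                                          Runnable progress,
                                          long periodMillis) throws Exception {
        ScheduledExecutorService ticker =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "heartbeat");
                t.setDaemon(true); // never keeps the JVM alive on its own
                return t;
            });
        ticker.scheduleAtFixedRate(progress, periodMillis, periodMillis,
                                   TimeUnit.MILLISECONDS);
        try {
            return slowCall.call(); // stand-in for waiting on the next matching row
        } finally {
            ticker.shutdownNow();   // stop the heartbeat as soon as the call ends
        }
    }

    public static void main(String[] args) throws Exception {
        String row = callWithHeartbeat(
            () -> { Thread.sleep(500); return "matching-row"; }, // simulated slow scan
            () -> System.out.println("progress"),                // simulated status report
            100);
        System.out.println(row);
    }
}
```

One caveat with this approach: a heartbeat that runs independently of the scan would also keep a genuinely hung task alive, so it trades away part of what the timeout is there for.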