Peter Hall 2011-09-01, 23:52
Ken Krugler 2011-09-03, 20:49
We thought you wanted each mapper to run against an individual partition - if you want to just pull out a specific partition say the JAN2010 partition - you can achieve that in OraOop currently using a WHERE clause that would only pick that partition.
OraOop would then split the blocks of the entire table across all the mappers - the problem of course a lot of those blocks would not match the WHERE clause and hence not be selecting any rows. From my testing - Oracle seems to be smart enough that if you request rowid ranges and also specify a WHERE clause on the partition key such that no rows could match - it realises this and does not actually read the blocks. So OraOop would get maximum performance in this case (aside from the slight overhead to work out there were no blocks in that range - which shouldn't be much in the scheme of a big import job).
The main problem here is because OraOop is splitting across all the partitions - it could potentially be unbalanced and some mappers would process more rows than others. Are you experiencing problems with mappers being unbalanced? You can try changing the oraoop.block.allocation property to RANDOM (by default it is ROUNDROBIN).
Also Oracle should not spawn any parallel queries for OraOop queries - we have specifically disabled it as it would cause overhead on the database.
So in summary - to get the best possible performance you should run one OraOop job that fetches all the rows you want - and OraOop should balance the data across all the mappers and try to minimise the load on the DB by only reading each block once. The ROWIDs we scan also have the partition ID in them so we only ever scan one partition at a time - then UNION ALL the results together - as you have probably seen if you trace a session.
If there is a particular import job you are trying to tune you could send us some details about it - some screenshots from OEM and log files etc and we could have a look at it with you.
From: Ken Krugler [mailto:[EMAIL PROTECTED]]
Sent: Sunday, 4 September 2011 6:49 AM
To: [EMAIL PROTECTED]
Subject: Re: [sqoop-user] Support for partitioning during export into HDFS
On Sep 1, 2011, at 4:52pm, Peter Hall wrote:
We did initially consider an approach similar to what you suggest, but decided not to go with it due to complexities when the number of mappers is different to the number of partitions.
OK - but note that what I'm asking for is the ability to restrict a given Sqoop import request to one partition.
We can run multiple of these in parallel, if that would improve our throughput.
Given that we've got billions of rows coming from a single DB, we're looking to maximize performance here, thus the interest in this topic.
Instead we are breaking up the blocks in the table and spreading them across all the mappers and doing ROWID range scans. So all mappers could be reading from all partitions - but they would only be reading part of each.
Since the ROWID range scans are not partition specific, wouldn't this cause Oracle to spawn 16 parallel queries (one per partition)?
Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), if we have num mappers == num partitions?
So the queries in all the other partitions would match nothing.
Just wanting to make sure I understand what's happening under the hood, here.
Splitting by PARTITION may provide slightly better performance, but we don't believe it would be a huge difference.
For maximum performance when pulling data, it seems like we'd want to run multiple Sqoops in parallel against the available partitions in a table.
That would require adding 'PARTITION <partition_name> to the select statement, something like:
select * from <table_name> PARTITION <partition_name> where <condition>;
1. Does this make sense, both for general Sqoop and specifically OraOop?
2. Is there a way to do this now, or would Sqoop (and OraOop) need to be extended?
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
Ken Krugler 2011-09-05, 23:41