I’ve been working with SSIS a lot and ran to a brick wall recently that involved the seeming conflict between load performance and handling l0ad-rejected rows on a row-by-row basis. Fortunately, I was able to design a solution for it and hopefully my experience can help you. Here is the situation I encountered:
I had a number of packages that each loads a SS2005 table from a (different) flat file. Part of the architecture for the system these packages belong to have a data quality exception process that logs each and every row that fails insertion (which subsequently triggers a report of the error).
An OLD DB Destination component handled the record insertions, which allowed me to redirect failing rows to an error file (instead of a SQL Server Destination, which does not).
So here is where the problem occurred. To make the OLE DB Destination perform at all efficiently, you usually set the data access mode to “fast load,” and set rows-per-batch and max-insert-commit-size parameters appropriately - which usually means numbers in the hundreds to the thousands range, depending on record width, etc.
That provides quick loading but poses a problem if you want to send single rejected rows to an error file because if a row fails, the entire batch it belongs to, say a 500 record batch, are all rolled back and sent to the error file. This means that in your error file you get 1 erroneous file and 499 valid records, which is obviously not what you want.
The obvious fix for this would be to set max insert commit to 1 (which means that batches of 1 record are committed at once) so that only the rejected rows get sent to the error file. But, of course, this creates a huge performance problem if you have large record sets (i.e., 100,000 rows); it can be nightmarishly slow.
Here’s how I solved the problem and maintained performance and single-row error handling. It’s really simple actually:
Create two stages of OLE DB Destination insertion. The first one is for performance and the second one is for row-level control. How does this work?
Send your rows to the first OLE DB Dest called “Attempt 1″ and set the max commit to something with good performance (500 rows or something). I set the batch to some multiple of that, like 1000.
Create a second OLE DB Dest called “Attempt 2″, and set the max commit to 1 and a batch equal to the max commit of Attempt 1. Then redirect all the error rows from Attempt 1 to Attempt 2. Route the error rows from Attempt 2 to your error file.
Source >> Transforms >> Attemp 1 >> Attempt 2 >> Error file (or other)
This works because: Assuming there are more valid rows in your source than invalid rows, your Attempt 1 will commit everything it can in large batches, which will perform very well. Say you have 100,000 records, committing at 500 at a time, you’re only doing 200 commits. Now, when an erroneous row is encountered, the 500 rows are sent to Attempt 2 - which commits one at a time, committing all of the valid rows in that batch, and redirecting the few invalid records. Let’s say you have 5 invalid rows in 5 different batches, then you’d send 5 * 500 = 2,500 rows to be committed one at a time - instead of 100,000.
To give you an idea of the kind performance improvement you can get using this method (versus committing all rows one at a time with a single OLE DB Destination), loading 1.6M rows with two attempts took 6.5 minutes, whereas committing one at a time took 3.5 HOURS. The two-attempt method is orders of magnitude faster and still allows me to capture errors on a row-by-row basis.
Happy plumbing!
Related posts:
Hello Ted, I received a post to one problem in msdn forum from you. Actually I’m reading your blogs; it’s very funny for me. So, do you have problems loading excel 2007 files? I have some problems in machine’s performance.
Hi Julian, thanks for reading the blog! I haven’t had the need to load Excel 2007 files yet. I do know when I tried to do it as a test, a driver for 2007 was not yet available. Have you tried to convert it to 2003 first and then loading it? If that works well, then it might have something to do with the XML format of .XLSX. Let me know how it goes.
Hi
Thanks for this article.
I have the same situation but would like to use the SQl server destination (instead of OLEDB destination).
The porblem is SQl server destination does not have an output path.
The reason for why I want to do this is that SS destination gives me th exact error reason(like Foreign key violation etc) while OLEDB destination doesnt.
Any thoughts on how to acheive this?
Thanks
kalyan
Kalyan,
I encountered the same problem with SQL Server Destination’s lack of output and, unfortunately, could not figure out how to overcome it. We are using the above method to load a data warehouse which does not have FK constraints defined (too much overhead) so we haven’t encountered your particular issue.
If you do find a work around, please let us know. It would be very useful.
Thanks!
Many thanks Ted, I’d been anguishing over this dilemma for ages since we started a project which will require loading large volumes of history data prior to the DW going live. It requires me to revisit all my tasks but hey it’s better than the alternative and it worked great for me.
Thanks again, Craig
Glad it was helpful, Craig. Let me know how it goes!