Fast CSV Parsing With C#
When dealing with SQL Server Integration Services (SSIS), it is often difficult to add third-party libraries to script tasks. It is also practically impossible to import CSV files containing bad rows using the built-in Flat File Data Source. So, I decided to write a custom script that imports the file manually and filters out any rows with an invalid number of fields.
/// <summary>
/// Parse a single line in CSV format.
///
/// This method uses character-by-character looping in lieu of String.Split for performance.
/// You could squeeze more performance out of it by using .NET 4 MemoryMappedFiles for large
/// reads and possibly pointer arithmetic instead of native C# looping. Using a LinkedList may
/// also help for really large lines.
/// </summary>
/// <param name="line">The line to parse.</param>
/// <param name="delimiter">The delimiter that separates each value. Usually a comma.</param>
/// <param name="qualifier">The encapsulation character of a value. Usually a double quote.</param>
/// <returns>A list of strings.</returns>
public List<string> ParseCSV(string line, char delimiter, char qualifier)
{
    // The character buffer length (also the maximum field length)
    int iBufferLength = 512;
    // The return list
    List<string> oReturn = new List<string>();
    // Parser state
    bool bEncapsulated = false;
    char[] acBuffer = new char[iBufferLength];
    int iBuffer = 0;
    // Loop through each character
    for (int i = 0; i < line.Length; i++)
    {
        char cCurrent = line[i];
        // A qualifier toggles encapsulation and is never copied into the buffer
        if (cCurrent.Equals(qualifier))
        {
            bEncapsulated = !bEncapsulated;
            continue;
        }
        // An unencapsulated delimiter ends the current field
        if (cCurrent.Equals(delimiter) && !bEncapsulated)
        {
            oReturn.Add(new string(acBuffer, 0, iBuffer));
            iBuffer = 0;
            continue;
        }
        // Everything else is part of the current field
        acBuffer[iBuffer] = cCurrent;
        iBuffer++;
    }
    // Flush the final field; this also covers lines that end with a qualifier
    // or with a delimiter (a trailing delimiter yields a trailing empty field)
    oReturn.Add(new string(acBuffer, 0, iBuffer));
    return oReturn;
}
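To put the method to work in the SSIS scenario described above, here's a minimal usage sketch. The file path, delimiter, and expected column count are placeholders for whatever your feed actually uses (and File.ReadLines is .NET 4; use a StreamReader on older frameworks):

// Hypothetical usage: read the feed and keep only rows with the expected field count.
int expectedFieldCount = 5; // assumption: the feed has five columns
List<List<string>> goodRows = new List<List<string>>();
foreach (string line in System.IO.File.ReadLines(@"C:\feeds\input.csv")) // illustrative path
{
    List<string> fields = ParseCSV(line, ',', '"');
    if (fields.Count == expectedFieldCount)
    {
        goodRows.Add(fields); // rows with a bad field count are simply skipped
    }
}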
I’ve found that the typical approach to parsing a CSV file involves string splitting. While that method tends to be the easiest, it is certainly not performant, and it usually doesn’t take encapsulated fields (fields surrounded by quotes because they contain commas of their own) into account. Instead, I chose to do things at a lower level and parse the line character by character into a buffer.
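To make the encapsulation problem concrete, here's a quick comparison on a made-up sample line:

string line = "42,\"Smith, John\",NY";
// Naive splitting breaks the quoted field apart at its embedded comma:
string[] naive = line.Split(','); // "42" | "\"Smith" | " John\"" | "NY" -- four fields
// The character loop honors the qualifier instead:
List<string> parsed = ParseCSV(line, ',', '"'); // "42" | "Smith, John" | "NY" -- three fields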
Please note that if you have fields longer than 512 characters, this script will throw an IndexOutOfRangeException! You can increase that limit in the first line of the method’s body.
I only did some basic benchmarking on this code, but I found that it was on average 40% faster than the string split method when parsing 100,000 rows.
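For reference, a rough Stopwatch harness along these lines is all it takes to run that kind of comparison yourself; the sample line and iteration count are arbitrary, and your numbers will vary by machine:

System.Diagnostics.Stopwatch oTimer = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
    ParseCSV("one,\"two, three\",four,five", ',', '"');
}
oTimer.Stop();
Console.WriteLine("ParseCSV:     {0} ms", oTimer.ElapsedMilliseconds);

oTimer.Restart();
for (int i = 0; i < 100000; i++)
{
    "one,\"two, three\",four,five".Split(',');
}
oTimer.Stop();
Console.WriteLine("String.Split: {0} ms", oTimer.ElapsedMilliseconds);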
Feel free to use it however you’d like. If you have any suggestions on how to improve this code, please comment on this post or the Gist itself. I’d love to know your opinions!