Readahead File Parsing in PowerShell

PowerShell has quickly grown into one of my favorite scripting languages because of its flexibility and near-seamless integration with the .NET Framework. I often use it for simple tasks, one of which is parsing simple text files.

Have you ever needed to look a few lines ahead in a file before writing its contents out, perhaps to determine what the file should be named? Then this is the article for you!

Concept

The concept is incredibly simple. We’re going to use a queue to temporarily store lines before writing them out.
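To make the idea concrete, here is a minimal sketch of how a queue behaves: items come out in the same order they went in. The variable names are illustrative only and are not part of the script built in this article.

```powershell
# Minimal first-in, first-out demonstration using System.Collections.Queue.
$demoQueue = New-Object System.Collections.Queue

$demoQueue.Enqueue("line 1")
$demoQueue.Enqueue("line 2")
$demoQueue.Enqueue("line 3")

# Peek returns the oldest item without removing it.
$oldest = $demoQueue.Peek()

# Dequeue removes and returns the oldest item.
$first = $demoQueue.Dequeue()

# Two items are left in the queue: "line 2" and "line 3".
$remaining = $demoQueue.Count
```

Because order is preserved, lines held back while we read ahead can later be written out exactly as they appeared in the input.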

Adding Flexibility With Parameters

We’ll be using some parameters to allow more flexibility when running the script:

[cmdletbinding()]
param (
    [String]
    [Parameter(Mandatory=$true)]
    [ValidateScript({Test-Path $_ -PathType Leaf})]
    $InputFile,

    [String]
    [Parameter(Mandatory=$true)]
    [ValidateScript({Test-Path $_ -PathType Container})]
    $OutputFolder
)

You can learn more about PowerShell parameters in this TechNet article. These parameters help make our script reusable, which is always a good thing, because you never know what might find its way into your work queue next!
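As a rough illustration of what the ValidateScript attribute buys us, the sketch below wraps the same validation in a throwaway function (Test-InputFileParameter is a hypothetical name used only for this demo) and calls it with one path that exists and one that does not.

```powershell
# Hypothetical demo function; it reuses the validation from the param block above.
function Test-InputFileParameter {
    [CmdletBinding()]
    param (
        [String]
        [Parameter(Mandatory=$true)]
        [ValidateScript({Test-Path $_ -PathType Leaf})]
        $InputFile
    )

    "Accepted: $InputFile"
}

# One file that exists and one path that (almost certainly) does not.
$existingFile = [System.IO.Path]::GetTempFileName()
$missingFile  = Join-Path ([System.IO.Path]::GetTempPath()) "no-such-file-$([guid]::NewGuid()).txt"

$accepted = Test-InputFileParameter -InputFile $existingFile

# A failed validation surfaces as a terminating error, so it can be caught.
$rejected = $false
try {
    Test-InputFileParameter -InputFile $missingFile | Out-Null
}
catch {
    $rejected = $true
}

Remove-Item $existingFile
```

The function body never runs for the missing file, which means the rest of the script can assume both paths are valid.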

The Basics

We’ll create a System.Collections.Queue object to store the lines we need to read in advance, and a System.IO.StreamReader to actually read those lines:

$oQueue = New-Object System.Collections.Queue
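# Resolve to a full path first: .NET resolves relative paths against the process
# working directory, which may differ from PowerShell's current location.
$oFileStream = New-Object System.IO.StreamReader (Resolve-Path $InputFile).ProviderPath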

It may be a little mind-bending at first, as the queue adds an element of complexity, but it isn’t bad when broken down:

  1. Loop until the end of the input file is reached.
  2. Check each line for a special marker or formatting, writing out the queue when a new marker is found.
  3. Read ahead as many lines as needed to obtain the necessary information, adding each line read to the queue.
  4. Add every other line to the queue for later.
  5. After the file has ended, the queue may still hold lines, which should be written out as well.

Let’s get to the example:

$sTitle = ""
$oOutputFile = $null

# Loop through each line of the file until we've reached the end
while (!$oFileStream.EndOfStream) {
    $sCurrentLine = $oFileStream.ReadLine()
    
    # We've found a new report, figure out a new title, and write the previous report
    if ($sCurrentLine.StartsWith("-- START OF REPORT --")) {
        # Make sure we have an open StreamWriter before dumping the queue.
        if ($null -ne $oOutputFile) {
            foreach ($line in $oQueue) {
                $oOutputFile.WriteLine($line)
            }

            # A foreach does not clear the queue, only iterates it.
            $oQueue.Clear()
        }

        $sNewTitle = ""
    
        # Read ahead to the next line and set the title
        $oQueue.Enqueue($sCurrentLine)
        $sCurrentLine = $oFileStream.ReadLine()
        $sReportTitle = ($sCurrentLine -replace "TITLE:\s?([A-Za-z0-9]+)",'$1').Trim()

        # Read ahead to the next line and set the report ID
        $oQueue.Enqueue($sCurrentLine)
        $sCurrentLine = $oFileStream.ReadLine()
        $sReportId = ($sCurrentLine -replace "ID:\s?([A-Za-z0-9]+)",'$1').Trim()
        
        # Put it all together
        $sNewTitle = "{0} - {1}" -f $sReportId, $sReportTitle

        # Close the old file if it exists
        if ($null -ne $oOutputFile) {
            $oOutputFile.Close()
        }

        # Create a new StreamWriter object
        $sFilename = $sNewTitle + ".txt"
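        # Use a full path so .NET does not resolve it against a different working directory
        $oOutputFile = New-Object System.IO.StreamWriter (Join-Path (Resolve-Path $OutputFolder).ProviderPath $sFilename)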
    }

    $oQueue.Enqueue($sCurrentLine)
}

In this example, we search the file for a special marker, -- START OF REPORT --, and then read ahead two lines to pick up the report title and ID. That information is then used to form a filename.
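To see how the -replace expressions turn header lines into a filename, here is a small sketch run against two sample lines. The sample values are made up for illustration; a real report may format these lines differently.

```powershell
# Hypothetical header lines, as they might appear right after the start marker.
$sampleTitleLine = "TITLE: Sales"
$sampleIdLine    = "ID: 1042"

# Same expressions as in the main loop: keep only the captured value.
$sampleTitle = ($sampleTitleLine -replace "TITLE:\s?([A-Za-z0-9]+)", '$1').Trim()
$sampleId    = ($sampleIdLine -replace "ID:\s?([A-Za-z0-9]+)", '$1').Trim()

# Combine ID and title the same way the script does.
$sampleFilename = ("{0} - {1}" -f $sampleId, $sampleTitle) + ".txt"
# $sampleFilename is now "1042 - Sales.txt"
```

Note that the replacement string uses single quotes so that PowerShell passes $1 through to the regex engine instead of expanding it as a variable.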

Notice that during this time, we also write our queue to a file if one exists. This makes the script more flexible as it allows for multiple reports in one file, but it also introduces another issue… What about the last report?!

Dealing With the Leftovers

To deal with those pesky leftovers, we simply write out whatever remains in the queue once the loop has ended:

if ($null -ne $oOutputFile) {
    foreach ($line in $oQueue) {
        $oOutputFile.WriteLine($line)
    }

    # A foreach does not clear the queue, only iterates it.
    $oQueue.Clear()
}
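Since the same flush logic now appears both inside the loop and after it, one option is to move it into a small helper. The sketch below is only a suggestion: Write-QueueToWriter is a hypothetical function name, and it drains the queue with Dequeue rather than iterating and then calling Clear.

```powershell
# Hypothetical helper: writes every queued line to the writer, emptying the queue.
function Write-QueueToWriter {
    param (
        [System.Collections.Queue] $Queue,
        [System.IO.TextWriter] $Writer
    )

    # Nothing to do if no output file has been opened yet.
    if ($null -eq $Writer) { return }

    # Dequeue removes each line as it is written, so no Clear() is needed.
    while ($Queue.Count -gt 0) {
        $Writer.WriteLine($Queue.Dequeue())
    }
}

# Demo with an in-memory writer standing in for the StreamWriter.
$demoQueue = New-Object System.Collections.Queue
$demoQueue.Enqueue("a")
$demoQueue.Enqueue("b")

$demoWriter = New-Object System.IO.StringWriter
Write-QueueToWriter -Queue $demoQueue -Writer $demoWriter
$demoText = $demoWriter.ToString()
```

A StreamWriter is also a TextWriter, so the same helper could be called with the script's output file in both places.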

Finishing Up the Script

It’s always good to close any open resources:

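# The output file is only opened once a report marker has been found
if ($null -ne $oOutputFile) {
    $oOutputFile.Close()
}
$oFileStream.Close()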
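If an error occurs partway through, the Close calls at the end of the script are never reached. One way to guard against that, sketched here against a throwaway temporary file rather than the script's own variables, is to wrap the work in try/finally:

```powershell
# Create a small temporary file to read; it stands in for the real input file.
$tempFile = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tempFile -Value "first line", "second line"

$reader = $null
$linesRead = @()

try {
    $reader = New-Object System.IO.StreamReader $tempFile
    while (!$reader.EndOfStream) {
        $linesRead += $reader.ReadLine()
    }
}
finally {
    # Runs whether or not the try block threw, so the handle is always released.
    if ($null -ne $reader) {
        $reader.Close()
    }
    Remove-Item $tempFile
}
```

The same pattern could wrap the main loop, with both the StreamReader and the current StreamWriter closed in the finally block.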

Other Reasons

This was just one example, but there are other compelling reasons to queue up data before processing. One is to reduce the I/O impact of many small reads and writes: by caching several lines of a file, writes to disk can be issued in larger sequential batches, making for smoother operation. It should be mentioned that the operating system probably does a better job of this on its own, so your mileage may vary, but it’s worth a try if performance is a priority.
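As a rough sketch of that batching idea, the snippet below queues lines and only writes them out once a batch has accumulated. The batch size of 3 and the generated sample lines are arbitrary assumptions for the demo. Keep in mind that StreamWriter already buffers internally, so any gain from this should be measured rather than assumed.

```powershell
$batchSize  = 3    # assumed value; tune for your own workload
$buffer     = New-Object System.Collections.Queue
$flushCount = 0

# Seven made-up lines standing in for real input.
$sampleLines = 1..7 | ForEach-Object { "line $_" }

$tempFile = [System.IO.Path]::GetTempFileName()
$writer = New-Object System.IO.StreamWriter $tempFile

try {
    foreach ($line in $sampleLines) {
        $buffer.Enqueue($line)

        # Write the whole batch in one go once enough lines have accumulated.
        if ($buffer.Count -ge $batchSize) {
            while ($buffer.Count -gt 0) { $writer.WriteLine($buffer.Dequeue()) }
            $flushCount++
        }
    }

    # Leftovers: anything short of a full batch still has to be written.
    while ($buffer.Count -gt 0) { $writer.WriteLine($buffer.Dequeue()) }
}
finally {
    $writer.Close()
}

$writtenLines = @(Get-Content $tempFile)
Remove-Item $tempFile
```

With seven lines and a batch size of three, two full batches are flushed inside the loop and one leftover line is written afterwards.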