Collecting Data from a Database Dump

Thursday, July 24, 2014
<#
Purpose:    Extract useful information from a database exported as multiple text files
Notes:        A little background:  One of my company's clients had requested a large bit
            of information from one of the State Departments, and they promptly exported a
            7-tabled database as a series of 1.9 Million-line text files.  Seven of them, to
            be exact.  Luckily, we only need data from three of them.  Our options were to 
            attempt to re-assemble the database (fixing errant linefeeds in the process), or 
            to get what we needed and get out.  We chose the latter.  Re-assembling a DB is just
            to unwieldy for standard user.  We were able to locate the foreign key in each line
            of the text file, and split that out to create what you will find below.
            This is a script that was written with assistance from one of the Geniuses I
            have the pleasure of working with.  I had a functional script, but the memory
            utilization was too high.  It took some time to figure out what logic was necessary,
            but twice as long to get the script stripped down to what you see here.
Caveat(s):    No known issues.  This was built for a very specific function, but you might be able
            to glean a bit of structural information from it.
#>

Set-Location "C:\DataDump\"

$RecordHash = @{}
$SplitString = ",`""
Get-Content (".\Data\Address.txt") -ReadCount 1000 | ForEach-Object{
    ($_ -replace '(?ms)([^"])\r\n','$1') -match 'Columbia' -split [environment]::NewLine | ForEach-Object{
        $BusinessRecord = [string]$_ -split $SplitString
        $CompanyID = $BusinessRecord.Split(',')[1]
        $RecordHash[$CompanyID] = [PSCustomObject]@{
            CompanyID = $CompanyID
            CompanyName = $null
            Date = "No date on file"
            Street1 = $BusinessRecord[1].trim('"')
            Street2 = $BusinessRecord[2].trim('"')
            Street3 = $BusinessRecord[3].trim('"')
            City = $BusinessRecord[4].trim('"')
            State = $BusinessRecord[5].trim('"')
            ZIP = $BusinessRecord[6].trim('"')
            }
        } 
    }

Filter AddCompanyName{
    if($RecordHash.containskey($_.Split(',')[1])){
         $RecordHash[$_.Split(',')[1]].CompanyName = ($_.Split(',')[2]).trim('"')
        }
     }
Get-Content ".\Data\CorporationName.txt" -ReadCount 1000 | ForEach-Object {
     $_ | AddCompanyName
     }

Filter AddDate{
    if($RecordHash.containskey($_.Split(',')[1])){
     $RecordHash[$_.Split(',')[1]].Date = Get-Date ([DateTime](($_.Split(',')[5]).trim('"'))) -Format "yyyy-MM-dd"
     }
    }
Get-Content ".\Data\Filing.txt" -ReadCount 1000 | ForEach-Object {
     $_ | AddDate
     }
     
<#
    This will return an array of objects in $RecordHash, indexed by the companyID - which
    was the foreign key in the text files.  In our situation, we just pushed it out as an
    HTML file for the client to view.
#>

No comments:

Post a Comment

Please keep all comments clean!