<# Purpose: Extract useful information from a database exported as multiple text files Notes: A little background: One of my company's clients had requested a large bit of information from one of the State Departments, and they promptly exported a 7-tabled database as a series of 1.9 Million-line text files. Seven of them, to be exact. Luckily, we only need data from three of them. Our options were to attempt to re-assemble the database (fixing errant linefeeds in the process), or to get what we needed and get out. We chose the latter. Re-assembling a DB is just to unwieldy for standard user. We were able to locate the foreign key in each line of the text file, and split that out to create what you will find below. This is a script that was written with assistance from one of the Geniuses I have the pleasure of working with. I had a functional script, but the memory utilization was too high. It took some time to figure out what logic was necessary, but twice as long to get the script stripped down to what you see here. Caveat(s): No known issues. This was built for a very specific function, but you might be able to glean a bit of structural information from it. #> Set-Location "C:\DataDump\" $RecordHash = @{} $SplitString = ",`"" Get-Content (".\Data\Address.txt") -ReadCount 1000 | ForEach-Object{ ($_ -replace '(?ms)([^"])\r\n','$1') -match 'Columbia' -split [environment]::NewLine | ForEach-Object{ $BusinessRecord = [string]$_ -split $SplitString $CompanyID = $BusinessRecord.Split(',')[1] $RecordHash[$CompanyID] = [PSCustomObject]@{ CompanyID = $CompanyID CompanyName = $null Date = "No date on file" Street1 = $BusinessRecord[1].trim('"') Street2 = $BusinessRecord[2].trim('"') Street3 = $BusinessRecord[3].trim('"') City = $BusinessRecord[4].trim('"') State = $BusinessRecord[5].trim('"') ZIP = $BusinessRecord[6].trim('"') } } } Filter AddCompanyName{ if($RecordHash.containskey($_.Split(',')[1])){ $RecordHash[$_.Split(',')[1]].CompanyName = ($_.Split(',')[2]).trim('"') } } Get-Content ".\Data\CorporationName.txt" -ReadCount 1000 | ForEach-Object { $_ | AddCompanyName } Filter AddDate{ if($RecordHash.containskey($_.Split(',')[1])){ $RecordHash[$_.Split(',')[1]].Date = Get-Date ([DateTime](($_.Split(',')[5]).trim('"'))) -Format "yyyy-MM-dd" } } Get-Content ".\Data\Filing.txt" -ReadCount 1000 | ForEach-Object { $_ | AddDate } <# This will return an array of objects in $RecordHash, indexed by the companyID - which was the foreign key in the text files. In our situation, we just pushed it out as an HTML file for the client to view. #>
Collecting Data from a Database Dump
Thursday, July 24, 2014
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment
Please keep all comments clean!