About Search Engine Optimization - Ranking - Placement

Keep your data fit enough to survive

As an agency amasses data, its IT architects are likely to find problems with consistency. Some data elements are formatted one way, others formatted differently. Some information becomes outdated but is never erased. Some is wrong and never corrected. It’s a headache that only grows worse as databases expand and are aggregated.

As the vice president of information quality at Firstlogic Inc. of La Crosse, Wis., Frank Dravis is something of a guru on these matters. Firstlogic sells software that analyzes and improves the quality of enterprise data. The company got its start working government contracts and still counts among its customers the Commerce, Homeland Security and Labor departments, as well as the General Services Administration, House of Representatives and Postal Service.

Dravis helps organizations work through their data quality problems. He is a member of the International Association of Information and Data Quality and writes a blog at weblogs.firstlogic.com/dravis. Dravis holds a bachelor’s degree in computer science from National University in San Diego and is currently pursuing a master’s in business at the University of Wisconsin. He spoke to GCN associate writer Joab Jackson by phone.

GCN: How do you define data quality?
Dravis: Data quality is fitness for use. It is how well your data supports your own business rules and operations.

GCN: How important is data formatting to sharing data?
Dravis: How can you expect an agency to share information internally or across other agencies if it doesn’t meet some common formatting standards? It won’t be immediately useful if it doesn’t meet some standard.

The information gets thrown over the fence, and the people who catch it have to put in place their own [extract, transform and load] system, and that is [money spent on] a lot of non-value-added work. It is gumming up the whole information pipeline. The greater the formatting problems, the greater the likelihood you’re not going to go back to the source and ask for that information again.

GCN: What are some common formatting errors?
Dravis: Dates have common formatting errors. There are so many ways to enter dates. Are you using dashes, slashes or periods? If you are merging data together and one data source uses slashes and another uses periods, it can be confusing to people using the data. While they may be able to decipher the dates, sooner or later it slows the whole process down.
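As a rough illustration of the separator problem, the sketch below normalizes dates that arrive with dashes, slashes or periods into a single ISO 8601 form. It is not Firstlogic's method. The function name `normalize_date` is hypothetical, and the code assumes every source writes month, day, then a four-digit year, so that only the separator differs.

```python
import re
from datetime import date

# Assumption: every source uses month, day, four-digit year order;
# only the separator (dash, slash or period) varies.
_DATE = re.compile(r"^\s*(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})\s*$")


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if it cannot be read."""
    match = _DATE.match(raw)
    if match is None:
        raise ValueError(f"unrecognized date format: {raw!r}")
    month, day, year = (int(part) for part in match.groups())
    # date() checks the ranges, so an impossible date such as 13/45/2005
    # is rejected here instead of being stored.
    return date(year, month, day).isoformat()


if __name__ == "__main__":
    for text in ("3/7/2005", "03-07-2005", "3.7.2005"):
        print(text, "->", normalize_date(text))  # each prints 2005-03-07
```

If the sources disagree on field order as well (day first versus month first), the separator alone cannot resolve it, and each source would need its own declared format.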

Part numbers are inherently problematic. Again, some people want to use slashes and dashes, but maybe over time they replace them with spaces. Then later, they concatenate the fields together wherever there is a space. All of a sudden, there are nine-character part numbers where there should be 10-character part numbers. They took the dashes out, replaced them with spaces and then slammed them together. Classic stuff.
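The part-number drift described above can be sketched in a few lines. The rule used here is invented for illustration: a valid part number is four digits, a dash, then five digits, 10 characters in all. The names `match_key` and `repair_part_number` are likewise hypothetical. The idea is to compare records on a separator-free key but always store the canonical 10-character form, and to flag anything that cannot be repaired rather than guess.

```python
import re
from typing import Optional

# Assumption (illustrative only): a valid part number is NNNN-NNNNN,
# that is, 9 digits plus one dash, 10 characters in total.
_SEPARATOR = re.compile(r"[-/. ]")


def match_key(raw: str) -> str:
    """Separator-free key, used only to compare records, never to store them."""
    return _SEPARATOR.sub("", raw.strip())


def repair_part_number(raw: str) -> Optional[str]:
    """Return the canonical 10-character form, or None if the value needs review."""
    key = match_key(raw)
    if len(key) == 9 and key.isdigit():
        # Restore the dash at its fixed position.
        return f"{key[:4]}-{key[4:]}"
    return None


if __name__ == "__main__":
    for value in ("1234-56789", "1234 56789", "123456789", "12345678"):
        print(repr(value), "->", repair_part_number(value))
```

Restoring the separator is only safe because the assumed format has fixed positions. Where segment lengths vary, a collapsed value cannot be split reliably and should go back to the source.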

GCN: How did Firstlogic get started?
Dravis: We started with contracts with the Postal Service. We provided an address assignment technology that was loaded into multiline optical character recognition systems. That’s how I got my start here; I was a ZIP-plus-four assignment engineer. I wrote address assignment algorithms and matching algorithms. As the mail pieces flew by, the little camera took a picture of each envelope and sent it to [our software, which] deciphered the characters. It looked the addresses up in our address database and then supplied the bar code to spray on the mail piece. The mail piece could then go into the automation mail stream.

Now that was a data quality application. A lot of times the address would be slightly askew, or radically askew, and it didn’t match the address database. So you had to do some fuzzy matching logic to find out what was close, and once the confidence was above a certain threshold, you could say this is the real address. That was the genesis. This was 20 years ago. I remember when my boss came up to me and said, “Frank, we’re doing address cleansing, and we need to do name cleansing. It should be a short step.” So we developed a name-cleansing, standardization and formatting algorithm.
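The threshold idea Dravis describes can be shown with a minimal sketch. It does not reproduce Firstlogic's algorithm: it substitutes the general-purpose similarity ratio from Python's standard `difflib` module, and the function name `best_address_match`, the 0.85 threshold and the sample addresses are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def _normalize(text: str) -> str:
    """Uppercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.upper().split())


def best_address_match(
    raw: str, reference: Iterable[str], threshold: float = 0.85
) -> Optional[Tuple[str, float]]:
    """Return (reference address, score) for the closest entry, or None
    when even the closest entry scores below the confidence threshold."""
    query = _normalize(raw)
    best, best_score = None, 0.0
    for candidate in reference:
        # ratio() is a similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, query, _normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best is not None and best_score >= threshold:
        return best, best_score
    return None


if __name__ == "__main__":
    # Made-up reference addresses standing in for an address database.
    database = ["100 HARBORVIEW PLAZA", "332 FRONT ST S", "500 MAIN ST"]
    print(best_address_match("332 Frnt St S", database))   # close: accepted
    print(best_address_match("9 Unknown Road", database))  # too far: None
```

A production matcher would compare parsed components (house number, street name, ZIP code) rather than whole strings, but the accept-or-reject decision against a confidence threshold works the same way.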

Addresses took us to names, names took us to matching, matching took us to consolidation. Wherever our customer had a data quality problem, they dragged us into that field. And so that is why our solution works on operational data.

GCN: And how has the technology evolved?
Dravis: Early on, customers would come to us and say they need an address-cleansing solution. We’d