How Dirty is Your Data?

By Tony DePrato | Follow me on Twitter @tdeprato

My basic rule for data is, unless there is a life and death scenario unfolding, bad or unclean data is not going to be used. I have yet to encounter a situation where releasing data, which will eventually wreak havoc throughout the school, is an essential and lifesaving endeavor. Delaying systems access due to data issues is difficult. Even the smallest of systems have vocal advocates who will passionately state the damage being done to learning for every day a system is offline.

The best way to exist in a data-driven environment is to be prepared. Being prepared means being aware. Awareness comes from checking all core databases regularly (I would argue monthly) and from having policies and procedures for correcting problems.

The real question is this: how does someone who is not directly involved in data management check data? And how does someone who is an end user of data set policies to protect the data they need?

Validate and Verify

Anyone dealing with an IT manager, Technology Director, School Information System Specialist, or even a Business Manager should know about validation and verification.

When you validate, you are making certain the contents and format are correct. For example, if I ask you to type your name, and you instead type your phone number, then your entry will not be validated. Your data is invalid.
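For readers who want to see what a validation rule looks like in practice, here is a minimal sketch in Python. The rule itself (a name is made of letters, optionally joined by single spaces, hyphens, or apostrophes) and the function name is_valid_name are assumptions made for illustration; a real system would define its own rules.

```python
import re

# Assumed rule for illustration: a name is one or more groups of letters,
# joined by single spaces, hyphens, or apostrophes. Digits are not allowed.
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:[ '\-][^\W\d_]+)*")

def is_valid_name(value: str) -> bool:
    """Return True if the value looks like a name rather than, say, a phone number."""
    return NAME_PATTERN.fullmatch(value.strip()) is not None

print(is_valid_name("Tony DePrato"))   # a name passes the rule
print(is_valid_name("555-123-4567"))   # a phone number fails the rule
```

The check happens at the moment of entry, before the value is ever stored. That is what separates validation from verification.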

When you verify, you are looking at data that is already in a system to see if it is correct and in the correct place. For example, when you check a list of student names, you may find all middle names are part of the first name. If this was not by design, you could determine that the data failed verification.
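The middle-name example can be sketched as a simple verification check. The field names first_name and last_name and the sample records below are hypothetical. Some first names legitimately contain two words, so the result is a list of questions to ask, not a list of errors.

```python
def flag_possible_middle_names(records):
    """Return records whose first_name field contains more than one word."""
    return [r for r in records if len(r["first_name"].split()) > 1]

# Hypothetical sample records, as they might look after export from a system.
students = [
    {"first_name": "Ana Maria", "last_name": "Lopez"},
    {"first_name": "Ben", "last_name": "Chen"},
]

flagged = flag_possible_middle_names(students)
for record in flagged:
    print("Check with the data manager:", record)
```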

When dealing with assets we often want to verify that what we ordered is what we received. When this type of questioning occurs, verification is happening.

Most people are hit with validation constantly while using the internet. Validation is ubiquitous. Websites often ask you to enter answers to questions, passwords, and CAPTCHA to validate your actions.

These two concepts, validation and verification, are the main tools that help people who do not manage data engage and work with data managers.

Data Auditing

Many people will start auditing data by requesting a spreadsheet of data. This is a mistake.
Data without context is very difficult to understand.

The first step in auditing is simple. Using questions, learn how the data is validated:

  1. Where does a new record come from? (Paper + Manual Entry, Online form, Over the Phone + Manual Entry, Software running on computers at the school, etc.)
  2. How are errors prevented and checked?
  3. May I see the…form, paper, call script?
  4. May I do a sample and test the process?
  5. How do I know, after I complete the process, what data was collected?

I often find people are blown away by the time they get to question 5. They are either shocked at how amazing the system is, or appalled at the shortcomings. Since most data in schools either comes from families or is connected to student assessment, shortcomings do not sit well.

There is nothing to fix at this point. Even if there is a strong belief the system needs to be changed, change should always be data driven, and in this case, driven by the data quality. Until the data is actually reviewed, pause any immediate desire to change things.

The next step is to verify the data, and this can be done by requesting spreadsheets. If the school has a school information system (SIS) like PowerSchool, iSams, Blackbaud, etc., the first set of data needs to come from the SIS. This data should be the primary set used to create accounts in other systems.

Before asking for data, fields must be specified. For example, full name, date of birth, mother’s email address, etc. Be as specific as possible. When people are not specific, data managers take fields and manipulate them. You should be looking at raw data, not data that is filtered and/or edited.

When scanning the data from the SIS, after knowing how the data was collected, errors should start jumping out. If anything seems weird, make a note of it for further discussion. This process usually reveals patterns, such as everyone having the same ZIP code (yes, that happened to me).
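A pattern like the shared ZIP code can also be surfaced with a short script. The sketch below assumes the SIS export is a .csv file with a header row; the column names, the sample rows, and the 90% threshold are all made up for illustration.

```python
import csv
import io
from collections import Counter

def dominant_values(rows, threshold=0.9):
    """Map each column to (value, share) when one value fills at least `threshold` of the rows."""
    rows = list(rows)
    suspicious = {}
    if not rows:
        return suspicious
    for column in rows[0].keys():
        counts = Counter((row[column] or "").strip() for row in rows)
        value, count = counts.most_common(1)[0]
        share = count / len(rows)
        if share >= threshold:
            suspicious[column] = (value, share)
    return suspicious

# Hypothetical export; with a real file, use open("export.csv", newline="") instead.
sample = io.StringIO(
    "full_name,zip_code\n"
    "Ana Lopez,10001\n"
    "Ben Chen,10001\n"
    "Cara Diaz,10001\n"
)
report = dominant_values(csv.DictReader(sample))
print(report)
```

A flagged column is not proof of an error. A small school may really draw every family from one ZIP code. It is simply a note to raise in the follow-up discussion.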

The secondary systems such as Moodle, Accelerated Reader, Discovery Streaming, etc. can have their data exported to be checked as well. These systems often export a .csv file. Don’t worry. Excel, Numbers, and LibreOffice can open .csv files. After the file is open, save it as Excel so that it is easier to work with.
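Because the SIS should be the primary set, one useful check is to compare a secondary export against it. The sketch below assumes both exports share a student_id column. That column name, the sample rows, and the helper ids_in are hypothetical, and real exports may need a different matching field such as an email address.

```python
import csv
import io

def ids_in(csv_file, column="student_id"):
    """Collect the non-empty values of one column from a .csv file."""
    return {row[column].strip() for row in csv.DictReader(csv_file) if row.get(column)}

# Hypothetical exports; with real files, use open("sis.csv", newline="") and so on.
sis_export = io.StringIO("student_id,full_name\n1001,Ana Lopez\n1002,Ben Chen\n")
moodle_export = io.StringIO("student_id,username\n1002,bchen\n1003,cdiaz\n")

sis_ids = ids_in(sis_export)
secondary_ids = ids_in(moodle_export)

missing_from_secondary = sis_ids - secondary_ids  # in the SIS, but no account in the other system
unknown_to_sis = secondary_ids - sis_ids          # account exists, but no matching SIS record

print("In SIS only:", sorted(missing_from_secondary))
print("In secondary system only:", sorted(unknown_to_sis))
```

Either list is a reason to ask how the account was created, not a verdict that something is wrong.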

Remember, it is not about being certain, it is about being suspicious and asking questions around those suspicions.

 
