You are a leader. You have responsibilities to fulfill. Information, as they say, is power: power to do your job well. But your information is only as good as the data at hand… and data is hard. Deceptively hard. As William Pollard, a famous physicist, puts it:
Do you ask questions and wait forever for the answers? Does your organization have the data, but can’t seem to extract the information? Do you find it difficult to trust that data wholeheartedly? You are not alone.
You are also not alone if you don’t realize that this issue is, at least partially, your responsibility. In today’s world, all leaders must be data managers, or pay the price in their organization’s success.
As your data guide, I will try to explain to you the basics of data management. I want to prepare you for success by helping you understand the magnitude and complexity of data management. Whether it’s big data or little data, a few basic truths can make you a little more data-wise.
Before we start, let’s define our terms.
A data system can be thought of in many ways. For this article, let’s assume we are talking about four main components:
- a computer program
- a data storage location
- the reporting engine
- the policies and procedures that determine its use
If this were a library, we would consider the entire library process to be a data system: the shelves, the book labels, and the procedures staff follow to administer the library. The books are the actual data of the system. The reason the entire system exists is to store, retrieve, organize, and steward books. The purpose of your organization’s data systems is the same.
1. Data is easier to destroy than to create.
This is the first thing you need to know as a manager of data. Imagine we are in a library. We’ve been wandering a fair bit looking for a good book and enjoying conversation with our friends. While searching for The Adventures of Huckleberry Finn, you accidentally trip and knock over a shelf, which dominoes until the entire library is a mess of books and disorder. One simple act just created a mess that will take a tremendous amount of effort to fix.
Organized data is fragile, and once you have disorganized it, the only way to put it together again is to handle every book, carefully returning it to the spot it belongs.
This is true of the digital world as well. You can easily—and permanently—destroy your organization’s library of data by careless or oblivious actions. In fact, decision makers and leaders can and do destroy libraries of digital data in just the same way as I described above. They are the catalyst of destruction, but are totally oblivious to this fact, because it’s not as visible as our library example. We’ll discuss how this can happen and tricks to avoid issues below.
Managers who do this usually just don’t know any better. Maybe you don’t know any better, either. After all, this is digital information, so it can be sorted, remixed, and reorganized more quickly than those physical library books, right? Well, not always. In some cases, every record still has to be examined, and depending on the amount of data, that can take a really long time, even when it’s digital.
2. Without careful planning, new systems will corrupt your data.
If we continue to think of your organization’s data system as a library, and you are the library director, then what can you do to run an efficient operation? Well, you have a lot of control and power, actually. You can hire people to sort books, put books on shelves, and even check to see if books are in the wrong place. But something you should not do without great care is to change the system that organizes the books. Doing so would require you to reorganize the whole library. More importantly, it would require you to know a better system than the current one—something best left to experts and not done without careful consideration.
The key thing to understand is that introducing new systems can corrupt your data. Many people love to contemplate and explore new data systems, and you shouldn’t squash this exploration. But adoption is costly, and possibly destructive. It’s your job to protect the integrity of your data, and to explain to people the true cost of introducing new data systems into your environment.
There is nothing more destructive than two librarians with different filing systems working in the same library. Or a new filing system in a library that is abandoned halfway, leaving two systems in its wake. Because of this, everyone must learn to appreciate the power and potential costs of system change and competing systems. This dual-system point is so important, we’ll take a moment to explore it in more detail.
3. Running parallel systems is bad.
In an effort to seek middle ground and please many voices, managers can be tempted to allow staff to use whatever systems are best for the job. After all, they know what’s best, don’t they? They probably do know what is best for their situation, but managers must take an enterprise view.
A manager must look at the whole and determine a healthy path for the entire organization and its data. As a result of compromise, many businesses have the same data in two systems: sometimes exactly the same, but more often, mostly the same. Like two librarians working in the same space with different methods, this can create complexities that can’t be resolved without great effort. The effort required to fix this situation keeps growing the longer it goes on, because the two copies drift further apart with every change.
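To make "mostly the same" concrete, here is a minimal sketch in Python. The two systems (`crm` and `billing`), their records, and the `compare_systems` helper are all hypothetical, invented purely for illustration.

```python
# Hypothetical records for the same customers held in two separate systems.
crm = {
    101: {"name": "Ada Lovelace", "email": "ada@example.com"},
    102: {"name": "Alan Turing", "email": "alan@example.com"},
    103: {"name": "Grace Hopper", "email": "grace@example.com"},
}
billing = {
    101: {"name": "Ada Lovelace", "email": "ada@example.com"},
    102: {"name": "Alan Turing", "email": "a.turing@example.com"},  # edited in one system only
    104: {"name": "Edgar Codd", "email": "ted@example.com"},        # entered in one system only
}


def compare_systems(a, b):
    """Return ids missing from each side and ids whose records disagree."""
    only_in_a = sorted(a.keys() - b.keys())
    only_in_b = sorted(b.keys() - a.keys())
    conflicting = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return only_in_a, only_in_b, conflicting


only_crm, only_billing, conflicts = compare_systems(crm, billing)
print("Only in CRM:", only_crm)
print("Only in billing:", only_billing)
print("Disagree:", conflicts)
```

The code can list where the two systems disagree, but it cannot say which copy is right. That decision still needs a person, which is where the real cost lies.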
The other end of the spectrum is to say that data systems can co-exist. Technology can be employed to easily share data from one system to another. It’s an attractive story, but can be a recipe for trouble, because exchanging information, like summiting a mountain, is easy to talk about and hard to do… which leads into our next point.
4. Moving data between two systems is really, really hard.
Imagine, in our library, that you have decided to convert all the books to a new organizing principle. You need to file every book differently now, a complicated and expensive task for sure. This is hard, but obviously not impossible—just a little elbow grease and proper management and you can reshuffle all your books around. The content of the books remains the same; maybe just the sticker on it is changed. So, this is a hard but not insurmountable conversion from one system to the other.
Now imagine you decide to convert all the books from novel to poem form. Well, now we have a problem. It would take ages to do this task. And to do it right? How do you translate the subtlety of the language? The humor? The riddles? The jokes? The plot? No. It’s not possible to do this well; the best you can do is a best-effort translation. And it is something experts get degrees in.
The same is true in the digital world. Say you decide to convert your lovingly written blog into tweets. You can do it, but most of your meaning and all of the art you put into the writing will be lost. You simply cannot translate vast amounts of information from one system to another without loss. Almost all systems speak different languages. An exceptional amount of effort will buy you only low-fidelity data.
You might point to interchange standards and ask, “What about XML? What about XSLT, and all those buzzwords that talk about exchanging information? Surely this problem is getting easier, as many people claim.” Well, indeed they do claim that, and indeed these standards do help simplify the problem. But we are still very far away from a dependable, affordable solution, and it all depends on the complexity of the data transferred.
In many cases the systems are what mathematicians call orthogonal. This is a fancy way to say that one system cannot describe the other. For example, could you translate chess to checkers? It doesn’t make sense; the games cannot be translated into each other. Such is the problem with computer and data systems. You can describe the rules of both games with XML, but you still cannot translate one into the other. How do you even approximate chess in checkers? Maybe there is a way, but it requires judgment and analysis, and it accepts a loss of fidelity that only humans and management can weigh. It might always be a manual process. Such is the problem with system integration: it’s hard!
Of course, how your data is structured can greatly impact the costs and/or possibility of translation. Structure can make it impossible or possible to move information between systems.
5. Structure is your best friend or worst enemy.
Accounting has a brilliant double-entry bookkeeping system that structures data in such a way that you can find errors and identify problems much more easily than before it was invented. The structure of modern-day accounting lends itself to fewer errors. Similarly, the structure of data is key to strong data for your organization.
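The error-catching structure of accounting can be sketched in a few lines of Python. In double-entry bookkeeping, every transaction's debits must equal its credits, so an imbalance signals a mistake. The journal entries and the `unbalanced_transactions` helper below are hypothetical, made up for illustration only.

```python
from collections import defaultdict

# Hypothetical journal lines: (transaction id, account, debit, credit), in cents.
journal = [
    ("T1", "Cash", 50000, 0),
    ("T1", "Sales", 0, 50000),
    ("T2", "Supplies", 12000, 0),
    ("T2", "Cash", 0, 21000),  # transposed digits: should have been 12000
]


def unbalanced_transactions(lines):
    """Return {transaction id: debits minus credits} for entries that do not balance."""
    totals = defaultdict(int)
    for txn_id, _account, debit, credit in lines:
        totals[txn_id] += debit - credit
    return {txn_id: diff for txn_id, diff in totals.items() if diff != 0}


errors = unbalanced_transactions(journal)
print(errors)
```

Because the structure demands that each transaction balance, the typo in T2 surfaces immediately. A single list of amounts with no such rule would have hidden it.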
How do we know how to structure data well? Luckily, a brilliant man working at IBM and trained at Oxford compiled several rules for data structures that advanced the field remarkably. His name was Edgar Frank “Ted” Codd, and he recorded his rules in a 1970 paper, “A Relational Model of Data for Large Shared Data Banks.” This work founded the relational model that underpins much of modern data science, and many other brilliant people have continued it.
It takes people years to understand the concepts and subtleties of data science and data normalization. The field is always changing, too, so it takes effort to stay current. As an organization’s leader, your job is to find someone that understands this magic art. You can’t wing it. You can’t feel your way through organizing your data any more than you can perform surgery without proper study.
If your data is bad, chances are your structure is bad. There is no amount of effort you can apply to fix bad data if a structure is broken. Much like the foundation of a house, if it’s right, you don’t have to think about it, but if it’s wrong, you can’t ever build anything that lasts on top of it. Remember, fix your structure first, and then train your staff to respect and curate it, but never to change the system, just as if you were running a library.
Before we talk about introducing new data systems, let’s examine another option first: using someone else’s system. This option has many benefits to explore.
6. Use other people’s data whenever possible.
There is a wealth of data sources online and/or for purchase. Data.gov has an extensive library of free information. Deciding to keep, curate, and maintain your own data system can be expensive, so many high-performing organizations tap into the data that’s already out there. Perhaps you want to use Twitter to record how many people mention your product, or use LinkedIn for information about the types of credentials your employees have.
It’s not economical to control all the data in the world, and why should you? Before you consider what data you need to control, always consider if you can leave it out of your control but access it reliably anyway. It’s a great way to save costs (although the data owner can always start charging you to use it). But, assuming that you cannot beg, borrow, or steal your data, and that you need to introduce a new system, what should you do?
7. How to consolidate, reduce or introduce new systems.
All of this discussion has focused on being careful: to avoid action and reduce changes to data systems, or simply sidestep the problem by sourcing data from somewhere else. Hopefully, by now you have a healthy fear of systems change. But now that you know the risks and have carefully weighed your options, it still may be appropriate to introduce a new data system. Perhaps you want to consolidate multiple systems into one, or introduce reporting across multiple systems. Every challenge you face is unique to you and requires analysis of your environment. You should hire a professional to help you with this, one with strong qualifications in data management.
Data scientists and other experts can guide your project to success. But you need to stay aware of the high cost of translation, and be prepared to hear that accurate translation may be simply impossible. You must be ready to lose fidelity of data. That’s why many people just start fresh with a new system, rather than attempt to transfer information from one to another; many times it’s cheaper and safer to start over. Never take that option off the table until you get the full costs. And never trust anyone to give you a full accounting of data-transforming costs except a data professional.
As a manager, you should strongly consider requiring a formal approval process before anyone introduces a new data system into your organization, because once it’s introduced, you can’t go back, and the worst-case scenario is that you turn all your data to garbage. Let’s examine this problem in more detail by looking at the “garbage in, garbage out” principle.
8. The “Garbage In, Garbage Out” principle
Every computer scientist knows the “Garbage In, Garbage Out” principle. Simply put, you can’t get intelligible information out of bad information. There is no magic way to organize bad data into good. Asking questions of garbage only yields more trash. But there is a subtlety to the GIGO principle that every manager should know: introducing garbage into a good system will create “garbage out.” Most data is only in a clean or dirty state: either you can trust it, or you can’t.
Much like poisoning a pristine water well, if you inject bad data into a system, you have dirtied the system, and must figure out how to separate and extract the toxic elements. Sometimes it’s easier to just start over altogether, and sometimes information is lost forever. So what do you do once you realize your data is bad?
9. Identifying bad data is easy. Fixing it is expensive.
Do you ever wonder how your computer can know that it needs to run a disk check, but the check itself can take hours? That’s because knowing something is wrong is far less complex than finding it and fixing it. Data is curious that way: it is easy to spot an error, such as pulling up one personnel record out of 100 and seeing that the first and last names are switched.
You know there is a problem, but fixing it can take examining all 100 records. Do you know which are switched? Are all the last names and first names incorrect? Time to look at every record. Sometimes things are even more complicated, and you must compare each record to every other record. That’s up to 100×100, or 10,000, checks. In a big system, this can be nearly impossible. So your data sets can be thought of as ‘dirty’ or ‘clean.’ You can either trust their output, or you can’t. Once something is dirty, the damage can be irreversible, which is why a disk corruption is usually fixed by rolling the data back to a prior good working copy: a backup.
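The arithmetic behind that jump is worth seeing at scale. The sketch below is illustrative only, and the function names are made up for this example. It compares three workloads: looking at each record once, the rough 100×100 count used above, and the exact number of unique pairs if each pair of records only needs to be compared once.

```python
from math import comb


def single_pass_checks(n):
    """Look at every record once."""
    return n


def all_ordered_pairs(n):
    """The rough n-times-n count used in the text."""
    return n * n


def unique_pairs(n):
    """Each pair of distinct records compared exactly once: n * (n - 1) / 2."""
    return comb(n, 2)


for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,} records: "
          f"{single_pass_checks(n):>9,} single checks, "
          f"{unique_pairs(n):>15,} unique pairs, "
          f"{all_ordered_pairs(n):>17,} n-by-n checks")
```

For 100 records, that is 4,950 unique pairs, or 10,000 by the rougher count. Either way, the work grows with the square of the number of records: multiply the records by 100 and the comparisons grow by roughly 10,000.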
10. Working with Data
Once your data system is strong, there are a few basic things to know about how to go about using it. I’ve chosen three to highlight here.
a) Protecting your data can make it impossible to analyze.
The need for backup mentioned in #9 brings up a huge issue: how to safeguard and preserve the data you control. Although that is beyond the scope of this article, it does bring up one temptation to mention: “to keep the data safe, we’ll just encrypt it.” This is a double-edged sword.
When you encrypt data, your ability to ask questions of that data drops to nearly zero for as long as it stays encrypted. LinkedIn cannot open up their database and tell you how many people used the word ‘taco’ in their passwords; that data is encrypted for a good reason. It needs to stay out of the hands of hackers and thieves. So encryption also has a downside: you can no longer analyze the protected data. Be careful when you ask to encrypt, because you are also shutting out analysis down the road.
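A small Python sketch shows the trade-off. Note the substitution: it uses a one-way hash (SHA-256 from Python's standard `hashlib` module) as a stand-in for "protected" storage, since that is a common way to protect passwords. It is not reversible encryption, and real password storage would use a salted, purpose-built algorithm. The sample passwords and the `protect` helper are hypothetical.

```python
import hashlib


def protect(value: str) -> str:
    """One-way scramble of a value (illustrative only, not production password storage)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical passwords, as they exist before and after protection.
passwords = ["taco123", "ilovetacos", "hunter2"]
stored = [protect(p) for p in passwords]

# Easy to answer on the readable data...
plain_hits = sum("taco" in p for p in passwords)

# ...but the same question is meaningless on the protected data.
stored_hits = sum("taco" in s for s in stored)

# The only question left is "is this exact value present?"
exact_match = protect("hunter2") in stored

print(plain_hits, stored_hits, exact_match)
```

Once the values are protected, the "contains taco" question can no longer be answered. Only an exact-match check survives.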
b) Know how much your questions cost to answer.
Most database administrators (DBAs) will tell you they can answer almost any question you ask from data. But the cost of every answer depends on several factors: the question you are trying to answer, the acquisition and organization of the data, and the processing power required. You should always consider the first factor first. Determining the question you need to answer is usually the cheapest part of the process, allows you to dream big, and informs all the other steps.
Here’s a process I recommend. Create a spreadsheet with one row for each of the top 10 questions you want answered by your data.
As a manager, your job is to fill out the first column: what is the information you seek? Then have your team fill out the rest.
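One possible starting point for that spreadsheet is sketched below in Python. The column names are assumptions drawn from the cost factors listed earlier in this section, not a prescribed layout, and the sample questions and the `build_template` helper are hypothetical.

```python
import csv
import io

# Assumed columns: the first is for the manager, the rest are for the team.
COLUMNS = [
    "Question",
    "Data required",
    "How the data is acquired and organized",
    "Processing power required",
    "Estimated cost",
]


def build_template(questions):
    """Return CSV text with one row per question and blank cells for the team."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for question in questions:
        writer.writerow([question] + [""] * (len(COLUMNS) - 1))
    return buffer.getvalue()


# Hypothetical questions a manager might ask.
template = build_template([
    "Which products do repeat customers buy most often?",
    "How long does an average support ticket stay open?",
])
print(template)
```

Saving the output to a `.csv` file lets it open in any spreadsheet program, ready for the team to fill in.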
Once your team provides you that information, you’ll know the true cost of answering your questions. Maybe an answer would mean you have to introduce a new data collection system; maybe you have to use a supercomputer. Maybe you just run the question in your database and it takes 30 seconds. Whatever the case, you should know the cost to your organization to answer your top questions. Then you can decide what you can afford to answer and which questions are too expensive right now. Of course, you also want to consider the format of your reports, which leads to our final point.
c) Use all your senses to see, hear, and evaluate your data.
Some patterns are easier to see than others. Many people can read faster than they can listen, but you can’t write down a description of the sound of a crowd faster than you can hear it. You can identify chess patterns much more easily by viewing the board than by reading a written description of the layout. Our senses perceive data differently and also perceive patterns differently. In order to get the most from your data, you should want to see it, hear it, and read it. Data is meant to be remixed and perceived from different angles. Open your mind to the possibilities of visualization options when thinking about your reports.
If you are a leader in a modern business today, then you are, in some way, a data steward. Let me be the first to welcome you to job duties that you may not even realize you had! Don’t worry—you’re not alone. Every manager deals with data; to be successful today means learning the right questions to ask and engaging the right help.
I hope this article prepares you to understand the complexity of the decisions you face as a data manager, and convinces you that you need help to build a strong data foundation. I hope you’ve gained the confidence you need to lead in a data-driven world, and face your data challenges head-on as an informed leader.