What are Data?
“Data” is one of those words we use constantly, often without pausing to ask what it actually means. Data are recorded observations: values, measurements, or descriptions collected about the world. They can be numbers, categories, images, text, sounds, or signals. What unites them is their role, not their form: they are the raw material from which we try to learn.
A single temperature reading, a survey response, a lab result, or a line of transaction history are all data. They say very little on their own, but once they are organized, analyzed, and interpreted, their meaning emerges.
The first, and undoubtedly the most important, principle of statistical data analysis is that, without knowing what type of data you have, you cannot choose an appropriate analysis.
So, what are the types of data? We will focus on statistical data. There are several ways to classify them, but the simplest classification is between categorical and quantitative data. Categorical data place observations into groups or classes. The possible values are labels or categories, and the frequency of those values can be counted. Data are categorical if observations can be put into a limited number of distinct “bins”. There are three subtypes:
Binary is the most basic; there are only two possible values. Some examples: Yes/No, Defective/Non-defective, Survive/Die, Accept/Reject, 0/1, Buy/Not Buy, For/Against.
Nominal (meaning “named”) extends binary to more than two categories; however, the categories are unordered. Some examples: ethnicity, industry sector, marital status, ABO blood type, medical diagnosis.
Ordinal (meaning “ordered”) also extends binary to more than two categories, but now the categories have a natural order. Some examples: highest level of education, gold/silver/bronze medals, 3-point scale of change (better, the same, worse), 5- or 7-point scales of agreement or satisfaction.
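To make the three subtypes concrete, here is a minimal Python sketch. The values, variable names, and the ordering of the satisfaction labels are all made up for illustration. It shows that categorical data are summarized by counting, and that ordinal data need their order stated explicitly before they can be sorted or given a middle category.

```python
from collections import Counter

# Hypothetical observations (illustrative values, not real data).
defective = [False, True, False, False, True]                 # binary
blood_type = ["O", "A", "O", "B", "A", "O", "AB"]             # nominal
satisfaction = ["low", "high", "medium", "medium",
                "high", "low", "medium"]                      # ordinal

# Binary: count how many observations fall in one of the two categories.
n_defective = sum(defective)

# Nominal: count how often each label occurs; the labels have no order.
blood_counts = Counter(blood_type)

# Ordinal: the order is part of the data, so we state it explicitly.
SATISFACTION_ORDER = ["low", "medium", "high"]   # assumed ordering
rank = {label: i for i, label in enumerate(SATISFACTION_ORDER)}

# Sort by rank, then take the middle value (odd number of observations).
sorted_satisfaction = sorted(satisfaction, key=lambda label: rank[label])
median_satisfaction = sorted_satisfaction[len(sorted_satisfaction) // 2]

print(n_defective)
print(blood_counts)
print(median_satisfaction)
```

Note that a median category makes sense for the satisfaction scores but not for blood type, because only the ordinal variable has an order to sort by.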
Quantitative data include both discrete and continuous data. Continuous data can take any value in a range, while discrete data take separate, countable values such as whole numbers. Quantitative data are easily characterized by having measurement units (although the units are sometimes imaginary, as with measures of pain or satisfaction or happiness). They are also characterized by the involvement of some kind of measurement process, such as a measuring instrument or questionnaire. And they can take any of a large number of possible values with little repetition of each value. For example: age (in years), height and weight, time duration, salary, return on investment, hemoglobin A1C.
Quantitative data can often be analyzed using arithmetic operations, while categorical data require classification and comparison.
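The difference can be sketched in a few lines of Python. The heights and industry sectors below are invented for illustration. Arithmetic summaries such as the mean and standard deviation suit the quantitative variable, while the categorical variable is summarized by counts and proportions.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical observations (illustrative values, not real data).
heights_cm = [162.0, 175.5, 168.2, 181.0, 170.3]                 # quantitative
sectors = ["retail", "health", "retail", "finance", "retail"]    # nominal

# Quantitative: arithmetic operations are meaningful.
mean_height = mean(heights_cm)
sd_height = stdev(heights_cm)   # sample standard deviation

# Categorical: classify and compare frequencies instead.
sector_counts = Counter(sectors)
most_common_sector, most_common_n = sector_counts.most_common(1)[0]
proportions = {s: n / len(sectors) for s, n in sector_counts.items()}

print(round(mean_height, 1), round(sd_height, 1))
print(most_common_sector, most_common_n)
print(proportions)
```

There is no sensible "average sector", which is why knowing the data type has to come before choosing the analysis.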
Sometimes the type of data is clear and obvious; sometimes it is not. That’s because:
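Quantitative data can be created by adding ordinal data together. For example, the responses to several 5-point agreement items can be summed into a single scale score.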
Data can be quantitative in theory, but categorical in practice. For example, household income is theoretically a dollar figure, but in practice it is often recorded as an income category.
Data can be expressed as more than one type. For example, age in years is quantitative, but age can be turned into categories.
Some “data” are for identification purposes only, rather than statistical analysis. For example: ID Number, Social Insurance Number, UPS Tracking Number. Some “data” are strings of text or non-numeric symbols. For example: words, dates, times on a 12-hour or 24-hour clock. They cannot be analyzed statistically in those forms. They are neither categorical nor quantitative (even though some of them look like numbers). But they can be transformed into statistical data. For example, subtracting a birthdate from the current date gives age.
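The transformations mentioned in the points above can be sketched in Python. This is only an illustration under our own assumptions: the item responses, the age cut points, and the function names `age_group` and `age_on` are hypothetical, not a standard.

```python
from datetime import date

# 1. Ordinal to quantitative: sum several 5-point items into one score.
likert_items = [4, 5, 3, 4]          # hypothetical responses to four items
scale_score = sum(likert_items)

# 2. Quantitative to categorical: turn age in years into age groups.
def age_group(age_years):
    """Assign an age to a category (the cut points are an assumption)."""
    if age_years < 18:
        return "under 18"
    if age_years < 65:
        return "18-64"
    return "65 and over"

# 3. Non-statistical to quantitative: derive age from a birthdate.
def age_on(birthdate, today):
    """Age in completed years on the given day."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)

age = age_on(date(1990, 6, 15), date(2024, 6, 14))   # day before birthday
print(scale_score, age, age_group(age))
```

The same person can therefore appear in a dataset as a birthdate, as an age in years, or as an age group, and each form calls for a different kind of analysis.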
Here are some other ways of classifying data.
Structured vs. Unstructured
Structured data are organized in a predefined format, such as rows and columns in a spreadsheet or database. For example: a table of patient records. Unstructured data lack a fixed format. For example: emails, images, audio recordings, social media posts. Modern data science increasingly deals with unstructured data, which require different tools to process and interpret.
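As a rough illustration, the Python sketch below contrasts the two. The patient records and the free-text note are invented, and word counting is just one simple way, chosen here for brevity, of imposing structure on text.

```python
import re
from collections import Counter

# Structured: every record has the same named fields (hypothetical records).
patients = [
    {"patient_id": 1, "age": 54, "diagnosis": "asthma"},
    {"patient_id": 2, "age": 61, "diagnosis": "diabetes"},
    {"patient_id": 3, "age": 47, "diagnosis": "asthma"},
]
ages = [row["age"] for row in patients]       # a column is easy to extract

# Unstructured: free text has no fields, so structure must be imposed first.
note = "Patient reports mild cough. Cough worse at night; no fever."
tokens = re.findall(r"[a-z]+", note.lower())  # split into lowercase words
word_counts = Counter(tokens)                 # now a table of word frequencies

print(ages)
print(word_counts.most_common(3))
```

Only after a step like this do the unstructured data become something that can be counted and compared.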
Observational vs. Experimental
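Observational data are collected without intervention, by simply recording what happens. Experimental data arise from controlled studies where conditions are manipulated. This distinction is crucial for understanding causation versus association: a well-controlled experiment can support conclusions about cause and effect, whereas observational data on their own usually show only association.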
Cross-sectional vs. Longitudinal
Cross-sectional data are collected at one point in time. Longitudinal data are repeated observations on the same units over multiple time points. Time series data are a common special case of data observed over time.
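The difference in layout can be sketched in Python. The patients, visits, and blood pressure readings below are hypothetical. The point is that longitudinal data allow change within each unit to be computed, which a single cross-section cannot provide.

```python
from collections import defaultdict

# Cross-sectional: one observation per unit, all at one point in time.
cross_section = [
    {"patient": "P1", "systolic": 128},
    {"patient": "P2", "systolic": 141},
]

# Longitudinal: repeated observations on the same units over time.
longitudinal = [
    {"patient": "P1", "visit": 1, "systolic": 128},
    {"patient": "P2", "visit": 1, "systolic": 141},
    {"patient": "P1", "visit": 2, "systolic": 122},
    {"patient": "P2", "visit": 2, "systolic": 138},
]

# Group the readings by patient, in visit order.
readings = defaultdict(list)
for row in sorted(longitudinal, key=lambda r: r["visit"]):
    readings[row["patient"]].append(row["systolic"])

# Within-patient change from the first visit to the last.
change = {patient: values[-1] - values[0]
          for patient, values in readings.items()}

print(change)
```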
Good Data vs. Bad Data
This may be the most important classification. Bad data limit what can be learned or may tell us absolutely nothing. They cannot be rescued by statistical analysis. Good data are informative. Data quality will be the subject of a future blog post.
Finally, what is the distinction between “data” and “information”? Data are raw observations. Information is what we obtain when data are processed, organized, and interpreted. Data are inputs; information is output.
The transformation from data to information requires context, analysis, and judgment. That requires knowledge of Statistics. In a world often described as “data-driven,” it is easy to assume that more data automatically lead to better decisions. But data alone are not enough. What matters is how they are collected, structured, analyzed, and interpreted. A vast quantity of poorly understood data can produce less insight than a small amount of well-analyzed data. It’s not the data themselves that matter; it’s what you do with them.
Postscripts
#1. The word ‘data’ is plural (the singular is ‘datum’). To be grammatically correct, write and say “data are”, not “data is”. That is still the convention in technical and scientific contexts and in some formal writing. In computer science and digital technology, the word ‘data’ is now usually treated as a singular noun. But you will impress people who care about these things by using the plural form. We use the plural form.
#2. The word ‘data’ can be pronounced with a long A or a short A. Both are accepted.