You might consider the way a humble and common line graph works to be obvious. If you learned to use them in the workplace, chances are that you think of line graphs as a conventional means to display one or more sets of time-series values. This is the graphical use of lines that typically bears the name “line graph.” Confusion arises, however, because lines can be used for various purposes in graphs. One common use is lines or curves of best fit, also known as trend lines or fit models. These lines summarize patterns in quantitative data sets rather than representing the actual values on which that summary is based. Lines are also used to represent multivariate data in the form of parallel coordinates plots, which work quite differently from regular line graphs. There is another graphical use of lines that you have been exposed to if you’ve studied mathematics beyond a basic level: equation graphs and function graphs. These graphs do not represent actual values that have been collected, but instead model mathematical equations and functions, representing visually what can also be expressed using mathematical notation. If you’ve been exposed extensively to this graphical use of lines, you might find regular line graphs a bit confusing and misleading, despite their simplicity.
Every once in a while when teaching my Show Me the Numbers course about table and graph selection and design, a participant expresses concern that my use of line graphs is misleading. I’ll illustrate this situation with an example. The graph below is a simple, typically designed line graph that displays one set of time-series values.
This line graph displays only 10 quantitative values. The values are positioned as points along the line, one for each year. The line connects the 10 data points with straight segments from one data point to the next. For this reason, what I’m calling a typical line graph is sometimes referred to more specifically as a segmented line graph. The purpose of the line is to clearly display the pattern of change along the time series. The quantitative values are aggregates (in this case, sums of profits) for each interval of time along the X axis. The slope of each line segment represents the nature of the change between each contiguous pair of intervals (e.g., between 2009 and 2010). Nowhere between contiguous data points do any other values exist. Stated differently, no values can be read along the line apart from a single data point for each interval.
The objection that has been expressed in my classes a handful of times usually goes something like this: “That line graph is misleading. It claims that from 2009 to 2010 the profits decreased by a constant amount during that period of time, but there was probably a great deal of variation in daily profits.” This objection is usually based on a misunderstanding that stems from experience using lines to graph mathematical equations and functions. When a line is used for that purpose, we don’t usually call it a line graph. Instead, it is a “graph of an equation” or a “graph of a function,” among other potential terms. With graphs of equations or functions, every possible position along the line represents two values, one associated with the X axis and another associated with the Y axis. The scales along both axes are continuous quantitative scales. Given this mathematical use of lines in graphs, we can certainly understand why someone might find the way that regular line graphs work for displaying time series to be confusing, resulting in the concern that they sometimes express. If you are one of those who share this concern, just realize that lines are used for various purposes in graphs and that, in regular line graphs for displaying time series, they work differently than for mathematical equations and functions.
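The difference between the two uses of lines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `f` and the profit figures in `yearly_profit` are hypothetical and are not taken from the graph discussed above.

```python
def f(x):
    """A hypothetical function: it yields a Y value for every X on a continuous scale."""
    return 2 * x + 1

# Hypothetical yearly profit sums: exactly one value per interval label.
yearly_profit = {2009: 80.0, 2010: 75.0}

# A graph of a function has a value at every position along the line.
halfway_on_function = f(0.5)

# A time-series line graph has no value between two interval labels.
halfway_between_years = yearly_profit.get(2009.5)

print(halfway_on_function)     # 2.0
print(halfway_between_years)   # None
```

In the time-series case, the segment drawn between 2009 and 2010 only shows the direction and size of the change from one interval's value to the next. It makes no claim about what happened in between.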
In a regular line graph of a time series, the scale along the X axis is what’s called an interval scale. It functions as a type of categorical scale that labels what the values represent. An interval scale begins as a range of quantitative values, which is then subdivided into a set of equally sized intervals, and a label is associated with each to identify it (e.g., >=20 & <30, >=30 & <40, etc.). Time is quantitative in nature. By this I mean that you can sum the number of hours, days, weeks, months, etc., to quantify the duration of time that has transpired between any two points in time. When we subdivide time into intervals of equal size, we express time as a categorical scale of an interval type. (Even though months are not all equal in size, when we display a time series by month, we treat the months as if they were equal, because they are close enough in size to suit our purpose.) The intervals are not specific points in time, but are instead specific ranges of time. In a line graph that displays a single set of time-series values, one and only one value appears for each interval. That value is usually an aggregate (most often a sum, and secondarily an average), but on occasion it is instead a measurement taken at a particular point in time (e.g., the closing price of a stock with daily intervals).
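As a rough illustration of how one aggregate value per interval comes about, here is a minimal Python sketch. The daily records in `daily_profits` and the helper `sum_by_year` are hypothetical names and made-up numbers, used only to show many point-in-time values collapsing into a single sum per yearly interval.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily profit records (date, profit); the numbers are made up.
daily_profits = [
    (date(2009, 1, 15), 120.0),
    (date(2009, 7, 3), -40.0),
    (date(2010, 2, 9), 60.0),
    (date(2010, 11, 21), 15.0),
]

def sum_by_year(records):
    """Aggregate point-in-time records into one sum per yearly interval."""
    totals = defaultdict(float)
    for day, amount in records:
        # The year acts as the categorical label of the interval the day falls in.
        totals[day.year] += amount
    return dict(sorted(totals.items()))

yearly = sum_by_year(daily_profits)
print(yearly)  # {2009: 80.0, 2010: 75.0}
```

Whatever variation existed within each year is absorbed into the sum. The line graph then plots only these per-interval totals, one data point for each year.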
A great deal of confusion results from the many ways that we graphically represent data and the sloppy terms that we use to describe them. We would all benefit from a better understanding of the fundamentals and more clearly defined terms.