Confusion about Line Graphs
You might consider the way a humble and common line graph works to be obvious. If you learned to use them in the workplace, chances are that you think of line graphs as a conventional means to display one or more sets of time-series values. This is the graphical use of lines that typically bears the name “line graph.” Confusion is created, however, by the fact that lines can be used for various purposes in graphs. One common use is lines or curves of best fit, also known as trend lines or fit models. These lines summarize patterns in quantitative data sets rather than representing the actual values on which that summary is based. Lines are also used to represent multivariate data in the form of parallel coordinates plots, which work quite differently from regular line graphs. There is another graphical use of lines that you have been exposed to if you’ve studied mathematics beyond a basic level: equation graphs and function graphs. These graphs do not represent actual values that have been collected, but instead model mathematical equations and functions, representing visually what can also be expressed using mathematical notation. If you’ve been exposed extensively to this graphical use of lines, you might find regular line graphs a bit confusing and misleading, despite their simplicity.
Every once in a while when teaching my Show Me the Numbers course about table and graph selection and design, a participant expresses concern that my use of line graphs is misleading. I’ll illustrate this situation with an example. The graph below is a simple, typically designed line graph that displays one set of time-series values.
This line graph displays only 10 quantitative values. The values are positioned as points along the line, one for each year. The line connects the 10 data points as straight segments from one data point to the next. For this reason, what I’m calling a typical line graph is sometimes referred to more specifically as a segmented line graph. The purpose of the line is to clearly display the pattern of change along the time series. The quantitative values are aggregates—in this case sums of profits—for each interval of time along the X axis. The slope of each line segment represents the nature of the change between each contiguous set of intervals (e.g., between 2009 and 2010). Nowhere between contiguous data points do any other values exist. Stated differently, no values can be read along a line apart from a single data point for each interval.
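To make the point about slopes concrete, here is a minimal Python sketch. The yearly profit totals and the function name `segment_changes` are made up for illustration and are not the values in the graph. It computes the one thing each line segment encodes: the change from one interval's value to the next.

```python
# Hypothetical yearly profit totals; the values are made up for illustration.
profits = {2008: 410, 2009: 450, 2010: 320, 2011: 440}

def segment_changes(series):
    """Return (from_year, to_year, change) for each pair of contiguous intervals."""
    years = sorted(series)
    # Each segment connects two neighboring data points; nothing exists between them.
    return [(a, b, series[b] - series[a]) for a, b in zip(years, years[1:])]

for start, end, change in segment_changes(profits):
    print(f"{start} to {end}: {change:+d}")
```

Four data points yield three segments, and each segment says only how the aggregate changed between two intervals, not what happened on any day in between.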
The objection that has been expressed in my classes a handful of times usually goes something like this: “That line graph is misleading. It claims that from 2009 to 2010 the profits decreased by a constant amount during that period of time, but there was probably a great deal of variation in daily profits.” This objection is usually based on a misunderstanding that stems from experience using lines to graph mathematical equations and functions. When a line is used for that purpose, we don’t usually call it a line graph. Instead, it is a “graph of an equation” or a “graph of a function,” among other potential terms. With graphs of equations or functions, every possible position along the line represents two values, one associated with the X axis and another associated with the Y axis. The scales along both axes are continuous quantitative scales. Given this mathematical use of lines in graphs, we can certainly understand why someone might find the way that regular line graphs display time series confusing, which leads to the concern that is sometimes expressed. If you are one of those who share this concern, just realize that lines are used for various purposes in graphs and that, in regular line graphs for displaying time series, they work differently than they do for mathematical equations and functions.
In a regular line graph of a time series, the scale along the X axis is what’s called an interval scale. It functions as a type of categorical scale that labels what the values represent. An interval scale begins as a range of quantitative values, which is then subdivided into a set of equally sized intervals, and a label is associated with each to identify it (e.g., >=20 & <30, >=30 & <40, etc.). Time is quantitative in nature. By this I mean that you can sum the number of hours, days, weeks, months, etc., to quantify the duration of time that has transpired between any two points in time. When we subdivide time into intervals of equal size, we express time as a categorical scale of an interval type. (Even though months are not all equal in size, when we display a time series by month, we treat the months as if they were equal because they are close enough in size to suit our purpose.) The intervals are not specific points in time, but are instead specific ranges of time. In a line graph that displays a single set of time-series values, one and only one value appears for each interval. This is usually an aggregate value (most often a sum, and secondarily an average), but on occasion the value for each interval is instead a measurement taken at a particular point in time (e.g., the closing price of a stock with daily intervals).
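The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the daily profit records are invented, and `aggregate_by_year` is a hypothetical helper name. It shows many dated values collapsing into a single sum per yearly interval.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily profit records (date, amount); made up for illustration.
daily_profits = [
    (date(2009, 3, 14), 1200.0),
    (date(2009, 11, 2), 800.0),
    (date(2010, 1, 20), 500.0),
    (date(2010, 7, 9), 300.0),
    (date(2011, 5, 5), 2100.0),
]

def aggregate_by_year(records):
    """Sum the amounts that fall within each yearly interval."""
    totals = defaultdict(float)
    for day, amount in records:
        # The year labels a range of time, not a specific point in time.
        totals[day.year] += amount
    return dict(sorted(totals.items()))

yearly = aggregate_by_year(daily_profits)
# One value per interval: these are the only points a line graph would connect.
for year, total in yearly.items():
    print(year, total)
```

The daily variation still exists in the underlying records, but the graph deliberately shows only the per-interval sums.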
A great deal of confusion results from the many ways that we graphically represent data and the sloppy terms that we use to describe them. We would all benefit from a better understanding of the fundamentals and more clearly defined terms.
Take care,
11 Comments on “Confusion about Line Graphs”
I still struggle to see the value in ‘connecting the dots’ here. Wouldn’t a simple vertical bar be easier? Does the slope of the line between each dot really convey any more information than “hey, 2010 saw a sharp drop in profits, but we went back to normal in 2011”?
Nate,
It is actually quite easy to demonstrate the advantage of lines whenever you want to see patterns of change through time. To see this for yourself, access or create two sets of time-series values with twelve values each, one for each month of a year. A simple example would be monthly sales revenues for two regions. Next, display those values as a bar graph. The graph will show two side-by-side bars, one for each region, for each month. Next, pick one of the two data sets and try to see its pattern of change across the entire year. To do this, you must ignore the other set of bars and then imagine connecting the tops of the bars in the selected set from left to right. In effect, you must imagine a line connecting the tops of the bars. This is extremely difficult to do in your head, but the image that you’re trying to construct is precisely what you get with a line graph. To complicate the task further, try to compare the pattern of change for one set of values to the pattern formed by the other set of values. No matter how much you try, you can’t do this, because this requires the construction of two lines, which you must simultaneously hold in your head before you can compare the patterns. You can’t do this because of the severe limitations of working memory. Finally, create a line graph for the same two sets of time-series values and notice how easy it is to see and compare the patterns of change. A good rule of thumb is to always use a line graph for time series when you want to see patterns of change, which is almost always the case.
Would it help to make the data point markers stand out (e.g. larger, modify color, etc), to emphasize that they are the primary display of the original data, while the lines are secondary? And/Or to attach data labels at the points? Are there other ways to do this?
RH,
Once someone understands how to read a line graph, the data points are rarely needed at all. Making them larger creates visual clutter and reduces the precision of the values without providing any benefits to counterbalance those problems. Adding data labels is even worse, for it creates a great deal of clutter that makes it difficult to focus on the lines. Patterns of change through time that the lines reveal are not at all secondary to the data points. In fact, the primary benefit of a line graph of time-series values is to understand and compare patterns of change, not to read individual values. If the primary purpose were to read individual values, then a table of values would do the job better.
Any problem that people might have in understanding a line graph the first time that they’re exposed to one can be solved with a brief explanation. Let’s not try to solve a problem that can be so easily overcome by degrading the usefulness of the graph with clutter and imprecision.
Stephen, that makes a lot of sense, especially the example of a case when you are plotting multiple time series on the same graph. I did exactly that in excel and it definitely was much easier to see a pattern of change when I had two time series, and it got even easier when I added 3 or 4.
Thanks for the responses Stephen. Has started me thinking of what an effective context would be for presenting even a simple graph like this, which can at least give viewers who might need it some clues for how to learn how to interpret it. (Can bring a horse to water but…)
Somewhere back in school I was taught that lines should be used for continuous series, and bars for discrete.
If following that rule, the “year” series here would be a discrete series and should use a bar as Nate suggests – the discrete points of “year” cannot be interpolated, so the line is misleading.
Whenever I have come across the issue of trying to show a comparison of two series though, the line has always won out for exactly the reason Stephen gives.
I can’t shake that rule I was taught many years ago, and I’m always torn when there’s only one series to be displayed.
Jack,
The simple rule of thumb, “Lines for a continuous series and bars for a discrete series,” is still taught in schools, which is unfortunate because the rule is neither thorough nor clearly explained. People often don’t understand the difference between continuous and discrete. Time is continuous. Dividing time into equal intervals (years, quarters, months, weeks, days, hours, etc.) does not change the nature of time from continuous to discrete. We aggregate time-series values by intervals because the summary view that this provides is easier to read and understand than an attempt to show every single value. For example, if you’re analyzing sales through time, for most purposes it would not be useful to show each individual sales transaction in a time-series graph, for it wouldn’t provide a clear view of how sales are changing through time overall.
When we place time on the X-axis of a graph, we usually divide it into intervals. Expressing time as a series of equally sized intervals does not change the continuous nature of time, it just provides a means for us to aggregate values per interval to get a better overall view of change through time. It makes sense to connect values along an interval scale with a line because there is an intimate connection from one value to the next and the pattern that’s formed by the line represents something meaningful: the pattern of change through time. Using bars for time series values only makes sense when you don’t want to see the pattern of change. For example, if you have two monthly series of values—sales and the budget for sales—and you only want to compare sales to the budget for individual months, one month at a time, bars support this well. The bars would incline you to see time as discrete chunks, one per month, rather than a continuous series.
When I teach rules of thumb, I’m careful to explain them thoroughly and clearly, to make sure that they are understood. Too often, books and courses merely present rules of thumb, which people then follow blindly without understanding. This is my objection to many of the data visualization books and courses that exist today. They don’t promote understanding. They treat data visualization as a set of procedural skills (“Just do it this way”) rather than skills that require a conceptual-level understanding. As such, they don’t support true learning.
The (mis)application of learned rules of thumb is an interesting challenge for Information Design. My intuition tells me that it makes perfect sense to join up the dots in the line graphs shown above, and I think it’s because I want to see the trend.
I’m by no means a particularly intuitive thinker, but I’ve struggled many times with people who are ‘rules first, intuition second’, often with graphs and especially with grammar. For example, I happily split my infinitives, which drives people to such distraction that they cannot parse the sentence at all, leaving them both grumpy and without the (dubious) benefit of what I actually said.
Then at the far extreme, those who only have intuition devoid of rules and reason like bubble graphs and pie charts…
A little more general than the math-lines versus time-series-lines difference, I think it helps to realize that lines carry many pieces of information, and we can’t expect all of them to be relevant in a given instance. The power of the aspects that are relevant is usually worth the distraction from the others.
Some line information pieces that come to mind: connection, slope, positions of end points, interpolated values, continuity, sequence, distance, horizontal distance, vertical distance.
Stephen,
I have seen time-series line graphs fill in values for time intervals that do not exist in the actual data. For example, a time-series graph of sales by day will also plot days that had no sales entry along the X axis to provide continuity. Is this a correct way of representing the data?