gnuplot Time-Based Histograms

Histograms are handy for showing how multiple values combine to form a whole, sometimes cumulatively. Such data is often time-based: the multiple values might be incoming and outgoing traffic on a network hub, or effort expended in various categories, whilst the time basis might be weekly totals of those values (for example, incoming vs. outgoing traffic per week).

Unfortunately, the current version of gnuplot (at the time of writing) does not support explicit x axis values for histograms. Instead, the x values are derived automatically from each row's position in the data file. For instance, if you set xdata time and set style data histograms, you'll get the error message “Need full using spec for x time data”. This page shows one way of solving this problem.
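The error can be reproduced with a few lines. This is a minimal sketch, assuming a data file named date_mins.tsv whose first column holds dates in YYYY-MM-DD form (the file used throughout this page); the exact wording of the message is the one quoted above and may differ between gnuplot versions.

```gnuplot
# Treat x values as dates in YYYY-MM-DD form.
set xdata time
set timefmt "%Y-%m-%d"

# Select the histograms plotting style.
set style data histograms

# A histogram 'using' spec names only a y column, but time-based x data
# needs an explicit x column, so gnuplot is expected to reject this plot
# command with "Need full using spec for x time data".
plot "date_mins.tsv" using 2
```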

First, let's look at the plain histogram approach to see the problem. Consider the contents of the data file date_mins.tsv:

Date	Level_0_Mins	Level_1_Mins	Level_2_Mins
2011-01-08	30.34	22.58	161.08
2011-01-15	23.83	20.33	104.00
2011-01-22	50.50	16.17	79.75
2011-01-29	67.59	21.74	99.25
2011-02-05	37.58	33.33	155.33
2011-02-12	48.17	44.33	66.00
2011-02-19	89.34	12.42	91.42
2011-02-26	113.09	35.83	123.34
2011-04-02	174.25	105.25	221.25
2011-04-09	98.09	55.92	109.00
2011-04-16	98.67	30.83	202.00
2011-04-23	87.17	58.25	127.09
2011-04-30	139.74	67.33	232.84
2011-04-30	20.0	10.0	30.0

Notice that there are no entries for March (2011-03-??). We would probably expect any plot of this data to leave a gap where March's entries would have been. Also, there are two entries for 2011-04-30; this is intentional, to demonstrate how each plotting style behaves when two rows share the same date.

Histogram Plot

If we plot the data file above with the histograms plotting style, we can't specify that the x axis is time-based. Because gnuplot's histogram feature positions each bar by line number (or, more specifically, data row index) rather than by the actual x value (time-based or otherwise), it doesn't detect any gap in the x values, so none is shown in the histogram plot below.

Even though gnuplot does not use the date values in the first column to determine the x position in the plot, we can still have the dates show up as x-axis tick labels. This is achieved by adding :xticlabels(1) after the y column number in the using specification.

Note also that gnuplot has to be told (here via autotitle columnhead in the set key command) that the first line of the data file contains column headings, which can be used to label the plot. Moreover, the column headings should not contain spaces, since that would also confuse gnuplot.


reset
clear

# If we don't use columnhead, the first line of the data file
# will confuse gnuplot, which will leave gaps in the plot.
set key top left outside horizontal autotitle columnhead

set xtics rotate by 90 offset 0,-5 out nomirror
set ytics out nomirror

set style fill solid border -1
# Make the histogram boxes half the width of their slots.
set boxwidth 0.5 relative

# Select histogram mode.
set style data histograms
# Select a row-stacked histogram.
set style histogram rowstacked

plot "date_mins.tsv" using 2:xticlabels(1) lc rgb 'green', \
	"" using 3 lc rgb 'yellow',  \
	"" using 4 lc rgb 'red'


Figure 1: Categorical Histogram of Weekly Data

Boxes Plot

Clearly, the plot above has no gap for March: gnuplot treats histogram 'bins' as categorical data, not ordinal data. (To be more accurate, the bins do have an ordering, namely their position in the file, but we might have expected an ordering based on date.) There is, however, a workaround for the date-ordering problem, which uses the boxes (as opposed to histograms) plotting style. This is demonstrated in the example below.

Another technique used in the following script is to give expressions on data values, rather than plain column numbers, in the plot command. Immediately after the using keyword, we see 1:($2+$3+$4). The parentheses and dollar signs tell gnuplot that this is an expression for it to evaluate, rather than a column number. So, in this command, column one supplies the x values, whilst each y value is the sum of the values in columns two, three, and four.


reset
clear

set key top left outside horizontal autotitle columnhead

set xtics rotate by 90 offset 0,-5 out nomirror
set ytics out nomirror

# This won't affect histogram plots since they just treat the
# dates in the first columns as literal strings.
set format x "%Y-%m-%d"

# Setting xdata to time precludes the use of histograms.
set xdata time
set timefmt "%Y-%m-%d"

set style fill solid border -1
# 1 week = 7 * 24 * 60 * 60 = 604,800 seconds.
# Make each box half a week wide (50% of its slot).
set boxwidth 302400 absolute

# ($2+$3) is an expression meaning 'add the values in column 2 and
# column 3'; this is effectively the same as row-stacking.
# Data-series should be given in order of decreasing magnitude.
plot "date_mins.tsv" using 1:($2+$3+$4) with boxes lc rgb "red", \
	"" using 1:($2+$3) with boxes lc rgb "yellow", \
	"" using 1:2 with boxes lc rgb "green"

This produces the following plot; note that we now have a gap where March is. Also, the order of the items in the key is reversed relative to Figure 1, because the series are now plotted largest first.

Be careful to ensure that the (summed) columns are mentioned in order of decreasing size in the plot command. Otherwise, a larger box drawn later in the plot command will obscure the smaller boxes drawn before it.


Figure 2: Time-Based Histogram of Weekly Data (using boxes)

Note that the last bar has a small horizontal line near the bottom. This is because gnuplot now treats the first column as time values, and we have two rows of data for the same time (i.e. the same x position). It therefore draws the box for the second, duplicate row on top of the first one, and the top border of the smaller box remains visible inside the larger one. So, whilst histogram plots handle duplicate x values by creating a new bar for each row, they cannot be used when x values are treated as numerical (be they dates, times, or simple numbers), and the boxes style simply overdraws rows that share an x value.
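One possible way to avoid the overdrawn duplicate is to let gnuplot combine rows that share an x value before drawing them. The sketch below is a variation on the boxes script, not part of the original workaround. It assumes that summing the two 2011-04-30 rows is the desired result, and it relies on gnuplot's smooth frequency option, which adds together the y values of points with the same x value. It has not been checked against every gnuplot version.

```gnuplot
reset

set key top left outside horizontal autotitle columnhead
set xtics rotate by 90 offset 0,-5 out nomirror
set ytics out nomirror

# Interpret and display the first column as dates.
set xdata time
set timefmt "%Y-%m-%d"
set format x "%Y-%m-%d"

set style fill solid border -1
# Half a week, in seconds.
set boxwidth 302400 absolute

# 'smooth frequency' sorts the points by x and sums the y values of
# points that share the same x, so the two 2011-04-30 rows should
# become a single box in each series (assumption: summing is wanted).
plot "date_mins.tsv" using 1:($2+$3+$4) smooth frequency with boxes lc rgb "red", \
	"" using 1:($2+$3) smooth frequency with boxes lc rgb "yellow", \
	"" using 1:2 smooth frequency with boxes lc rgb "green"
```

Because each series is a sum of columns, combining the duplicate rows keeps the series in order of decreasing size, so the red, yellow, green ordering of the plot command still applies.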


Copyright © Neil Carter

Content last updated: 2012-02-28