Sometimes, one needs to plot one's data grouped in some way. For instance, where daily data would be too cluttered; the same data, summed into weekly values would be more readable.
Consider the following data (here's the full data set):
Date WeekCommencing Measure
2014-01-07 2014-01-06 8
2014-01-10 2014-01-06 11
2014-01-10 2014-01-06 18
2014-01-16 2014-01-13 18
2014-01-16 2014-01-13 12
2014-01-17 2014-01-13 7
2014-01-17 2014-01-13 9
2014-01-23 2014-01-20 8
2014-01-24 2014-01-20 12
2014-01-29 2014-01-27 8
2014-02-03 2014-02-03 8
2014-02-07 2014-02-03 10
...
You might generate a daily-value plot of this using the following gnuplot commands:
reset
# Tab separated values
set datafile separator "\t"
# Ignore lines containing column headings
set datafile commentschars "DW"
# Don't include a key since there is only one range being plotted
set nokey
set title "Individual Measures"
set ylabel "Count"
# X-axis values are dates.
set xdata time
set timefmt "%Y-%m-%d"
# This controls how the x-axis tick labels are formatted.
set format x "%Y-%m-%d"
# The data labels are quite long, so write them vertically
# to prevent them from overlapping.
set xtics rotate by 90 offset 0,-4
# We need a larger space at the bottom of the graph to fit in the dates.
set bmargin 5
plot 'summing_columns.txt' using 1:3
This produces the following plot:
This is rather cluttered. Note that if we had drawn the values with impulses, the plot might be easier to read, but lower values (at the same x-position) would be hidden behind/under higher values. The clutter (and overlapping x-position values) can be avoided by summing the data for each week, using the "WeekCommencing" field, thus:
set output "summing_columns_weekly.png"
set title "Weekly Summed Measures"
set ylabel "Weekly Sum of Value"
plot '<awk "{W[$2]+=$3} END {for (w in W) print w\"\t\"W[w]}" summing_columns.txt | sort' using 1:2 with impulses
This generates the following plot:
This is much easier to read, without losing any information.
The trick here is to use the Awk program. Awk is a handy program designed for manipulating tabular data such as we have here (and such as we usually have when plotting). Awk for Windows is available from the GnuWin32 project at SourceForge. Don't be put off by the "32"; it works fine on 64 bit Windows.
The plotting command is not as complicated as it seems:
'<awk ...'
tells gnuplot to plot the
output of the awk command, as opposed to a plain file.$1
and $2
are Awk's shorthand for
column one and column two, respectively.C[$2]
refers to an array (hash) variable, indexed by the
value in the second column. In our case, this will be the
"WeekCommencing" value. This gives us an list of values, one for
each week.+=$3
keeps a running total by adding the values
(with the same "WeekCommencing" date) in column three.END
tells Awk that the code in the following curly
braces should be performed (once) at the end of the run. We use this
to print out the final totals.for (w in W)
performs a loop over each week
(w
) in the list (W
) of "WeekCommencing"
values.print w\"\t\"W[w]
prints two columns; the week
date, and the total value for that week. \t
prints a
tab character, creating two tab-separated columns. Note that the
\"
characters are necessary for the (Windows) command
line to properly interpret the tab character.| sort
is part of the overall command whose output
is plotted by gnuplot, and sorts the output of the Awk command into
chronological order. It isn't strictly necessary here, but Awk might
not produce its output in date order. Although this doesn't matter
for basic plots, it makes life easier if we use more sophisticated
plots (such as a Bezier curve; more on this below).If you want to plot a sort of average trend line for the weekly data, try the
following command; this is where the |sort
becomes necessary (try
it without the sort):
plot '<awk "{W[$2]+=$3} END {for (w in W) print w\"\t\"W[w]}" summing_columns.txt | sort' using 1:2 with impulses,\
'' using 1:2 smooth bezier
Note the smooth bezier
plotting style, which produces the
following:
The smooth
command can be used to produce a frequency
histogram (without actually using gnuplot's histogram feature). We do this by
using it with the frequency
option. In doing so, we need to change
the x-axis to use the third column values (the ones that we used for the y-axis
in the previous plots). For the y-axis, we use the numeric value of one. By
doing this, every time gnuplot encounters a given x-value, it adds one to the
tally for that value. In other words, it counts the number of times each x-axis
value appears. We specify a numeric value (as opposed to a column number) in the
using part, by placing brackets around it, hence 2:(1)
means
x=column 2, and y=number 1.
The other set
commands are there to change the settings left
over from the previous (time-based) plots. You could just use reset
set title "Distribution of Weekly Summed Values"
set xlabel "Summed Value"
set ylabel "Weekly Frequency"
# Turn off date/time data mode
set xdata
# Change x-axis labels to whole numbers
set format x "%.0f"
# Turn off the big margin and vertical writing for the x-axis
set bmargin 3
set xtics rotate by 0 offset 0,0
# impulses will be a bit thin, so use boxes instead
set boxwidth 0.5
set style fill solid 0.5
plot 'summing_columns.txt' using 3:(1) smooth frequency with boxes
Home | About Me | Copyright © Neil Carter |
Last updated: 2014-11-24