gnuplot Frequency Plot

Let's say you've written a random number generator that produces numbers with a Gaussian (i.e. normal) distribution, and you want to check the distribution visually using a frequency histogram. This can be done in gnuplot using its support for user-defined functions and the smooth frequency option of the plot command. The technique used here comes from Professor David MacKay of Cambridge University (http://www.inference.phy.cam.ac.uk/teaching/comput/C++/).

The output of the random number generator is recorded in a text file as a single column of floating point numbers:

-0.355534673
0.13217762
-0.842048585
-0.131302223
0.03265043
0.190827265
-0.25680709
-0.072149448
-0.156086803
0.065009468
-0.003330263
-0.059023393
-0.017556788
-0.06090761
-0.205948548
-0.18360047
...

The gnuplot commands to plot this are shown further down. The principle of operation is not to plot each value from the file, but to count the number of values that fall into specific ranges. The first thing we need to do is group the numbers into collections of similar value; this is done by the functions rounded() and bin_number().

rounded(x) maps its x parameter to a single representative value for the bin that x falls into. Every value in the same bin therefore yields the same X coordinate, and the values can be counted together. The function takes the lower edge of the bin (bin_width multiplied by the bin number) and shifts it up by bin_width multiplied by an offset. The script uses an offset of 0.5, which positions each bar at the centre of its range on the X axis.

The bin_number() function uses bin_width to compute how many whole multiples of bin_width fit below x and, therefore, the particular bin into which x should be counted. It does this with floor(x/bin_width), which returns an integer bin index. For instance, with bin_width=0.1, values from 0.0 up to (but not including) 0.1 fall into bin 0, values from 0.1 up to 0.2 fall into bin 1, values from 0.2 up to 0.3 fall into bin 2, and so on. Negative values work the same way because floor() always rounds towards minus infinity: values from -0.1 up to 0.0 fall into bin -1.
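
The arithmetic can be checked at the gnuplot prompt before plotting anything. The following is a minimal sketch using the same definitions as the script further down, except that the 0.5 is pulled out into a variable called bin_offset (a name assumed here for illustration; the script itself writes the value directly). The two inputs are taken from the sample data above.

```gnuplot
bin_width = 0.1
bin_offset = 0.5    # assumed name; the script further down hard-codes 0.5

bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * (bin_number(x) + bin_offset)

# 0.13217762 / 0.1 = 1.32..., so the value belongs to bin 1, centred near 0.15.
print bin_number(0.13217762), rounded(0.13217762)

# -0.355534673 / 0.1 = -3.55..., and floor() rounds down to bin -4, centred near -0.35.
print bin_number(-0.355534673), rounded(-0.355534673)
```

Setting bin_offset to 0 instead would place each bar at the lower edge of its bin rather than at its centre.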

Once the disparate values in the data file have been mapped onto a fixed set of bin values, we can count the instances of each bin value and plot the counts. Since we want to plot not the values themselves but how many of them fall into each bin, we add gnuplot's smooth frequency option to the plot command. This option replaces all points that share the same X value with a single point whose Y value is the sum of their Y values.

Note that the plot command puts brackets (parentheses) around the using values. The brackets tell gnuplot to interpret each value as a numerical expression rather than a column number (although, inside brackets, a dollar sign followed by a number explicitly references a column). Thus, (rounded($1)) means "pass the value in column one to the rounded() function, and use the value returned as the X-axis coordinate". And (1) means that every data point contributes a Y value of one, so smooth frequency increments the count for a bin by one for each value that falls into it.
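
The difference is easiest to see with a simpler plot. This sketch assumes the same gaussian.txt file and plots each value against its position in the file, using gnuplot's pseudo-column 0 (the record index) for X. It is only an illustration and is not part of the histogram script.

```gnuplot
# Bare numbers are column references: x = column 0 (record index), y = column 1.
plot 'gaussian.txt' using 0:1 with points

# Brackets turn an entry into an expression: y = absolute value of column 1.
plot 'gaussian.txt' using 0:(abs($1)) with points

# A bracketed constant is just that constant: every point is drawn at y = 1.
plot 'gaussian.txt' using 0:(1) with points
```

The last form is the one the histogram relies on: a constant 1 per data point, which smooth frequency then sums for each bin. The histogram script itself follows.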


clear
reset

# Hide the legend and draw only the bottom and left borders.
set key off
set border 3

# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
set yzeroaxis

# Each bar is half the (visual) width of its x-range.
set boxwidth 0.05 absolute
set style fill solid 1.0 noborder

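# Width of each bin on the X axis.
bin_width = 0.1

# Index of the bin containing x (floor() rounds towards minus infinity).
bin_number(x) = floor(x/bin_width)

# Centre of the bin containing x; the 0.5 is the offset that centres the bar.
rounded(x) = bin_width * ( bin_number(x) + 0.5 )

# Add one to a bin's count for every value in it, and draw the counts as boxes.
plot 'gaussian.txt' using (rounded($1)):(1) smooth frequency with boxes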

This gives us the following result:

Figure 1: Frequency Plot for Random Function with Gaussian Distribution
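
To check the counts numerically rather than by eye, the same plot command can be redirected to a text file with set table. This is a sketch that assumes gaussian.txt is in the current directory; the output file name gaussian_counts.txt is made up for this example.

```gnuplot
bin_width = 0.1
bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * ( bin_number(x) + 0.5 )

# Write the plotted points to a text file instead of drawing them.
set table 'gaussian_counts.txt'
plot 'gaussian.txt' using (rounded($1)):(1) smooth frequency
unset table
```

Each data line of the output file should hold a bin centre followed by the number of values counted into that bin, which makes it easy to confirm that the counts peak near zero and fall away on both sides.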

Content last updated: 2012-02-28