Get 1 Month Free Access to Generative AI Fundamentals on Coursera

View Only

It seems like I have come across alot of posts recently about visualizing univariate distributions. Besides my own recent blog post about comparing distributions of unequal size in SPSS, here are a few other blog posts I have recently come across;

- Nathan Yau has a recent post on his flowing data blog, How to Visualize and Compare Distributions.
- Lucas Fenu on his blog gRaphics! has a post, Plotting data and distribution simultaneously (with ggplot2).
- Not a blog post, but Hadley Wickham & Lisa Stryjewski have a working paper that will be of interest, 40 years of box plots.

Such a variety of references is not surprising though. Examining univariate distributions is a regular task for data analysis and can tell you alot about the nature of data (including potential errors in the data). Here are some posts on the Cross Validated Q/A site of related interest I have compiled;

- Wondering what this bean plot analysis chart means
- What are good data visualization techniques to compare distributions?
- How to improve standard visualizations of a univariate distribution?
- Analyze and visualize participants response towards particular condition
- How to scale violin plots for comparisons?

In particular the recent post on bean plots and Luca Fenu's post motivated my playing around with SPSS to produce the bean plots here. Note Jon Peck has published a graphboard template to generate violin plots for SPSS, but here I will show how to generate them in the usual GGRAPH commands. It is actually pretty easy, and here I extend the violin plots to include the beans suggested in bean plots!

A brief bit about the motivation for bean plots. Besides consulting the article by Peter Kampstra, one is interested in viewing a univariate continuous distribution among a set of different categories. To do this one uses a smoothed kernel density estimate of the distribution for each of the subgroups. When viewing the smoothed distribution though one loses the ability to identify patterns in the individual data points. Patterns can mean many things, such as outliers, or patterns such as striation within the main body of observations. The bean plot article gives an example where striation in measurements at specific inches can be seen. Another example might be examining the time of reported crime incidents (they will have bunches at the beginning of the hour, as well as 15, 30, & 45 minute marks).

Below I will go through a brief series of examples demonstrating how to make bean plots in SPSS.

First I will make some fake data for us to work with.

`******************************************.`

set seed = 10.

input program.

loop #i = 1 to 1000.

compute V1 = RV.NORM(0,1).

compute groups = TRUNC(RV.UNIFORM(0,5)).

end case.

end loop.

end file.

end input program.

dataset name sim.

execute.

value labels groups

0 'cat 0'

1 'cat 1'

2 'cat 2'

3 'cat 3'

4 'cat 4'.

******************************************.

Next, I will show some code to make the two plots below. These are typical kernel density estimates of the `V1`

variable I made for the entire distribution, and these are to show the elements of the base bean plots. Note the use of the `TRANS`

statement in the GPL to make a constant value to plot the rug of the distribution. Also note although such rugs are typically shown as bars, you could pretty much always use point markers as well in any situation where you use bars. Below the image is the GGRAPH code used to produce them.

`******************************************.`

*Regular density estimate with rug plot.

GGRAPH

/GRAPHDATASET NAME="graphdataset" VARIABLES=V1 MISSING=LISTWISE REPORTMISSING=NO

/GRAPHSPEC SOURCE=INLINE.

BEGIN GPL

SOURCE: s=userSource(id("graphdataset"))

DATA: V1=col(source(s), name("V1"))

TRANS: rug = eval(-26)

GUIDE: axis(dim(1), label("V1"))

GUIDE: axis(dim(2), label("Density"))

SCALE: linear(dim(2), min(-30))

ELEMENT: interval(position(V1*rug), transparency.exterior(transparency."0.8"))

ELEMENT: line(position(density.kernel.epanechnikov(V1*1)))

END GPL.

*Density estimate with points instead of bars for rug.

GGRAPH

/GRAPHDATASET NAME="graphdataset" VARIABLES=V1 MISSING=LISTWISE REPORTMISSING=NO

/GRAPHSPEC SOURCE=INLINE.

BEGIN GPL

SOURCE: s=userSource(id("graphdataset"))

DATA: V1=col(source(s), name("V1"))

TRANS: rug = eval(-15)

GUIDE: axis(dim(1), label("V1"))

GUIDE: axis(dim(2), label("Density"))

SCALE: linear(dim(2), min(-30))

ELEMENT: point(position(V1*rug), transparency.exterior(transparency."0.8"))

ELEMENT: line(position(density.kernel.epanechnikov(V1*1)))

END GPL.

******************************************.

Now bean plots are just the above plots rotatated 90 degrees, adding a reflection of the distribution (so the area of the density is represented in two dimensions), and then further paneled by another categorical variable. To do the reflection, one has to create a fake variable equal to the first variable used for the density estimate. But after that, it is just knowing alittle GGRAPH magic to make the plots.

`******************************************.`

compute V2 = V1.

varstocases

/make V from V1 V2

/index panel_dum.

GGRAPH

/GRAPHDATASET NAME="graphdataset" VARIABLES=V panel_dum groups MISSING=LISTWISE REPORTMISSING=NO

/GRAPHSPEC SOURCE=INLINE.

BEGIN GPL

SOURCE: s=userSource(id("graphdataset"))

COORD: transpose(mirror(rect(dim(1,2))))

DATA: V=col(source(s), name("V"))

DATA: panel_dum=col(source(s), name("panel_dum"), unit.category())

DATA: groups=col(source(s), name("groups"), unit.category())

TRANS: zero = eval(10)

GUIDE: axis(dim(1), label("V1"))

GUIDE: axis(dim(2), null())

GUIDE: axis(dim(3), null())

SCALE: linear(dim(2), min(0))

ELEMENT: area(position(density.kernel.epanechnikov(V*1*panel_dum*1*groups)), transparency.exterior(transparency."1.0"), transparency.interior(transparency."0.4"),

color.interior(color.grey), color.exterior(color.grey)))

ELEMENT: interval(position(V*zero*panel_dum*1*groups), transparency.exterior(transparency."0.8"))

END GPL.

******************************************.

Note I did not label the density estimate anymore. I could have, but I would have had to essentially divide the density estimate by two, since I am showing it twice (which is possible, and if you wanted to show it you would omit the `GUIDE: axis(dim(2), null())`

command). But even without the axis they are still reasonable for relative comparisons. Also note the `COORD`

statement for how I get the panels to mirror each other (the transpose statement just switches the X and Y axis in the charts).

I just post hoc edited the chart to get it to look nice (in particular settign the spacing between the `panel_dum`

panel to zero and making the panel outlines transparent), but most of those things can likley be more steamlined by making an appropriate chart template. Two things I do not like, which I may need to edit the chart template to be able to accomplish anyway; 1) There is an artifact of a white line running down the density estimates, (it is hard to see with the rug, but closer inspection will show it), 2) I would prefer to have a box around all of the estimates and categories, but to prevent a streak running down the middle of the density estimates one needs to draw the panel boxes without borders. To see if I can accomplish these things will take further investigation.

This framework is easily extended to the case where you don't want a reflection of the same variable, but want to plot the continuous distribution estimate of a second variable. Below is an example, and here I have posted the syntax in entirety used in making this post. In there I also have an example of weighting groups inversely proportional to the total items in each group, which should make the area of each group equal.

In this example of comparing groups, I utilize dots instead of the bar rug, as I believe it provides more contrast between the two distributions. Also note in general I have not superimposed other summary statistics (some of the bean plots have quartile lines super-imposed). You could do this, but it gets a bit busy.

#datavisualization

#small-multiples

#SPSS

#SPSSStatistics

#Visualization

6 comments

6 views

Archive User

Tue January 27, 2015 01:56 AM

Thanks, Andrew, and also thanks for the link. Interesting insights on within-the-bar bias and overestimating differences with barchart error margins. Personally I liked the idea to show errors as gradually disappearing bars around statistic.

If they do a good job showing differences that would be masked with summary statistics or other graphs I think they have a good argument for there use. Also see Error Bars Considered Harmful: Exploring Alternate Encodings for Mean and Error (Correll & Gleicher, 2014), for some experimental evidence that they are more effective for showing statistical tests (and not just kde's of the original data).

Anton

Hi Louise,

In theory, something like `ELEMENT: point(position(summary.median(V*1*panel_dum*1*groups)), color.exterior(panel_dum), shape("Median"), size(size.large))`

should work, although my quick attempts to get it to act as desired were unsucessful. I also tried to make an actual summary variable via the `TRANS`

command in inline GPL and that did not work either. So what I ended up doing was making a new variable and plotting that in its own element statement. Note when you have multiple elements like this the legend gets a bit un-wieldy, and what I have been doing is editing post-hoc in a different vector editor to get the legend to how I want it. Part of the problem I think is that SPSS does not like mapping the same aesthetic to different types of elements, so sometimes it gives errors when trying to construct the legend.

Below is an example using the same data that is in the last plot in my blog post.

AGGREGATE

/OUTFILE=* MODE=ADDVARIABLES

/BREAK=groups panel_dum

/V_median=MEDIAN(V).

GGRAPH

/GRAPHDATASET NAME="graphdataset" VARIABLES=V V_median panel_dum groups MISSING=LISTWISE REPORTMISSING=NO

/GRAPHSPEC SOURCE=INLINE.

BEGIN GPL

SOURCE: s=userSource(id("graphdataset"))

COORD: transpose(mirror(rect(dim(1,2))))

DATA: V=col(source(s), name("V"))

DATA: V_median=col(source(s), name("V_median"))

DATA: panel_dum=col(source(s), name("panel_dum"), unit.category())

DATA: groups=col(source(s), name("groups"), unit.category())

TRANS: zero = eval(20)

TRANS: med_point = eval(40)

GUIDE: axis(dim(1), label("V1"))

GUIDE: axis(dim(2), label("Frequency"))

GUIDE: axis(dim(3), null())

GUIDE: legend(aesthetic(aesthetic.color.exterior))

GUIDE: legend(aesthetic(aesthetic.color.interior))

GUIDE: legend(aesthetic(aesthetic.shape.interior), null())

GUIDE: legend(aesthetic(aesthetic.transparency.exterior), null())

GUIDE: legend(aesthetic(aesthetic.transparency.interior), null())

GUIDE: legend(aesthetic(aesthetic.size), null())

SCALE: linear(dim(2), min(0))

SCALE: cat(aesthetic(aesthetic.shape.interior), map(("Median", shape.square), ("Rug", shape.circle)))

SCALE: cat(aesthetic(aesthetic.transparency), map(("Median", transparency."0.0"), ("Rug", transparency."0.9")))

SCALE: cat(aesthetic(aesthetic.size), map(("Median", size."8"), ("Rug", size."4")))

ELEMENT: point(position(V*zero*panel_dum*1*groups), color.exterior(panel_dum), color.interior(panel_dum), transparency.interior("Rug"), transparency.exterior("Rug"),

shape.interior("Rug"), size("Rug"))

ELEMENT: point(position(V_median*zero*panel_dum*1*groups), color.exterior(panel_dum), color.interior(panel_dum), transparency.interior("Median"), transparency.exterior("Median"),

shape.interior("Median"), size("Median"))

ELEMENT: area(position(density.kernel.epanechnikov(V*1*panel_dum*1*groups)), transparency.exterior(transparency."1.0")), color(panel_dum), transparency.interior(transparency."0.5")))

END GPL.

Thanks for the comment and if you have any other questions let me know here in the comments or feel free to shoot me an email.

Andy

Thanks for posting the syntax - it was really useful to follow the steps you took to make these. I'm just wondering how you could add a coloured point to distinguish the median in each group- would it be something like this:

ELEMENT: point(position(summary.mean(V*1*panel_dum*1*groups))?

thanks in advance for your help!

louise