How to R: Making Better Histograms
The story you tell with your data visualizations is only as good as the visualizations themselves. In this article we will look at ways to make histograms more insightful for your audience.
We will use the TidyTuesday Horror Movies dataset, which can be pulled into R with the following code. We will build histograms from the vote_average column. Because many rows have a vote_count of 0, we will remove them so they do not skew the average values.
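A minimal sketch of the loading step is shown here. The URL is an assumption: it points at the 2022-11-01 TidyTuesday release on GitHub, which publishes a Horror Movies file with vote_average and vote_count columns, so adjust it if the file has moved. The object name horror_movies is simply the name chosen for this sketch.

```r
library(readr)
library(dplyr)

# Assumed location of the TidyTuesday Horror Movies file (2022-11-01 release)
horror_url <- paste0(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/",
  "master/data/2022/2022-11-01/horror_movies.csv"
)

horror_movies <- read_csv(horror_url) |>
  # Drop movies nobody voted on, so their empty ratings do not skew the averages
  filter(vote_count > 0)
```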
To begin, let’s start with a basic histogram, exactly as it comes, without any formatting.
As we can see, the distribution is roughly normal, with peaks at certain points such as the 10 mark. There are a couple of things we can add to make this histogram easier for our audience to understand.
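A bare-bones version might look like the sketch below, assuming ggplot2 is the plotting package. The stand-in data frame holds made-up values and exists only so the snippet runs on its own; it is skipped when the filtered horror_movies data is already loaded.

```r
library(ggplot2)

# Stand-in data (made-up values, not the real dataset) so the snippet runs on
# its own; skipped when the filtered `horror_movies` is already loaded.
if (!exists("horror_movies")) {
  set.seed(1)
  horror_movies <- data.frame(vote_average = round(runif(200, 1, 10), 1))
}

# No binning or styling arguments: ggplot2 falls back to its default of 30 bins
p_basic <- ggplot(horror_movies, aes(x = vote_average)) +
  geom_histogram()

p_basic
```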
1. Bin widths
We want the bin widths to be easy for the audience to understand. In this case, votes are on a discrete 0–10 scale. Right now it is hard to tell how wide each bin is. My guess is a little less than 0.5, but the fact that it is hard to tell makes the chart less interpretable for others. Let’s set the bin width to 1, so that there should be a total of 10 bins.
Here it is clearer how the ratings are distributed in relation to how votes are cast for each movie. Someone can vote for a 1 rating or a 9 rating, but not for a 3.45 rating.
What I find interesting right off the bat with this new binning, and did not notice before, is that after the 7 rating the 8 and 9 ratings are quite low and do not decrease incrementally in the same way that 1, 2, and 3 increased. This makes me wonder whether people who vote positively for a movie go straight to a 10 rating instead of an 8 or 9.
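One way to set the bin width, sketched with ggplot2's binwidth argument. With a width of 1 and the default bin alignment, each bin is centered on a whole number, so every bar collects the ratings that round to that value. The integer axis breaks are an extra touch added in this sketch, and the stand-in data is made up.

```r
library(ggplot2)

# Stand-in data (made-up values) so the snippet runs on its own; skipped when
# the filtered `horror_movies` is already loaded.
if (!exists("horror_movies")) {
  set.seed(1)
  horror_movies <- data.frame(vote_average = round(runif(200, 1, 10), 1))
}

p_binned <- ggplot(horror_movies, aes(x = vote_average)) +
  # One bin per whole rating
  geom_histogram(binwidth = 1) +
  # Label every whole rating on the x-axis
  scale_x_continuous(breaks = 0:10)

p_binned
```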
2. Data Labels
One thing you want to do is make it as easy as possible for your audience to grasp the size of each bin, not just the sizes of the bins relative to one another. For now, anyone you present this to has to use the y-axis to work out how many movies have a 5 rating. This can be done, but it takes your audience time to get that information. To make the chart clear and easy to read, we can add data labels to the top of each bar.
Now we can quickly tell that there are 5,312 movies that received a 5 rating, while 683 received a 10 rating. This is much easier than going to the y-axis and working it out from the tick marks.
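Data labels can be added with a second stat_bin() layer that uses the same bin width but draws text instead of bars. This is a sketch of one approach, not necessarily the one behind the original chart. The thousands separator comes from scales::comma(), and the stand-in data is made up.

```r
library(ggplot2)

# Stand-in data (made-up values) so the snippet runs on its own; skipped when
# the filtered `horror_movies` is already loaded.
if (!exists("horror_movies")) {
  set.seed(1)
  horror_movies <- data.frame(vote_average = round(runif(200, 1, 10), 1))
}

p_labels <- ggplot(horror_movies, aes(x = vote_average)) +
  geom_histogram(binwidth = 1) +
  # Same binning as the bars; print each bin's count just above its bar
  stat_bin(
    binwidth = 1,
    geom = "text",
    aes(label = after_stat(scales::comma(count))),
    vjust = -0.5
  ) +
  scale_x_continuous(breaks = 0:10)

p_labels
```

Keeping binwidth identical in both layers matters: if the two differ, the labels no longer line up with the bars they describe.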
3. Descriptive Statistics
Assuming you are presenting this chart and only this chart to your audience, there are a few key descriptive statistics that are still missing. Your audience would probably be most interested in the mean rating for the entire population of horror movies voted on, as well as the total number of movies that were voted on.
There are a couple of places we can add both of these: as part of the title (a step we will add below), or inside the graph itself. In my experience, the title serves well for some things but not necessarily for this information. Your audience will be much more likely to notice these numbers if they are added to the graph.
The total number of movies is easily seen at 20,938, while the mean rating is at 5.18. Because this information is now easy to find, your audience can make decisions more easily.
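Placing the statistics inside the graph can be sketched with a vertical line at the mean plus two annotate() text layers. The label positions and the dashed line are guesses made for this sketch, and the stand-in data is made up.

```r
library(ggplot2)

# Stand-in data (made-up values) so the snippet runs on its own; skipped when
# the filtered `horror_movies` is already loaded.
if (!exists("horror_movies")) {
  set.seed(1)
  horror_movies <- data.frame(vote_average = round(runif(200, 1, 10), 1))
}

avg_rating <- mean(horror_movies$vote_average)
n_movies <- nrow(horror_movies)

p_stats <- ggplot(horror_movies, aes(x = vote_average)) +
  geom_histogram(binwidth = 1) +
  stat_bin(binwidth = 1, geom = "text",
           aes(label = after_stat(scales::comma(count))), vjust = -0.5) +
  # Mark the mean rating on the x-axis
  geom_vline(xintercept = avg_rating, linetype = "dashed") +
  # y = Inf pins each label to the top of the panel; vjust pulls it inside
  annotate("text", x = avg_rating, y = Inf, hjust = -0.05, vjust = 1.5,
           label = paste("Overall Average Rating:", round(avg_rating, 2))) +
  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5,
           label = paste("Number of Movies:", scales::comma(n_movies))) +
  scale_x_continuous(breaks = 0:10)

p_stats
```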
4. Colors/Highlighting
Now that we have most of our graph set up for the presentation, we can think about another key aspect of our visual: the colors and highlighting. For now the bars are set in a dark grey with a grey background. We see the information we need to see, but it is pretty bland. Let’s add some color to it and highlight our descriptive statistics so they pop a bit more.
Keeping in mind that these movies are from a Horror Movies dataset, I used colors that I found tied to a horror color palette. It is always good to try to use a color palette tied to the project at hand. It would be odd if I presented horror movie ratings to my audience in the colors of the rainbow instead of a palette that better matched the dataset I was working with.
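A sketch of the coloring step is shown here. The three hex values are hypothetical stand-ins for a horror-style palette, not the colors used in the original chart, so swap in whatever palette fits your project. The stand-in data is made up.

```r
library(ggplot2)

# Stand-in data (made-up values) so the snippet runs on its own; skipped when
# the filtered `horror_movies` is already loaded.
if (!exists("horror_movies")) {
  set.seed(1)
  horror_movies <- data.frame(vote_average = round(runif(200, 1, 10), 1))
}

# Hypothetical horror-style palette (illustrative values only)
blood_red <- "#8A0303"
charcoal <- "#1B1B1B"
pumpkin <- "#E25822"

avg_rating <- mean(horror_movies$vote_average)
n_movies <- nrow(horror_movies)

p_color <- ggplot(horror_movies, aes(x = vote_average)) +
  # fill colors the bars, color draws their outlines
  geom_histogram(binwidth = 1, fill = blood_red, color = charcoal) +
  stat_bin(binwidth = 1, geom = "text", color = charcoal,
           aes(label = after_stat(scales::comma(count))), vjust = -0.5) +
  # Highlight the descriptive statistics in a contrasting color
  geom_vline(xintercept = avg_rating, linetype = "dashed", color = pumpkin) +
  annotate("text", x = avg_rating, y = Inf, hjust = -0.05, vjust = 1.5,
           color = pumpkin,
           label = paste("Overall Average Rating:", round(avg_rating, 2))) +
  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5,
           color = pumpkin,
           label = paste("Number of Movies:", scales::comma(n_movies))) +
  scale_x_continuous(breaks = 0:10)

p_color
```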
5. Formatting and Titles
Finally, let’s put this all together. There are still some missing pieces in our chart, such as adding the title, updating the axis titles, and taking care of formatting across our various labels. This should help us bring everything together and continue to make the chart easier to read.
I updated the formatting on many things in order to present the data better. Going from top to bottom through the code:
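The finished chart could be assembled along the lines of the sketch below. Only the data label size of 6 comes from the text; the title, axis titles, palette, other sizes, and line settings are placeholders chosen for illustration. The linewidth argument assumes ggplot2 3.4 or later (older versions use size for lines), and the stand-in data is made up.

```r
library(ggplot2)

# Stand-in data (made-up values) so the snippet runs on its own; skipped when
# the filtered `horror_movies` is already loaded.
if (!exists("horror_movies")) {
  set.seed(1)
  horror_movies <- data.frame(vote_average = round(runif(200, 1, 10), 1))
}

# Hypothetical horror-style palette (illustrative values only)
blood_red <- "#8A0303"
charcoal <- "#1B1B1B"
pumpkin <- "#E25822"

avg_rating <- mean(horror_movies$vote_average)
n_movies <- nrow(horror_movies)

p_final <- ggplot(horror_movies, aes(x = vote_average)) +
  geom_histogram(binwidth = 1, fill = blood_red, color = charcoal) +
  # Larger data labels
  stat_bin(binwidth = 1, geom = "text", color = charcoal, size = 6,
           aes(label = after_stat(scales::comma(count))), vjust = -0.5) +
  # Average line matches its label color and is drawn thicker
  geom_vline(xintercept = avg_rating, color = pumpkin,
             linetype = "dashed", linewidth = 1.2) +
  # Italic, larger statistic labels pinned to the top of the panel
  annotate("text", x = avg_rating, y = Inf, hjust = -0.05, vjust = 1.5,
           color = pumpkin, size = 6, fontface = "italic",
           label = paste("Overall Average Rating:", round(avg_rating, 2))) +
  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5,
           color = pumpkin, size = 6, fontface = "italic",
           label = paste("Number of Movies:", scales::comma(n_movies))) +
  scale_x_continuous(breaks = 0:10) +
  # No gap below the bars; extra headroom keeps the labels clear of the bars
  scale_y_continuous(expand = expansion(mult = c(0, 0.25))) +
  # Placeholder title and axis titles
  labs(title = "Horror Movie Ratings",
       x = "Average Rating", y = "Number of Movies") +
  theme(
    panel.background = element_blank(),        # remove the grey background
    axis.line = element_line(color = "black"), # black line on each axis
    plot.title = element_text(size = 20),
    axis.title = element_text(size = 14)
  )

p_final
```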
- Increased the size of all data labels to size = 6. With them being bigger, they are easier to read.
- Updated the color, type, and size of the average rating line. Now it matches the color of its label and is a little bigger, so it stands out more.
- Made the labels for “Number of Movies” and “Overall Average Rating” bigger and pushed them further up the y-axis, so there is more space between them and the data labels. I italicized these to make the point that they are special, descriptive statistics.
- Updated the theme of the graph by removing the grey background and adding black lines for each axis. I then reset the font size for the title (which I also added) and the axis titles.
- Set the bars of the histogram to be flush with the x-axis, and removed the buffer that was there before.
This is only one of many ways to make a histogram better. Thank you for reading. Here is another color variation we could have used for the graph.