A 'tool-kit' to visualize big data

Rotem_D
Jan 12, 2021

While the long winter break delayed this post, it also helped me come up with more ways to explore this week’s data. The end result is a mix of visualizations, including tabulations of the data, all offering ways to present a large amount of information. (The post is a little long this week; for the TL;DR version, use the links below to jump to your preferred section.)

Data

This week, I use data collected by Rami Krispin in the R package USgas. The package provides data on the consumption of natural gas in the US over time. It includes datasets for the total US consumption (state-level, 1997-2019), monthly consumption, and residential monthly consumption by state (1989-2020). The structure of these datasets allows a researcher to unpack and then display the information in a variety of ways. In this blog post, I present some of the possible visualization options (additional versions and plot options are detailed on my GitHub page).

Visuals 1: Map

I begin with a map that presents the highest consumption by state over the entire time frame. Using the US total dataset, I add variables that identify the highest consumption for each state and the year in which that maximum occurred. After some wrangling of the data (removing unnecessary input and focusing on the lower 48 states), I join this data with location information for the US states.
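A minimal sketch of the "highest consumption and its year" step, using dplyr. The toy data frame, its column names (year, state, y) and all values are assumptions made for illustration, not the actual USgas data or the code from the GitHub page.

```r
library(dplyr)

# Toy stand-in for the US total dataset (hypothetical columns and values)
us_total_toy <- data.frame(
  year  = c(2017, 2018, 2019, 2017, 2018, 2019),
  state = c("Texas", "Texas", "Texas", "Ohio", "Ohio", "Ohio"),
  y     = c(100, 120, 150, 40, 55, 50)
)

# For each state, keep the single row with the highest consumption,
# then rename the columns to make their meaning explicit
max_by_state <- us_total_toy %>%
  group_by(state) %>%
  slice_max(y, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  rename(max_year = year, max_y = y)

print(max_by_state)
```

The result has one row per state, which is the shape needed for a join with state-level location data.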

In the map below, I present data that focus on the highest level of consumption for each state in the entire time frame. The color gradient demonstrates the change in the level of "highest consumption".

How do we interpret this map? The states in red and related shades (such as Texas or California) have the largest values among all the state-level maximums. The states in blue shades are also shown at their own highest consumption, but those peaks are much lower than in states like Texas (for example, Arizona has higher max consumption than New Mexico, but lower than California or Texas).
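One possible way to draw such a map with ggplot2 is sketched below. It assumes the maps package is installed (map_data() relies on it), and the three consumption values are made up; the blue-to-red gradient is only an approximation of the color scheme described above.

```r
library(dplyr)
library(ggplot2)  # map_data("state") also needs the 'maps' package installed

# Hypothetical max-consumption values for three states (not real figures)
max_by_state <- data.frame(
  region = c("texas", "california", "new mexico"),
  max_y  = c(150, 110, 20)
)

# Outlines of the lower 48 states, joined on the lower-case state name
states_map <- map_data("state") %>%
  left_join(max_by_state, by = "region")

# States without a value in the toy data are drawn in light grey
p_map <- ggplot(states_map,
                aes(x = long, y = lat, group = group, fill = max_y)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(low = "blue", high = "red", na.value = "grey90") +
  coord_quickmap() +
  labs(fill = "Max consumption")
```

Calling print(p_map) renders the map.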

Visuals 2: Treemaps

The lower-48 states map offers a general description of the total consumption data. In order to incorporate the year of max consumption, I use a treemap plot. This type of visual display makes it possible to show the relative size of the data (in this case, the amount of max consumption) and to separate observations by both state and year. The plot below uses variations in the rectangles’ size and color to show the changes in the amount of max consumption. In addition, it is organized in blocks of years.

How do we read this plot? Take 2019 as an example: Texas is depicted with the biggest rectangle and the color pink. Both indicate that its max level of consumption was registered in 2019, and that it was the highest among all the states that had their own max consumption that year (for instance, Ohio and Florida had smaller amounts than Texas in 2019). For 2000, California is the only state with max consumption in that year; the same holds for New York in 2015 (other versions of the data using treemaps can be generated using the full code, see my GitHub).
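A treemap grouped by year can be sketched with the treemapify package, which is an assumption here: the post does not say which treemap package was used. The years follow the examples in the text, while the consumption values are invented.

```r
library(ggplot2)
library(treemapify)

# Years follow the examples in the text; the values are made up
max_by_state <- data.frame(
  state    = c("Texas", "Ohio", "Florida", "California", "New York"),
  max_year = factor(c(2019, 2019, 2019, 2000, 2015)),
  max_y    = c(150, 40, 45, 110, 60)
)

# Rectangle area and fill both encode max consumption;
# subgroup places states into blocks by year of max consumption
p_tree <- ggplot(max_by_state,
                 aes(area = max_y, fill = max_y,
                     label = state, subgroup = max_year)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "topleft") +
  geom_treemap_text(place = "centre")
```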

Visuals 3: Share of total gas consumption (monthly data)

The USgas package also includes a dataset of residential monthly consumption by states. One interesting way to look at this data is to compute the share of consumption for several states out of the whole US consumption.

This analysis requires reshaping the data into a wide format that displays the time as one column (month-year, 1989-2020), with separate columns for the monthly consumption of each state. After running this procedure, I explored the data manually and identified the eight states that had high levels of consumption over multiple month-year observations.
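The long-to-wide reshape can be sketched with tidyr's pivot_wider. The toy data frame and its column names (date, state, y) are assumptions for illustration only.

```r
library(tidyr)

# Toy long-format data: one row per date and state (hypothetical values)
res_long <- data.frame(
  date  = as.Date(c("1993-06-01", "1993-06-01", "2012-01-01", "2012-01-01")),
  state = c("California", "Texas", "California", "Texas"),
  y     = c(30, 10, 80, 40)
)

# One row per date, one column per state
res_wide <- pivot_wider(res_long, names_from = state, values_from = y)

print(res_wide)
```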

Next, for those top 8 states, I generated a separate dataset of their mean values of consumption, and then computed the share that each of the top 8 represented out of the total US consumption (per month). One last variable in this dataset accounts for the share of consumption for the Rest - all states other than the top 8.
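The share computation can be sketched in base R as follows. The us_total column, the two "top" states and all numbers are hypothetical; the real analysis uses eight states.

```r
# Toy wide-format data with a hypothetical US total column
res_wide <- data.frame(
  date       = as.Date(c("1993-06-01", "2012-01-01")),
  California = c(30, 80),
  Texas      = c(10, 40),
  us_total   = c(100, 400)
)
top_states <- c("California", "Texas")

# Share (in percent) of each top state out of the US total, per month
shares <- res_wide
for (s in top_states) {
  shares[[paste0(s, "_share")]] <- 100 * shares[[s]] / shares$us_total
}

# Everything not covered by the top states goes into "Rest"
share_cols <- paste0(top_states, "_share")
shares$Rest_share <- 100 - rowSums(shares[share_cols])

print(shares)
```

By construction, the top-state shares and the Rest share add up to 100 for every month.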

To visually display the 'share data', I decided to use pie charts. However, since the data measures consumption per month, it requires focusing on specific month-year observations. I chose January 2012 and June 1993. My logic was to focus on separate periods (the early 1990s versus the 2010s) and different times of the year (winter and early summer) so we can try to identify interesting patterns in the monthly consumption data.

The figure below uses a variation of pie charts (donut charts) in which the share of each of the top 8 is displayed, along with the share of the rest. The data is separated into two plots for the months I chose: January 2012 and June 1993.
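A donut chart can be built in ggplot2 from a stacked bar drawn in polar coordinates, as in this sketch. The shares are toy values and the layout choices are assumptions, not the styling of the original figure.

```r
library(ggplot2)

# Toy shares (in percent) for a single month; must sum to 100
shares_month <- data.frame(
  group = c("California", "Texas", "Rest"),
  share = c(20, 10, 70)
)

# A single stacked bar at x = 2 (spanning 1.5 to 2.5), wrapped into a ring;
# the unused range between 0.5 and 1.5 becomes the hole of the donut
p_donut <- ggplot(shares_month, aes(x = 2, y = share, fill = group)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +
  theme_void() +
  labs(title = "Share of consumption (toy data)", fill = NULL)
```

Building one such plot per month and placing them side by side gives the two-panel comparison described above.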

One thing to learn from these figures is the change in California's share of consumption, which dropped by more than 6%, while the shares of the other top 8 states remained relatively stable. Finding causes for this change requires more information, but it may be a function of the weather (warmer in California in the winter) or perhaps the introduction of other energy sources (solar panels or wind turbines).

Visuals 4: Tabulate the data

Visual displays refer mostly to plots and figures. However, we can also use tables to display different bits of the data. Here, I show some examples using the kableExtra package.

First, based on the US total dataset, I create several smaller datasets in which I group the consumption data by state and then compute several indicators: minimum and maximum consumption and the relevant year for each, as well as the mean, median and IQR (Inter-Quartile Range) of consumption. I complete the data preparation phase by joining all these into a single dataset of all states.
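These indicators can also be computed in a single grouped summary, shown here as a compact alternative to building and joining several smaller datasets. The data frame, its column names and its values are toy assumptions.

```r
library(dplyr)

# Toy stand-in for the US total dataset (hypothetical values)
us_total_toy <- data.frame(
  year  = rep(2016:2019, times = 2),
  state = rep(c("Texas", "Ohio"), each = 4),
  y     = c(100, 120, 140, 160, 40, 60, 50, 30)
)

# One row per state with min/max (and their years), mean, median and IQR
state_stats <- us_total_toy %>%
  group_by(state) %>%
  summarise(
    min_y    = min(y),
    min_year = year[which.min(y)],
    max_y    = max(y),
    max_year = year[which.max(y)],
    mean_y   = mean(y),
    median_y = median(y),
    iqr_y    = IQR(y)
  )

print(state_stats)
```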

While it's possible to show the full data, doing so requires a large table that is hard to grasp on a smaller screen. Instead, I present below a table for the top 10 in consumption (mean is larger than 65,000 per year), and add to the table the values for these states' minimum consumption (and year) as well as the mean, median and IQR. At first glance, it seems that population size and, to some extent, territory size matter for these figures.

In addition, I generate a similar table for the bottom 10 in consumption (with the same columns as the table above). Again, population size seems to matter for these figures (for other versions of these tables, check the full code in the GitHub file).

Lastly, I wanted to incorporate other geo-type data into this kind of visualization. In order to do that, I used the census classification of the US into four regions and nine divisions (I forked a Git project that compiles this census data into an Excel file). I combined both data files and included additional columns for each state’s code (abbreviated name) and geographic region and division. The data is organized into layers/groups, starting with the census’ four regions, then the nine divisions, and finally in alphabetical order within each subgroup.

The final version of the table is quite big, so I used one of the save options for kableExtra tables and saved it as an .html file. Below is a screenshot showing part of the full table. This type of display makes it easier to present a large amount of data on one page, and reduces confusion.
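A grouped table saved to HTML can be sketched with kableExtra as follows. The table contents, the output file name and the styling options are assumptions for illustration; only two census regions are shown.

```r
library(kableExtra)

# Toy table, already sorted by census region (hypothetical mean values)
tab <- data.frame(
  State = c("Connecticut", "Maine", "Florida", "Texas"),
  Code  = c("CT", "ME", "FL", "TX"),
  Mean  = c(10, 5, 60, 150)
)

# pack_rows() adds a labeled group header above each block of rows
tbl <- kbl(tab, caption = "Toy consumption table grouped by census region") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  pack_rows(index = c("Northeast" = 2, "South" = 2))

# Save as HTML; self_contained = FALSE avoids the need for pandoc
out_file <- file.path(tempdir(), "consumption_table.html")
save_kable(tbl, file = out_file, self_contained = FALSE)
```

The same pattern extends to a second layer of headers (for example, divisions within regions) by calling pack_rows() again with a different index.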

To recap this (rather long) project: my main objective was to present various visualization options for big data. I used different datasets and presented them in plots and figures, but also in tables in which the information was layered/grouped for easier interpretation.

While the original datasets involve a limited set of variables (year, state and consumption data), I leveraged simple descriptive stats commands to generate additional information (maximum, minimum, mean, median and IQR) and displayed it. Another solution to ‘get more out of the data’ is to group it by years, or geographic location. Lastly, I re-shaped the data into a wide format. (A note on tidyr naming: gather moves data from wide to long, while spread, or the newer pivot_wider, produces the wide layout described here.) These operations opened up more ways to analyze and present this type of data (in my case, the share of monthly consumption by states).

The plots and maps in this post were generated with the ggplot2 package. The tables were generated with the kableExtra package. If you are interested in the full code details, as always, it is available on my GitHub page.

Rotem Dvir
