Project 2 - Every Breath You Take


Visualization link

Sai Krishnan Thiruvarpu Neelakantan
Praveen Chandrasekaran
Varsha Jayaraman
Abdullah Aleem

Purpose of the project:
`I’m supposed to be in school, but instead I’m out here trying to make sure that my kids don’t grow up in a wasteland.` — Arielle Geismar, 17, An excerpt from The New York Times.
This statement was made by a teenage student from Manhattan about climate change. When many such youngsters took to the streets on March the sixteenth, worried about their descendants at such a young age, it is clear that something is not going the way it should. Today, climate change is an impending doom that we face collectively. It is not something we can afford to be negligent or ignorant about.
But, is it just a myth? Are we really doomed? Where's the proof? If true, how do we go about it? What do we do about it? WHAT DO WE KNOW ABOUT IT?
In this epoch, there is very little that we cannot do to change a situation for the better. This is a time when we can conquer the world with a few keystrokes on our electronic devices. But how do we use this superpower to solve our crisis? By studying data, by drawing meaningful patterns from it, and by inferring the sequence of events that leads to a consequence, which can serve as a forewarning and save millions in terms of nature and other resources.
This is precisely what we aim to do with our project, Every Breath You Take. We have collated the Air Quality Index (AQI) data of the United States of America to kick off what could serve as a forceful reminder of where we are headed with respect to our existence. We have collected data on various attributes related to the Air Quality Index, such as wind speed, temperature, and the levels of pollutants in the air such as carbon monoxide, nitrogen dioxide, sulphur dioxide, and ozone. We have analysed the trends of these components from 1990 to 2018 to establish a significant change in our ecosystem. We have examined the data from all angles and explored various means of communicating this to our brethren.

The application analyses the air quality data for different states and counties across the US from 1990 to 2018. The data is categorized into annual, daily, and hourly data files. The application presents the same data in multiple ways (pie chart, bar chart, line chart, heat map, table) to derive insights from the changing trends over the years. In addition to the US, the application also analyses air quality data for Hong Kong. We have experimented with different layouts with user experience in mind.

Once you click on the link, the application shows five tabs in the sidebar for navigation. A detailed description of each tab's functionality follows.

Annual Data:
Under the Annual tab, the application lets the user choose a year from 1990 to 2018, then a state and county (either from the default list or from the list of all states/counties).
1) The user can see the AQI category days and pollutant data as a pie chart, a bar chart, and a table.
2) The user can then see the changing trends in pollutants and AQI values over the years from 1990 to 2018. The user can choose which pollutant or type of AQI value to plot in the line graph.
3) The same data is then displayed as a table of percentages, in which the user can search for a particular year.
4) The application then displays a heat map for the pollutants, showing the relative badness of the selected pollutant across the US. The user can choose the pollutant from the options to see that pollutant's effect across the US. A slider sets how many of the worst-affected counties are shown for that pollutant. For example, if the user sets the slider to 100, the map shows the 100 counties most affected by the selected pollutant. The user can also choose the heat map color and value breaks, and can hover over the map to see the details of each county.
5) The same kind of heat map is provided for the AQI values. The user can choose between the Max AQI, 90th Percentile AQI, and Median AQI values for the selected year, and can choose the color and breaks here as well.
6) The user sees a county map showing the location of the selected county in the United States.
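
The slider behaviour in item 4 amounts to ranking counties by a badness measure and keeping the top N. The project's own code is written in R and is not shown here; the following is only a minimal Python sketch of that ranking step. The field name `pct_bad_days` and the sample rows are hypothetical.

```python
def top_counties(rows, n):
    """Return the n rows with the highest 'pct_bad_days', worst first.

    'pct_bad_days' is a hypothetical field name standing in for the
    percentage of bad days recorded for the selected pollutant.
    """
    ranked = sorted(rows, key=lambda row: row["pct_bad_days"], reverse=True)
    return ranked[:n]


# Made-up sample rows, for illustration only.
sample = [
    {"state": "A", "county": "X", "pct_bad_days": 12.5},
    {"state": "A", "county": "Y", "pct_bad_days": 40.0},
    {"state": "B", "county": "Z", "pct_bad_days": 3.1},
]
worst_two = top_counties(sample, 2)
```

Setting the slider to 100 would correspond to calling `top_counties(rows, 100)` before the map is drawn.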

Daily Data:
Under the Daily Data tab:
1) The user can see a stacked bar chart showing the percentage of days in each AQI category for each month of the selected year.
2) The user can then see a line chart showing the AQI value of each day in the selected year, along with the major pollutant of that day.
3) The user can also see a table of the share of days in each AQI category for each month of the selected year.
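
The monthly percentages behind the stacked bar chart and the table can be computed by counting days per AQI category within each month. The sketch below is an illustrative Python version, not the project's R code; the input shape (pairs of ISO date and category name) is an assumption.

```python
from collections import Counter, defaultdict


def monthly_category_percentages(records):
    """Compute the share of days per AQI category for each month.

    records: iterable of (iso_date, category) pairs, for example
             ("2018-01-03", "Good"). This input shape is an assumption.
    Returns {month_number: {category: percent_of_days_in_that_month}}.
    """
    counts = defaultdict(Counter)
    for iso_date, category in records:
        month = int(iso_date[5:7])  # "YYYY-MM-DD" -> MM
        counts[month][category] += 1

    result = {}
    for month, counter in counts.items():
        total = sum(counter.values())
        result[month] = {cat: 100.0 * n / total for cat, n in counter.items()}
    return result
```

Each inner dictionary sums to 100, which is what lets the bars for different months be compared even when some months have missing days.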

Hourly Data:
This functionality is available only for the year 2018. The user can choose any date from the date picker in the tab. Once the user chooses the date, the application loads the list of data available for that date (pollutants, wind, and temperature). Using the checkboxes, the user can select any number of series to plot in the line graph, to compare the pollutants with the wind and temperature data. The user can also choose the units in which to view the data.

Hong Kong:
The data for Hong Kong is available for the last 90 days. The user can choose any date from the date picker. Once the user chooses a site (location), the available data for that date loads and the user can select, using the checkboxes, which pollutants to plot. The daily and monthly data is also plotted in a line graph for Hong Kong. The user can also see the location of the chosen site on a Leaflet map.

The fifth tab contains information about the coursework, who developed the project, which libraries are used to visualize the data, and the source from which the data was downloaded. It also contains a description of the application and the purpose of the project.

Additional information about Heatmap:
The tool used for the heatmap is "tmap". tmap takes input in "shp" form, which has a geometry entry for each county describing its boundaries. This data was taken from the website. We used the 2017 data at the 20m resolution level, because the task does not need very precise boundaries and a lower-resolution file loads more quickly.
Data Preprocessing (Map):
This data had state FIPS codes, unlike the air quality data, which had state names. We used the fips function from cdlTools to convert the codes into state names. This conversion is needed to join (left join) the two data sets. After the conversion there were still some differences in the names of states and counties between the two data sets, so we did some preprocessing to make the names consistent.
The changes we made in the "shp" file:
Counties column:
replaced "St. " with "Saint "
replaced "Ste. " with "Sainte "
replaced "LaSalle" with "La Salle"
replaced "Charles City" with "Charles"
replaced "Carson City" with "Carson"
replaced "Doña Ana" with "Dona Ana"
replaced "Cataño" with "Catano"
replaced "Bayamón" with "Bayamon"
State column:
replaced "Deleware" with "Delaware"
The changes we made in the aqi file:
removed white spaces at the beginning and end of each state and county entry (helps with Alaska)
replaced " City" with "" in county column
replaced "St. " with "Saint " in county column
replaced " Of " with " of " in state column (helps with District of Columbia)
Even after all these changes there were three counties for which we could not find a match: two were from the city of Mexico (ignored, because we are plotting only for America) and one was Baltimore (City) in Maryland, which had nothing corresponding to it in our cartographic boundary shape file and was therefore ignored.
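
The renaming rules listed above can be expressed as ordered (old, new) replacement pairs applied to each name before the join. The project does this in R; the following is only an illustrative Python sketch. The function names, rule-list names, and the `unmatched_keys` helper are hypothetical, while the replacement pairs themselves are taken from the lists above.

```python
# Replacement pairs copied from the lists above; applied in order.
SHAPE_COUNTY_RULES = [
    ("St. ", "Saint "), ("Ste. ", "Sainte "), ("LaSalle", "La Salle"),
    ("Charles City", "Charles"), ("Carson City", "Carson"),
    ("Doña Ana", "Dona Ana"), ("Cataño", "Catano"), ("Bayamón", "Bayamon"),
]
AQI_COUNTY_RULES = [(" City", ""), ("St. ", "Saint ")]
AQI_STATE_RULES = [(" Of ", " of ")]


def normalize(name, rules):
    """Trim surrounding white space, then apply each replacement in order."""
    name = name.strip()
    for old, new in rules:
        name = name.replace(old, new)
    return name


def unmatched_keys(aqi_keys, shape_keys):
    """Return the AQI (state, county) keys with no partner in the shape data.

    These are the rows a left join would leave without geometry.
    """
    shape_set = set(shape_keys)
    return [key for key in aqi_keys if key not in shape_set]
```

Running `unmatched_keys` after normalization is one way to find leftovers such as the three unmatched counties mentioned above.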
What you can do with the map:
The heat map shows the relative badness of different counties. It is an interactive map in which you can zoom in and out on any state. The map defaults to a zoom on the mainland United States, but you can easily zoom out if you are interested in Alaska, Hawaii or Puerto Rico. No names or values are shown on the map to avoid clutter, but you can click any county to see the state name, county name and the exact quantity being measured.
There are 4 inputs for the pollutant map and 3 for the aqi map. The first asks which quantity you want to see the map for. The second is a slider that asks the user to pick the number of counties to show data for. The data is sorted by relative badness, which helps the user analyse it however they want. The third input lets the user switch between two color palettes (both of them color-blind friendly). The fourth input asks the user to define the breaks used for the data. There are two options for this: 1) pretty (rounded equal intervals) and 2) natural breaks (uses the Jenks method to define breaks). With an interactive interface and this level of control, the user can interactively visualize the available data.

Libraries Used:


About the data

The data was collected from the US EPA website. You can download all the Annual, Daily, Hourly data files from
The data files for Hong Kong can be downloaded from
The data files for all Heatmaps can be downloaded from

Data preprocessing for this application was done for the daily and hourly data of the United States. Preprocessing was done separately for the Hong Kong data.
For the daily data, all the daily files from 1990 to 2018 were loaded and only the necessary columns were kept. The State.Code, County.Code, Defining.Site and Number.of.Sites.Reporting columns, which are not used by the application, were removed. This preprocessing reduced the total size of the daily files from approximately 650 MB to approximately 400 MB.
For the hourly data for 2018, we had 8 huge separate files, one for each pollutant plus wind and temperature. The first preprocessing step was to extract the dates for each county from all the files, so that the hourly section can show the user which data is available for the selected county and date. We extracted the dates from the 8 individual files and compiled them into a single file.
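
One way to picture the compiled availability file is as an index from (state, county, date) to the list of measures that have data on that date. The sketch below is an illustrative Python version under that assumption; the project's R code and its actual file layout are not shown in this document, and the function name and input shape are hypothetical.

```python
from collections import defaultdict


def build_availability(files):
    """Combine per-measure rows into a single availability index.

    files: {measure_name: iterable of (state, county, iso_date)}.
           This input shape is an assumption made for illustration.
    Returns {(state, county, iso_date): sorted list of measure names}.
    """
    index = defaultdict(set)
    for measure, rows in files.items():
        for state, county, iso_date in rows:
            index[(state, county, iso_date)].add(measure)
    return {key: sorted(measures) for key, measures in index.items()}
```

Looking up a selected county and date in this index gives the checkboxes that the hourly tab would offer.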
After this, we did further preprocessing for plotting, since such huge files cannot be loaded into the application directly. We took each file separately and divided it into 12 files, one per month. During this process, we removed the columns that the application does not need. The values were also converted into a single unit to give a common scale. Then we calculated the corresponding values in the alternate units (ppm to mg/m3).
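
For a gas, the ppm to mg/m3 conversion mentioned above is commonly computed as mg/m3 = ppm × M / 24.45, where M is the molar mass in g/mol and 24.45 L/mol is the molar volume of an ideal gas at 25 °C and 1 atm. The document does not state which reference conditions the project used, so those conditions are an assumption in this illustrative Python sketch (the project's own code is in R).

```python
# Molar volume of an ideal gas at 25 degrees C and 1 atm, in L/mol.
# The reference conditions are an assumption; the source does not state them.
MOLAR_VOLUME_L_PER_MOL = 24.45

# Approximate molar masses in g/mol.
MOLAR_MASS_G_PER_MOL = {"CO": 28.01, "NO2": 46.01, "SO2": 64.07, "O3": 48.00}


def ppm_to_mg_per_m3(ppm, pollutant):
    """Convert a gas concentration from ppm (by volume) to mg/m3."""
    return ppm * MOLAR_MASS_G_PER_MOL[pollutant] / MOLAR_VOLUME_L_PER_MOL
```

Particulate matter (PM2.5, PM10) is measured as a mass concentration to begin with, so this conversion applies only to the gaseous pollutants.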
Then, for plotting the heat map, we needed the pollutant data at a daily level. We took the preprocessed files from the previous step and calculated the daily value of each pollutant by aggregating the hourly values into a single value for that day. We made 12 separate month-wise files for the map data.
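
The hourly-to-daily aggregation can be sketched as grouping readings by calendar date and reducing each group to one number. The document does not say which aggregate was used, so the mean below is an assumption, as are the function name and the timestamp format. This is an illustrative Python sketch rather than the project's R code.

```python
from collections import defaultdict
from statistics import mean


def daily_means(hourly):
    """Aggregate hourly readings into one value per day.

    hourly: iterable of (timestamp, value) pairs where timestamp starts
            with "YYYY-MM-DD" (assumed format).
    The mean is an assumed aggregate; the source does not name one.
    Returns {iso_date: mean_value}, ordered by date.
    """
    by_day = defaultdict(list)
    for timestamp, value in hourly:
        by_day[timestamp[:10]].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}
```

Swapping `mean` for `max` would give a daily peak instead, which may suit some pollutants better.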
For Hong Kong, we had hourly files for 6 pollutants at 16 locations for the last 90 days. We made a separate file with the latitude and longitude of the 16 locations. After that, we removed the columns not used by the application from the hourly data and saved the result for the hourly plots. For the daily and monthly plots, we aggregated these hourly values to daily and monthly values.
To generate these preprocessed files, run the preprocessing code. Please follow the folder structure below when running it.
Raw Files/Daily --> Place the daily raw files for the US here
Raw Files/Monthly --> Place the Monthly files for the US here
Raw Files/Hong Kong --> Place the Individual pollutant file here
Place the preprocess.R file one folder level above the Raw Files directory and run it to generate the preprocessed files. This may take 10 to 15 minutes.

Source Code

Link to download the code (also contains the data files, preprocessing code, and Readme files for preprocessing.R and app.R)
YouTube Video

Role of Team Members

Praveen Chandrasekaran
1) Worked on the preprocessing of all the data files (Annual/Daily/Hourly/Hong Kong).
2) Worked on the line plots of the Daily/Hourly/Monthly data files for Hong Kong.

Sai Krishnan Thiruvarpu Neelakantan
1) Worked on allowing the user to choose any county, date (2018) to show different lines on the same line chart for hourly Ozone, SO2, CO, NO2, PM2.5, PM10, wind, and temperature. The functionality also has the ability to change the units.
2) Worked on the UI template with all the controls.

Abdullah Aleem
1) Worked on the Heatmap showing the locations of the top 100 counties (the number can be changed using the slider) with the worst problem for the selected pollutant, based on the annual data and ranked by the largest percentage of bad days for that pollutant.
2) Worked on the pollutant map to allow the user to show a heatmap for any of the pollutants for a given day in 2018.

Varsha Jayaraman
1) Worked on showing the Daily AQI data for the selected year and county as a line chart, including which pollutant contributed the most to the AQI on each of those days.
2) Worked on the stacked bar chart and table for each month of that year showing the number of days where the AQI was good / moderate / unhealthy for sensitive / unhealthy / very unhealthy / hazardous / unknown.

Finally, everybody contributed and worked towards the development of the website information.

Insights from the data

The AQI value has been predominantly good across the US, with slightly higher values in the southern part of California.
The AQI value for Hawaii has been the worst, while Alaska has been at a normal/good level. The value varies between good and moderate along most of the West Coast.

SO2 has been on a fluctuating decline since 1990 and has stayed at a minimum of 0 for the past 5 years, with a few spikes in Ohio, Indiana and Hawaii.
SO2 has very high values in Hawaii and the northern parts of New York, and has been good across most of the West Coast. It has seen more spikes than in 2010.
The presence of volcanoes in Hawaii could be a major contributor to the levels of sulphur dioxide there.

Ozone has followed an increasing curve overall.
Ozone levels range from moderate to the worst along the West Coast, and from worse to worst along the coast of Wisconsin and in many small counties along the East Coast.
Ozone has normal values for Hawaii, but is seen to spike across the midlands of Alaska. Levels are alarming in the Midwest, around Colorado, Arizona, Wyoming, Nevada, some parts of California and the cities along the East Coast.
The West Coast is the most rapidly advancing area in the technology space, so it is perhaps no wonder that it shows alarming ozone values.

Other Insights:
There is a sharp spike in the levels of PM2.5 in 2009.
PM2.5 values are too high in Washington, parts of Oregon, Idaho and some parts of Montana.
The increase in levels of PM2.5 has been researched and is generally attributed to accumulated fine particulate matter from frequent wildfires along the West Coast.
CO has never risen, and for more than the last 15 years it has been at its lowest value of 0.
The US has had a good value for CO across its territory.
CO has a good value across the West Coast.
NO2 is at a normal level across the US, with a few moderate spikes in parts of Utah, Wisconsin and North Dakota.
In Illinois, NO2 started at 25 and wiggled its way back to the same value, with a few peaks and valleys in between. In NY, it started above 25 but dropped drastically between 1998 and 2000 to a constant zero thereafter. A third location started very high at 50 and dropped gradually over the years.

Comparison of three locations:
SO2 has predominantly been lower in NY compared to Illinois throughout the years, and constantly zero in California.
Ozone in California has peaked and dipped around a similar average throughout the years, whereas Illinois has seen a gradual increase (with a steep climb between 2015 and 2017). NY had a rapid increase at the very beginning (1990) and a similar fall between 2006 and 2008.
PM10 has been low in Illinois, with no sudden changes in levels, and the levels have changed even less along the West Coast and East Coast.
PM2.5 has seen sharp spikes around 1997-1998 and 2000, rising to an all-time high of 75% in 2009 in Illinois. It was at zero until 1997.
California shows similar trends, with sharp spikes and falls.
NY shows fewer spikes, around 1998 and 2007.