How can I get started on a Data Project? A simple example
Have you completed online courses or university modules on something data-related (data science, data analysis, etc.) but are still unsure how or where to start on an independent project to build your portfolio and apply what you have learned? This is an important skill if you want to get into the data space and stay there. As you get better at it, you will gradually build a portfolio that impresses employers.
In this post, I will give you a hands-on, step-by-step guide to getting started with a simple project. What I'm about to show you was implemented in Python, but you could just as easily do it in R, Tableau, Power BI or even Excel! The aim of this post is to provide a framework to get you started on a project of your own and to encourage project-based learning.
First Steps: The (not so) great expectations
The first step is to have manageable expectations. Before you start on any project, always think about its aim and scope. If there was one thing that your project expressed, what would it be? The scope concerns how much data you will need and how many steps it will take to get there. Don't worry if you don't have a detailed plan at this point. Something like, "I want to see by how much CO2 emissions have increased and who or what the main emitters are", is good enough for a start (in fact this is the project I will use to showcase what I mean). As for the scope: "I will need some data on CO2 emissions; there are lots of public datasets I can focus on. I will choose a minimum of 1 and a maximum of 4 datasets. The analysis will be simple. It will only focus on generating descriptive statistics from the existing data, without any attempt to forecast or build more complex models."
Once you have an idea of your aim and scope, you can begin to plunder the many different open data sources and repositories available online. If you're not sure where to start looking, you can skim through this article (or many others like it): https://learn.g2.com/open-data-sources
For this tutorial, I found what I was looking for here: https://github.com/adventuroussrv/Climate-Change-Datasets. Going down the list, I picked up three different datasets on CO2 emissions from here: https://datahub.io/collections/climate-change.
Step 2: Honing your questions
You will often run into datasets that can answer a lot of questions. That's why it is important to ask just a few. These can be anything that you are interested in (a top-down approach: leading with the question), but if you only have a vague idea, it can help to look at your datasets and come up with questions they can answer for you (a bottom-up approach: leading with the data to generate questions). In the current example, I had three different datasets that lent themselves nicely to answering the three following questions:
- By how much have CO2 emissions increased since the 1950s?
- Which type of fuel has contributed the most to CO2 emissions in the last 50 years?
- Who are the major emitters?
Remember: unless you refine your questions into something your data can answer, you can't really advance to the next stage, because these questions determine which parts of the data you will clean, analyse and visualise.
Step 3: Begin Cleaning and Analysing the Data
Now that we have the questions we want to focus on, we can turn to the data. To get the full implementation of the code used for the analysis, please go here. The following walks you through the general steps without focusing on the implementation (as mentioned, the same approach could be carried out in a language or tool other than Python). If you wish to follow along with the code, it is commented well enough to enable you to do so.
When the data is loaded, the top few rows look like this:
   Date        Decimal Date  Average  Interpolated  Trend   Number of Days
0  1958-03-01  1958.208      315.71   315.71        314.62  -1
1  1958-04-01  1958.292      317.45   317.45        315.29  -1
2  1958-05-01  1958.375      317.50   317.50        314.71  -1
3  1958-06-01  1958.458      -99.99   317.10        314.85  -1
4  1958-07-01  1958.542      315.86   315.86        314.98  -1

Rows: 727
'Decimal Date' is a fractional (decimal) transformation of the date within a particular year, while a value of -1 in 'Number of Days' denotes that there is no record of the number of daily averages for the given month. To see what all of the columns mean, go to: https://datahub.io/core/co2-ppm.
We are interested in ‘Date’ and ‘Average’, which contains the mean CO2 mole fraction determined from daily averages.
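Loading the data with pandas might look like the sketch below. The file name in the comment is an assumption (substitute whatever you downloaded from datahub.io); for illustration the preview rows above are built inline.

```python
import pandas as pd

# In practice you would load the CSV you downloaded, e.g.:
# df = pd.read_csv("co2-mm-mlo.csv")

# For illustration, a few rows mirroring the preview above:
df = pd.DataFrame({
    "Date": ["1958-03-01", "1958-04-01", "1958-05-01", "1958-06-01", "1958-07-01"],
    "Decimal Date": [1958.208, 1958.292, 1958.375, 1958.458, 1958.542],
    "Average": [315.71, 317.45, 317.50, -99.99, 315.86],
    "Interpolated": [315.71, 317.45, 317.50, 317.10, 315.86],
    "Trend": [314.62, 315.29, 314.71, 314.85, 314.98],
    "Number of Days": [-1, -1, -1, -1, -1],
})

# Inspect the first few rows and the size of the data
print(df.head())
print("Rows:", len(df))
```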
Cleaning and Pre-processing
As we want the average across the months of a given year, we can separate out the year part of the 'Date' column. One thing worth noting: some of the values for 'Average' are -99.99, which indicates that no data was collected for that month, so we get rid of those rows. The new data then looks as follows:
   Date        Decimal Date  Average  ...  Trend   Number of Days  year
0  1958-03-01  1958.208      315.71   ...  314.62  -1              1958
1  1958-04-01  1958.292      317.45   ...  315.29  -1              1958
2  1958-05-01  1958.375      317.50   ...  314.71  -1              1958
4  1958-07-01  1958.542      315.86   ...  314.98  -1              1958
5  1958-08-01  1958.625      314.93   ...  315.94  -1              1958

Rows: 720
A new column on the right called ‘year’ has appeared, while the number of rows has decreased from 727 to 720 (there were 7 months for which no data was collected).
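The cleaning step above could be sketched as follows, assuming a DataFrame `df` shaped like the preview (a few illustrative rows are built inline so the snippet is self-contained):

```python
import pandas as pd

# Illustrative rows in the shape of the raw data
df = pd.DataFrame({
    "Date": ["1958-03-01", "1958-04-01", "1958-05-01", "1958-06-01", "1958-07-01"],
    "Average": [315.71, 317.45, 317.50, -99.99, 315.86],
})

# Drop the months where no data was collected (flagged as -99.99)
df = df[df["Average"] != -99.99].copy()

# Separate out the year part of the 'Date' column into a new 'year' column
df["year"] = pd.to_datetime(df["Date"]).dt.year

print(df)
print("Rows:", len(df))  # one row fewer: the -99.99 month is gone
```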
Now that we have the year for each record separated out, we can group the average emissions by the year to get the data at the grain of the year (average emissions per year).
year
1962    318.450833

Rows: 61
As you can see, we now have the average emissions recorded per year, and the number of records has gone down from 720 to 61 (the data spans 61 years).
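The grouping step is a one-liner in pandas. A minimal sketch, assuming the cleaned data from the previous step (the values here are illustrative):

```python
import pandas as pd

# Cleaned monthly data with the 'year' column already separated out
df = pd.DataFrame({
    "year": [1958, 1958, 1959, 1959],
    "Average": [315.71, 317.45, 315.98, 316.20],
})

# Average the monthly values up to the grain of a year
yearly = df.groupby("year")["Average"].mean()
print(yearly)
```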
To see the trend throughout the years, we can plot a line chart with the year against the average CO2 emissions (mole fractions).
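A line chart like this can be produced with matplotlib. A sketch, assuming the `yearly` series from the grouping step (the three values here are illustrative stand-ins):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the yearly averages produced by the groupby step
yearly = pd.Series([315.3, 316.0, 316.9], index=[1958, 1959, 1960], name="Average")

fig, ax = plt.subplots()
yearly.plot(ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("Average CO2 (ppm, mole fraction)")
ax.set_title("Atmospheric CO2 over time")
fig.savefig("co2_trend.png")
```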
The trend of increase in the CO2 mole fraction is clear to see from this data. It has gone from around 315 ppm to over 400 ppm in 61 years, an increase of roughly 30%.
Emission by fuel type
Now, to see which fuel type has been most heavily linked to emissions, we can turn to our second dataset. This is how it looks:
Year Total Gas Fuel ... Cement Gas Flaring Per Capita
1751 3 0 ... 0 0 NaN
1752 3 0 ... 0 0 NaN
1753 3 0 ... 0 0 NaN
1754 3 0 ... 0 0 NaN
1755 3 0 ... 0 0 NaN
... ... ... ... ... ... ...
2006 8370 1525 ... 356 61 1.27
2007 8566 1572 ... 382 68 1.28
2008 8783 1631 ... 388 71 1.30
2009 8740 1585 ... 413 66 1.28
2010 9167 1702 ... 450 59 1.33
The records start all the way back in 1751. In the earliest years, only the 'Solid Fuel' type had any carbon emissions recorded. This is likely because solid fuels were the dominant fuels back in the day, and also because data collection had not advanced much. In fact, the data for all the other fuel types and for 'Per Capita' only becomes meaningfully populated from around 1950, so the analysis was restricted to 1950–2010.
When we plot a stacked area graph of the different types of fuels against the emissions (in million metric tons of carbon), we get the following result:
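The filtering and stacked area chart could be sketched as below. Column names follow the preview above; the numbers are illustrative stand-ins, not the real dataset values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows in the shape of the fuel-type dataset
fuel = pd.DataFrame({
    "Year": [1949, 1950, 1951, 1952],
    "Gas Fuel": [95, 97, 115, 124],
    "Liquid Fuel": [480, 500, 520, 540],
    "Solid Fuel": [1070, 1080, 1129, 1119],
    "Cement": [18, 18, 20, 22],
    "Gas Flaring": [20, 23, 24, 26],
})

# Restrict to the period where all fuel types are meaningfully populated
fuel = fuel[fuel["Year"] >= 1950].set_index("Year")

# Stack one area per fuel type
fig, ax = plt.subplots()
ax.stackplot(fuel.index, fuel.T.values, labels=fuel.columns)
ax.set_xlabel("Year")
ax.set_ylabel("Emissions (million metric tons of carbon)")
ax.legend(loc="upper left")
fig.savefig("emissions_by_fuel.png")
```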
Solid and Liquid Fuel have contributed the most, while Gas Flaring and Cement the least (consistent with the data preview above, where Gas Flaring and Cement are small fractions of the total). We can't say why this is the case with our current data; that is outside this project's scope, but it could be incorporated into a future project within the theme of climate change. Once again, the general trend for all fuel types has been upward over the years 1950–2010. This is further evidenced in the chart below to the left.
The CO2 emissions per capita from 1980–2000 show some stagnation, but pick up again after 2000. There is likely a historical reason behind this blip, but once again it could be investigated in a future project.
Major (and minor) Emitters
So far the data has shown us that CO2 has indeed trended upward, and which of the fuels we use emit the most and the least. Now we can focus on the countries with the greatest and smallest CO2 emissions. Let us look at the data:
Year  Country         ...  Per Capita
1751  UNITED KINGDOM  ...  0.00
1752  UNITED KINGDOM  ...  0.00
1753  UNITED KINGDOM  ...  0.00
1754  UNITED KINGDOM  ...  0.00
1755  UNITED KINGDOM  ...  0.00
 ...  ...             ...  ...
2014  VIET NAM        ...  0.49
2014  Zimbabwe        ...  0.22
In total, this dataset has 10 columns and 17,232 rows. However, we only want to find out which countries are the highest and lowest emitters, so we will only need the country column and the measure of emissions. Once again the data starts in 1751, but for completeness and consistency we will again focus on the period from 1950 onwards.
This time we have to aggregate the data across the years for each country. We will do this by summing all the emissions from 1950–2014 for each country. We will also look at the mean per-capita emissions over the same period, because that does not necessarily correlate with total emissions.
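The per-country aggregation could be sketched as follows. Column names follow the preview; the country rows and figures are illustrative stand-ins for the real data:

```python
import pandas as pd

# Illustrative rows in the shape of the per-country dataset
countries = pd.DataFrame({
    "Year": [1950, 1951, 1950, 1951, 1950, 1951],
    "Country": ["USA", "USA", "CHINA", "CHINA", "ZIMBABWE", "ZIMBABWE"],
    "Total": [700, 720, 60, 65, 1, 1],
    "Per Capita": [4.6, 4.7, 0.1, 0.1, 0.3, 0.3],
})

# Restrict to the 1950-2014 period, then aggregate per country:
# total emissions as a sum, per-capita emissions as a mean
period = countries[countries["Year"].between(1950, 2014)]
summary = period.groupby("Country").agg(
    total_emissions=("Total", "sum"),
    mean_per_capita=("Per Capita", "mean"),
)

# Sort from highest to lowest total emitters
print(summary.sort_values("total_emissions", ascending=False))
```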
By sorting the data from highest to lowest total emissions, we get the following result:
As can be seen, the result matches what is widely circulated and has become mainstream: the USA, China, the former USSR, etc. are the highest total emitters. The lowest are perhaps less well known. It is interesting to see that the Antarctic Fisheries emitted less than 60 metric tons of carbon between 1950–2014 according to this data.
The top 5 per capita emitters are not the same as the highest total emitters as we can see below.
The Netherlands Antilles and Aruba have the highest average per-capita emissions, at around 24 metric tons of carbon, followed by Qatar at around 13. To put that into perspective: the average person in the highest per-capita emitting country emits, in about 3 years, the same amount of carbon as the lowest emitting country did over 64 years in TOTAL!
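That comparison is simple arithmetic, which can be checked with the figures quoted above (taken as rough upper bounds from the text, not exact dataset values):

```python
# Figures quoted in the text (approximate)
top_per_capita_per_year = 24  # metric tons of carbon per person per year
lowest_country_total = 60     # metric tons of carbon in TOTAL, 1950-2014

# Years needed for one person at the top per-capita rate
# to match the lowest country's entire 64-year total
years_to_match = lowest_country_total / top_per_capita_per_year
print(years_to_match)  # 2.5, i.e. under 3 years
```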
Getting the Project Out
Once you have fulfilled the aim of your project within the scope you set yourself, it is time to get it out where others can see it. I normally push my projects to GitHub; if you don't have an account, it is worth setting one up. Always try to comment and document the steps you took in the project, so that if you decide to share it online (like me) through a blog or an article, there is minimal friction in getting it done.
There you have it, you’ve independently completed a data project and shared it where prospective employers, clients or others can see. :)
Although I've gone through the analysis to answer my chosen questions, I haven't emphasised a key skill for completing a project. An essential aspect of project-based learning is that you may not know how to implement certain things you want. I've known a few people who give up early because they can't recall, or have not learned, everything they need. That's where the learning part of 'project-based learning' comes in. Throughout the process, always be willing to Google, use Stack Overflow, read the documentation of the library or tool you are using, or consult any of a host of websites, books and articles to find answers to your questions. The more often you do this, the better you will get at searching and at filtering relevant from irrelevant answers. Chances are the learning will also stick, because you are aware of the context in which you used the knowledge. This way your project will be rewarding not only because you finished it, but also because you learned new things along the way! It is also important to build a foundation for your learning by reading books, articles or papers when you can, but that's a separate topic and depends on your goals.
I hope this simple exercise has shown you how you can begin a data project of your own. Over time, you could end up building a great portfolio (as I am hoping to)! If you are interested in collaboration, you can visit my site to learn and contribute to certain projects together.
Once again, the script with the implementation for this project can be found here. Also remember to have fun along the way. ;)