SQL Data Exploration Project

sanitizing, cleaning, hands-5095967.jpg

Title: Diving into Pandemic Patterns: An SQL Exploration of Global and Bangladesh-Specific COVID-19 Data

Introduction:

Leveraging SQL as an important tool for data analysis, I dived into an extensive investigation into a rich dataset encompassing global COVID-19 data. My exploration was meticulously detailed, concentrating on global deaths, vaccinations, and a particular emphasis on the impact of the pandemic in Bangladesh. This project allowed me to navigate through a big-data, calculate death percentages, compare infection rates across regions, and unearth patterns hidden in the data. As we delve deeper into this journey, you’ll observe how each SQL query not only refines our understanding but also unravels the complex narratives embedded within this data. This significant exploration underscores the intersection of data analysis, SQL proficiency, and understanding the implications of a global health crisis.

Data Source:

I sourced my data from Our World in Data, a reliable online science publication. The original dataset came in CSV format, containing extensive details about COVID-19 deaths and vaccinations worldwide.

Data Import:

To streamline the analysis, I split the dataset into two distinct Excel files: CovidDeaths and CovidVaccinations. I imported these into Microsoft SQL Server Management Studio, each holding an impressive 327,892 rows of data.

Importing the data was not without its challenges. I encountered some initial hurdles due to missing Microsoft Access Database Engine components. However, after troubleshooting and careful research, I installed the necessary components and successfully imported the data.

In-depth Data Exploration:

In this section, I decided to dive deeper into the dataset and perform more specific and targeted queries. I chose to focus on certain columns, calculate death percentages, examine infection rates, identify countries with the highest rates, examine cases by continents, and finally join tables to correlate data.

1.Examining the Key Variables:

My first goal was to narrow down the range of variables that I wanted to explore. For this, I began by selecting key columns from both the ‘CovidDeaths’ and ‘CovidVac’ tables.

Query:


SELECT location, date, total_cases, new_cases, total_deaths,population
FROM PortfolioProject.dbo.CovidDeaths
ORDER BY 1,2 DESC
--and
SELECT location, date, total_tests, new_tests, people_vaccinated, new_vaccinations
FROM PortfolioProject.dbo.CovidVac
ORDER BY 1,2 DESC

Result:

By ordering the results by location and date in descending order, I could examine the latest data first. The first query gave me insights into the total and new cases, the total deaths, and the population size for each location. The second query provided details on the total and new tests conducted, and the people vaccinated in each location. This approach ensured a streamlined data set for my subsequent analysis.

2. Calculating Death Percentages:

I wanted to calculate the death percentages in Bangladesh, so I used the following SQL command:


SELECT location, date, total_cases, total_deaths, 
(total_deaths / total_cases)*100 as 'DeathPercentage'
FROM PortfolioProject.dbo.CovidDeaths
WHERE location like '%bangladesh%'
ORDER BY 1,2

However, I faced two issues: integer division leading to inaccurate results, and division by zero causing errors. To fix this, I used the CAST function to ensure decimal division for accurate results, and the NULLIF function to prevent division by zero errors.These adjustments addressed the issues and gave me the accurate death percentages for each date in Bangladesh.

The adjusted query looked like this:


SELECT location, date, total_cases, total_deaths, 
(CAST(total_deaths AS DECIMAL) / NULLIF(CAST(total_cases AS DECIMAL), 0))*100 as 'DeathPercentage'
FROM PortfolioProject.dbo.CovidDeaths
WHERE location like '%bangladesh%'
ORDER BY 1,2

Result:

After these adjustments, the query worked as expected and I was able to retrieve the accurate death percentage for each date specific to Bangladesh. These adjustments illustrate how data type considerations and potential undefined operations can greatly impact calculations in SQL, and how important it is to account for these considerations when writing queries.

3. Examining Infection Rate in Relation to Population:

To comprehend the pervasiveness of the infection, I determined the infection rate as a proportion of the total population. I utilized the NULLIF function again to handle potential division by zero errors.


-- Looking at total cases vs population
SELECT location, date, total_cases, population, 
       (CAST(total_cases AS DECIMAL) / NULLIF(CAST(population AS DECIMAL), 0))*100 as 'InfectionRate'
FROM PortfolioProject.dbo.CovidDeaths
WHERE location like '%bangladesh%'
ORDER BY 2 DESC

Result:

This query was aimed at understanding how widespread the infection was in relation to the total population. Similar to the previous query, here too, I used the NULLIF function to avoid potential “division by zero” errors when the population count was zero. The CAST function was again used to ensure appropriate data types were in operation. The output displayed the progression of infection rates in Bangladesh. This information is critical in understanding the spread of the virus within a population over time.

4. Ranking Locations by Infection and Death Rates: My aim here was to rank countries based on the severity of infection and death rates. I had to group and order the data to achieve this objective.

SELECT location, population, 
MAX(HighestInfectionCount) AS 'HighestInfectionCount', 
MAX(PercentPopulationInfected) AS 'PercentPopulationInfected'
FROM
(SELECT location, population, 
CAST(total_cases AS DECIMAL) AS HighestInfectionCount, 
(CAST(total_cases AS DECIMAL) / NULLIF(CAST(population AS DECIMAL), 0))*100 AS PercentPopulationInfected
FROM PortfolioProject.dbo.CovidDeaths) sub
GROUP BY location, population
ORDER BY 4 DESC

Result:

In this query, I used the MAX function to identify the highest recorded infection count and percent population infected for each location. This allows us to rank the locations based on the severity of the infection. The subquery was used to create a temporary table, which was then grouped by location and population, thus organizing the data into understandable and relevant units. The result was an ordered list of countries based on infection rates, shedding light on regions with the most severe impact from the pandemic.

5. Identifying Countries with the Highest Death Count Per Population:

In this stage of the data exploration, the focus was on understanding the mortality rates due to COVID-19 across different countries. Specifically, the aim was to identify countries that had the highest death count in relation to their population.

Query:


SELECT location, population, 
       MAX(HighestDeathCount) AS 'HighestDeathCount', 
       MAX(PercentPopulationDeath) AS 'PercentPopulationDeath'
FROM
  (SELECT location, population, 
          CAST(total_deaths AS DECIMAL) AS HighestDeathCount, 
          (CAST(total_deaths AS DECIMAL) / NULLIF(CAST(population AS DECIMAL), 0))*100 AS PercentPopulationDeath
   FROM PortfolioProject.dbo.CovidDeaths) sub
GROUP BY location, population
ORDER BY 1 

Result:

This query provided a list of countries, their population, their highest death count, and the corresponding percentage of the population that died due to COVID-19.

By first creating a subquery that calculated the total deaths and the percentage of the population that died due to COVID-19, the main query was then able to group these results by location, providing the maximum value for each location. This structure allowed the query to handle potential division by zero errors using the NULLIF function.

6. Breaking Down Cases by Continent:

To visualize the larger geographical picture, I segmented the data by continent. The goal was to gain a broader understanding of the global impact of the pandemic in terms of death count relative to the population of each continent. The initial attempt was to group the data by both continent and population.

Query:


SELECT continent, population, 
       MAX(HighestDeathCount) AS 'HighestDeathCount', 
       MAX(PercentPopulationDeath) AS 'PercentPopulationDeath'
FROM
  (SELECT continent, population, 
          CAST(total_deaths AS DECIMAL) AS HighestDeathCount, 
          (CAST(total_deaths AS DECIMAL) / NULLIF(CAST(population AS DECIMAL), 0))*100 AS PercentPopulationDeath
   FROM PortfolioProject.dbo.CovidDeaths) sub
GROUP BY continent, population
ORDER BY 3 DESC

However, this query returned NULL values and multiple entries for the same continent, like North America. The reason was that the query was grouping by both continent and population, which means SQL Server returned a row for every unique combination of continent and population. This resulted in multiple rows for the same continent due to distinct population figures in the data.

Updated Query:

To resolve this issue, the data was grouped only by continent and the sum of the population and total deaths was taken instead of the maximum


SELECT continent, 
       SUM(CAST(population AS DECIMAL)) AS TotalPopulation,
       SUM(CAST(total_deaths AS DECIMAL)) AS TotalDeaths, 
       (SUM(CAST(total_deaths AS DECIMAL)) / NULLIF(SUM(CAST(population AS DECIMAL)), 0))*100 AS 'PercentPopulationDeath'
FROM PortfolioProject.dbo.CovidDeaths
WHERE continent IS NOT NULL
GROUP BY continent
ORDER BY 3 DESC

Result:

The output of this query provided a holistic view of the total population, total deaths, and the percentage of the population that succumbed to COVID-19 for each continent, illuminating the global impact of the pandemic.

7. Examining Total Vaccinations:

At this point, my investigation led me to examine total vaccinations in relation to population.

Query:


SELECT dea.continent, dea.location, dea.population, vac.new_vaccinations
FROM PortfolioProject.dbo.CovidDeaths dea
JOIN PortfolioProject.dbo.CovidVac vac
ON dea.location = vac.location
AND dea.date = vac.date
WHERE dea.continent IS NOT NULL
ORDER BY vac.new_vaccinations DESC

Result:

I wanted to examine the total vaccinations in relation to the population. For this, another JOIN operation was performed between the Deaths and Vaccination tables to get relevant data in one place. The query filtered out any rows where the continent was NULL to ensure meaningful data analysis. The data was then sorted in descending order of new vaccinations, providing an overview of the vaccination progress across different locations. The query output was a descending list of locations based on their new vaccinations, reflecting the global efforts to counter the virus. 

8. Utilizing CTE (Common Table Expressions) and Temp Tables for Advanced Calculations:

Complex calculations often require intermediate steps and temporary results. To streamline such calculations and keep the code neat, we can use techniques like CTE and temporary tables. Here, the goal is to calculate a rolling count of people vaccinated, and the rate of vaccination in relation to the population in Bangladesh.

a) Using CTE

Query:


With PopvsVac (Continent, Location, Date, Population, New_Vaccinations, RollingPeopleVaccinated)
as
(
Select dea.continent, dea.location, dea.date, dea.population, vac.new_vaccinations
, SUM(CONVERT(decimal,vac.new_vaccinations)) OVER (Partition by dea.Location Order by dea.location, dea.Date) as RollingPeopleVaccinated
--, (RollingPeopleVaccinated/population)*100
From PortfolioProject..CovidDeaths dea
Join PortfolioProject..CovidVac vac
    On dea.location = vac.location
    and dea.date = vac.date
where dea.continent is not null 
--order by 2,3
)
Select *, (RollingPeopleVaccinated/Population)*100 as RollingPeopleVacRate
From PopvsVac
WHERE Location like '%bangladesh%'
ORDER BY 3 DESC

Result:

The aim here was to generate a rolling count of vaccinations and the vaccination rate for each date in Bangladesh. The CTE (Common Table Expressions) was used as a temporary result set for this query. The CTE, PopvsVac, contained a JOIN operation between the Deaths and Vaccination tables and an OVER clause for calculating the rolling count of vaccinations. This calculation was partitioned by location and ordered by date, allowing a cumulative count for each location over time. The final SELECT statement outside the CTE calculates the vaccination rate by dividing the RollingPeopleVaccinated by Population. And the result was presented for Bangladesh with the help of WHERE clause.

b) Using Temp Tables

Temporary tables offer a way to manage and separate data during a SQL session, making complex computations more manageable.

Query:


DROP Table if exists #PercentPopulationVaccinated
Create Table #PercentPopulationVaccinated
(
Continent nvarchar(255),
Location nvarchar(255),
Date datetime,
Population numeric,
New_vaccinations numeric,
RollingPeopleVaccinated numeric
)

Insert into #PercentPopulationVaccinated
Select dea.continent, dea.location, dea.date, dea.population, vac.new_vaccinations
, SUM(CONVERT(bigint,vac.new_vaccinations)) OVER (Partition by dea.Location Order by dea.location, dea.Date) as RollingPeopleVaccinated
From PortfolioProject..CovidDeaths dea
Join PortfolioProject..CovidVac vac
	On dea.location = vac.location
	and dea.date = vac.date

Select *, (RollingPeopleVaccinated/Population)*100  as RollingPeopleVacRate
From #PercentPopulationVaccinated
WHERE Location like '%bangladesh%'
ORDER BY 7 DESC

Result:

Similar to the CTE method, this query calculates the rolling count of vaccinations and the vaccination rate for each location. Here, instead of a CTE, a temporary table #PercentPopulationVaccinated is created to hold the intermediate results. The temporary table is used in the same manner as the CTE, making the complex computation more manageable. This query also provided a timeline of the cumulative number of people vaccinated and the vaccination rate for Bangladesh. This detailed insight can help understand the vaccination progress in Bangladesh.

Conclusion

Through this exploratory journey, we’ve delved into the multifaceted world of COVID-19 data, revealing insights into global death counts, infection rates, and vaccination trends. We’ve witnessed the power of SQL in unmasking geographical disparities in infection and death rates, from the country-level to continent-level, exposing the pandemic’s true global impact. However, this is just the beginning. These findings pave the way for more complex analyses, from scrutinizing the factors causing disparities to studying correlations with country-specific variables. As we continue to face the COVID-19 crisis, these data-driven insights are invaluable – informing today’s actions and shaping tomorrow’s strategies. Data exploration and analysis thus play a critical role in navigating this global crisis, a testament to their potential and importance.

SPECIAL THANKS: Alex The Analyst