SQL Data Cleaning Project using Housing Dataset

Introduction
In every data analysis project, the first and often most crucial step is data cleaning. In this blog post, we will comprehensively explore the cleaning process of the Nashville housing dataset. This dataset, boasting 56,477 rows, offers a rich variety of property sales information. But before we can unlock any insights, we must first ensure the data is clean and accessible. So, let’s dive in!
Data Import
Before we start cleaning, we need to import our dataset. We imported the excel file into Microsoft SQL Server Management Studio. Our data is now stored in a SQL database. Therefore, we used a SQL SELECT query to retrieve it:
SELECT *
FROM PortfolioProject.dbo.NashvilleHousing
This command pulls all the columns from the NashvilleHousing table in the PortfolioProject database.
In-depth Data Cleaning
Converting the SaleDate field
The SaleDate field was initially a string datatype. However, to facilitate date-based operations, we need to convert it to a Date datatype. Our first attempt looked like this:
UPDATE NashvilleHousing
SET SaleDate = CONVERT(Date,SaleDate)
This attempted to directly convert the SaleDate
string into a Date format. However, due to SQL Server constraints, which disallow changing the datatype of a column in-place, this operation failed. The workaround was to create a new column and populate it with the converted dates:
ALTER TABLE NashvilleHousing
ADD SaleDateConverted Date;
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(Date,SaleDate)
-- Checking if it is added to the table
Select SaleDate, SaleDateConverted
From PortfolioProject.dbo.NashvilleHousing
Result:

The ALTER TABLE
command added a new SaleDateConverted
column, and the UPDATE
command filled this column with SaleDate
data, converted to Date format.
Populating missing PropertyAddress data
Our dataset had missing PropertyAddress
values. We tackled this by utilizing the ISNULL function and a JOIN operation to populate missing data from other records that shared the same ParcelID
.
Select *
From PortfolioProject.dbo.NashvilleHousing
Where PropertyAddress is null
order by ParcelID
Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress)
From PortfolioProject.dbo.NashvilleHousing a
JOIN PortfolioProject.dbo.NashvilleHousing b
on a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null
--Now we update
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress,b.PropertyAddress)
FROM PortfolioProject.dbo.NashvilleHousing a
JOIN PortfolioProject.dbo.NashvilleHousing b
ON a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
WHERE a.PropertyAddress is null
--verify the changes
Select UniqueID, ParcelID, PropertyAddress
From PortfolioProject.dbo.NashvilleHousing
Result:

SELECT
statement performs a self-join on the NashvilleHousing
table (meaning the table is joined with itself). This is done on the condition that ParcelID
of table a equals ParcelID
of table b and UniqueID
of table a is not equal to UniqueID
of table b. This is used to find rows in the same table with the same ParcelID
but different UniqueID
, where PropertyAddress
is null in table a. It selects columns from both versions of the table and also applies the ISNULL
function to a.PropertyAddress
, which will return b.PropertyAddress
if a.PropertyAddress
is null.

The UPDATE
statement is similar to the SELECT
statement above, but instead of just selecting, it updates PropertyAddress
in table a with the PropertyAddress
from table b, if PropertyAddress
in table a is null. This effectively fills null PropertyAddress
entries with PropertyAddress
from other entries with the same ParcelID
.The ISNULL
function checks if the first argument (a.PropertyAddress) is NULL. If true, it replaces it with the second argument (b.PropertyAddress), essentially filling the missing data from corresponding entries with the same ParcelID
.
Breaking out Address into Individual Columns
The PropertyAddress
field bundled multiple data pieces—Address, City, and State. To facilitate easy analysis, we decided to split this into separate fields using the SUBSTRING and CHARINDEX functions.
ALTER TABLE NashvilleHousing
ADD PropertySplitAddress Nvarchar(255);
UPDATE NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 );
ALTER TABLE NashvilleHousing
ADD PropertySplitCity Nvarchar(255);
UPDATE NashvilleHousing
SET PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress))
Result:

The SUBSTRING
function extracts a part of the string, and CHARINDEX
finds the position of the comma which separates address and city in the PropertyAddress
field. We created two new fields: PropertySplitAddress
and PropertySplitCity
to store these extracted parts.
Reformatting the “Sold as Vacant” field
In the SoldAsVacant
field, we had ‘Y’ and ‘N’ values, which we transformed to ‘Yes’ and ‘No’, respectively, for clarity.
Select Distinct(SoldAsVacant), Count(SoldAsVacant)
From PortfolioProject.dbo.NashvilleHousing
Group by SoldAsVacant
order by 2
Select SoldAsVacant
, CASE When SoldAsVacant = 'Y' THEN 'Yes'
When SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
END
From PortfolioProject.dbo.NashvilleHousing
Update NashvilleHousing
SET SoldAsVacant = CASE When SoldAsVacant = 'Y' THEN 'Yes'
When SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
END
Select Distinct(SoldAsVacant), Count(SoldAsVacant)
From PortfolioProject.dbo.NashvilleHousing
Group by SoldAsVacant
order by 2

Select Distinct(SoldAsVacant), Count(SoldAsVacant) From PortfolioProject.dbo.NashvilleHousing Group by SoldAsVacant order by 2
: This first SQL statement is finding the distinct values ofSoldAsVacant
and the count of each distinct value. The result is ordered by the count (the second column).The second
Select
statement is showing what theSoldAsVacant
field would look like after the proposed changes. It uses aCASE
statement to change ‘Y’ to ‘Yes’, ‘N’ to ‘No’, and leaves any other values as they are.
Last operation used a CASE statement, which changes ‘Y’ to ‘Yes’ and ‘N’ to ‘No’. If any other value is found, it keeps the original value.
Finally, Select
statement is the same as the first one. It’s used to verify that the Update
statement worked correctly.

Removing Duplicates
Finally, we found and removed duplicate records. We identified these duplicates using the ROW_NUMBER()
function partitioned over certain fields.
WITH RowNumCTE AS(
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID,
PropertyAddress,
SalePrice,
SaleDate,
LegalReference
ORDER BY
UniqueID
) row_num
FROM PortfolioProject.dbo.NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1
This command partitions the data over multiple fields, assigns a row number for each partition, and deletes any row with a row number greater than 1, effectively removing duplicates.
Result:

The
WITH
clause is used to create a temporary result set namedRowNumCTE
. This result set includes all the columns of theNashvilleHousing
table, plus a new columnrow_num
.The
ROW_NUMBER()
function is used to assign a unique row number to each record in the result set. TheOVER
clause specifies that the row numbers should be assigned within groups of records that have the sameParcelID
,PropertyAddress
,SalePrice
,SaleDate
, andLegalReference
. Within each group, records are ordered byUniqueID
.So, for each group of duplicate records, the record with the smallest
UniqueID
will haverow_num
= 1, the record with the next smallestUniqueID
will haverow_num
= 2, and so on.The
Select * From RowNumCTE Where row_num > 1 Order by PropertyAddress
statement selects all records fromRowNumCTE
whererow_num
is greater than 1. These are the duplicate records.This statement is just for checking the duplicates. It doesn’t remove any records. It orders the output by
PropertyAddress
for easy reviewing.The
Select * From PortfolioProject.dbo.NashvilleHousing
statement selects all records from the originalNashvilleHousing
table. This is just to view the original table, it doesn’t change anything.
-- Delete Unused Columns
Select *
From PortfolioProject.dbo.NashvilleHousing
ALTER TABLE PortfolioProject.dbo.NashvilleHousing
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate

Conclusion
This thorough exploration has transformed our dataset into a clean, efficient, and analysis-ready tool. This process emphasized the crucial role data cleaning plays in data analysis, unlocking the potential of a dataset and paving the way for insightful analysis. With this cleaned dataset, we are prepared to dive deep into the Nashville housing market’s intricacies in our future analyses. Stay tuned for upcoming blogs where we will utilize different analysis techniques and visualization tools on our now pristine dataset!