SQL Data Cleaning Project using Housing Dataset

vw beetle, volkswagen, classic car-405876.jpg

Introduction

In every data analysis project, the first and often most crucial step is data cleaning. In this blog post, we will comprehensively explore the cleaning process of the Nashville housing dataset. This dataset, boasting 56,477 rows, offers a rich variety of property sales information. But before we can unlock any insights, we must first ensure the data is clean and accessible. So, let’s dive in!

Data Import

Before we start cleaning, we need to import our dataset. We imported the excel file into Microsoft SQL Server Management Studio. Our data is now stored in a SQL database. Therefore, we used a SQL SELECT query to retrieve it:


SELECT *
FROM PortfolioProject.dbo.NashvilleHousing

This command pulls all the columns from the NashvilleHousing table in the PortfolioProject database.

In-depth Data Cleaning

Converting the SaleDate field

The SaleDate field was initially a string datatype. However, to facilitate date-based operations, we need to convert it to a Date datatype. Our first attempt looked like this:


UPDATE NashvilleHousing
SET SaleDate = CONVERT(Date,SaleDate)

This attempted to directly convert the SaleDate string into a Date format. However, due to SQL Server constraints, which disallow changing the datatype of a column in-place, this operation failed. The workaround was to create a new column and populate it with the converted dates:


ALTER TABLE NashvilleHousing
ADD SaleDateConverted Date;

UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(Date,SaleDate)

-- Checking if it is added to the table

Select SaleDate, SaleDateConverted
From PortfolioProject.dbo.NashvilleHousing

Result:

The ALTER TABLE command added a new SaleDateConverted column, and the UPDATE command filled this column with SaleDate data, converted to Date format.

Populating missing PropertyAddress data

Our dataset had missing PropertyAddress values. We tackled this by utilizing the ISNULL function and a JOIN operation to populate missing data from other records that shared the same ParcelID.


Select *
From PortfolioProject.dbo.NashvilleHousing
Where PropertyAddress is null
order by ParcelID

Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress)
From PortfolioProject.dbo.NashvilleHousing a
JOIN PortfolioProject.dbo.NashvilleHousing b
	on a.ParcelID = b.ParcelID
	AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null

--Now we update

UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress,b.PropertyAddress)
FROM PortfolioProject.dbo.NashvilleHousing a
JOIN PortfolioProject.dbo.NashvilleHousing b
	ON a.ParcelID = b.ParcelID
	AND a.[UniqueID ] <> b.[UniqueID ]
WHERE a.PropertyAddress is null

--verify the changes
Select UniqueID, ParcelID, PropertyAddress
From PortfolioProject.dbo.NashvilleHousing

Result:

SELECT statement performs a self-join on the NashvilleHousing table (meaning the table is joined with itself). This is done on the condition that ParcelID of table a equals ParcelID of table b and UniqueID of table a is not equal to UniqueID of table b. This is used to find rows in the same table with the same ParcelID but different UniqueID, where PropertyAddress is null in table a. It selects columns from both versions of the table and also applies the ISNULL function to a.PropertyAddress, which will return b.PropertyAddress if a.PropertyAddress is null.

The UPDATE statement is similar to the SELECT statement above, but instead of just selecting, it updates PropertyAddress in table a with the PropertyAddress from table b, if PropertyAddress in table a is null. This effectively fills null PropertyAddress entries with PropertyAddress from other entries with the same ParcelID.The ISNULL function checks if the first argument (a.PropertyAddress) is NULL. If true, it replaces it with the second argument (b.PropertyAddress), essentially filling the missing data from corresponding entries with the same ParcelID.

Breaking out Address into Individual Columns

The PropertyAddress field bundled multiple data pieces—Address, City, and State. To facilitate easy analysis, we decided to split this into separate fields using the SUBSTRING and CHARINDEX functions.


ALTER TABLE NashvilleHousing
ADD PropertySplitAddress Nvarchar(255);
UPDATE NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 );

ALTER TABLE NashvilleHousing
ADD PropertySplitCity Nvarchar(255);
UPDATE NashvilleHousing
SET PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress))

Result:

The SUBSTRING function extracts a part of the string, and CHARINDEX finds the position of the comma which separates address and city in the PropertyAddress field. We created two new fields: PropertySplitAddress and PropertySplitCity to store these extracted parts.

Reformatting the “Sold as Vacant” field

In the SoldAsVacant field, we had ‘Y’ and ‘N’ values, which we transformed to ‘Yes’ and ‘No’, respectively, for clarity.


Select Distinct(SoldAsVacant), Count(SoldAsVacant)
From PortfolioProject.dbo.NashvilleHousing
Group by SoldAsVacant
order by 2


Select SoldAsVacant
, CASE When SoldAsVacant = 'Y' THEN 'Yes'
	   When SoldAsVacant = 'N' THEN 'No'
	   ELSE SoldAsVacant
	   END
From PortfolioProject.dbo.NashvilleHousing


Update NashvilleHousing
SET SoldAsVacant = CASE When SoldAsVacant = 'Y' THEN 'Yes'
	   When SoldAsVacant = 'N' THEN 'No'
	   ELSE SoldAsVacant
	   END

Select Distinct(SoldAsVacant), Count(SoldAsVacant)
From PortfolioProject.dbo.NashvilleHousing
Group by SoldAsVacant
order by 2

  1. Select Distinct(SoldAsVacant), Count(SoldAsVacant) From PortfolioProject.dbo.NashvilleHousing Group by SoldAsVacant order by 2: This first SQL statement is finding the distinct values of SoldAsVacant and the count of each distinct value. The result is ordered by the count (the second column).

  2. The second Select statement is showing what the SoldAsVacant field would look like after the proposed changes. It uses a CASE statement to change ‘Y’ to ‘Yes’, ‘N’ to ‘No’, and leaves any other values as they are.

Last operation used a CASE statement, which changes ‘Y’ to ‘Yes’ and ‘N’ to ‘No’. If any other value is found, it keeps the original value.

 Finally, Select statement is the same as the first one. It’s used to verify that the Update statement worked correctly.

Removing Duplicates

Finally, we found and removed duplicate records. We identified these duplicates using the ROW_NUMBER() function partitioned over certain fields.


WITH RowNumCTE AS(
SELECT *,
	ROW_NUMBER() OVER (
	PARTITION BY ParcelID,
				 PropertyAddress,
				 SalePrice,
				 SaleDate,
				 LegalReference
				 ORDER BY
					UniqueID
					) row_num
FROM PortfolioProject.dbo.NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1

This command partitions the data over multiple fields, assigns a row number for each partition, and deletes any row with a row number greater than 1, effectively removing duplicates.

Result:

  1. The WITH clause is used to create a temporary result set named RowNumCTE. This result set includes all the columns of the NashvilleHousing table, plus a new column row_num.

    The ROW_NUMBER() function is used to assign a unique row number to each record in the result set. The OVER clause specifies that the row numbers should be assigned within groups of records that have the same ParcelID, PropertyAddress, SalePrice, SaleDate, and LegalReference. Within each group, records are ordered by UniqueID.

    So, for each group of duplicate records, the record with the smallest UniqueID will have row_num = 1, the record with the next smallest UniqueID will have row_num = 2, and so on.

  2. The Select * From RowNumCTE Where row_num > 1 Order by PropertyAddress statement selects all records from RowNumCTE where row_num is greater than 1. These are the duplicate records.

    This statement is just for checking the duplicates. It doesn’t remove any records. It orders the output by PropertyAddress for easy reviewing.

  3. The Select * From PortfolioProject.dbo.NashvilleHousing statement selects all records from the original NashvilleHousing table. This is just to view the original table, it doesn’t change anything.


-- Delete Unused Columns


Select *
From PortfolioProject.dbo.NashvilleHousing


ALTER TABLE PortfolioProject.dbo.NashvilleHousing
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate

Conclusion

This thorough exploration has transformed our dataset into a clean, efficient, and analysis-ready tool. This process emphasized the crucial role data cleaning plays in data analysis, unlocking the potential of a dataset and paving the way for insightful analysis. With this cleaned dataset, we are prepared to dive deep into the Nashville housing market’s intricacies in our future analyses. Stay tuned for upcoming blogs where we will utilize different analysis techniques and visualization tools on our now pristine dataset!