Introduction
In this project, I set out to discover patterns in California housing prices. Using a dataset of housing prices in California’s districts, I applied clustering to see whether any features separated the districts into distinct groups. I used k-means clustering to group the districts based on their population, prices, and house descriptions.
Clustering
K-means clustering is a method that uses quantitative data to identify patterns by grouping similar data points. The model creates k clusters, where k is a value of your choosing. After you choose the number of clusters, the model establishes k centroids that act as the centers of the clusters. The model then goes through your data and assigns each point to its nearest centroid, and afterwards moves each centroid to the mean of the points assigned to it. The model repeats these two steps until the data points stop changing clusters.
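The assign-and-update loop described above can be sketched in a few lines of plain Python. This is only an illustration of the general algorithm, not the code used in this project: the function name `kmeans`, the toy `points`, and the choice to seed the centroids with randomly picked data points are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Toy 2-D points, made up purely for demonstration.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Library implementations add refinements such as smarter initialisation and multiple restarts, so in practice a library implementation is usually preferable to a hand-written loop like this one.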
The Dataset
https://www.kaggle.com/datasets/camnugent/california-housing-prices
This Kaggle dataset contains information about housing districts within California. The features include the median house age, the total number of rooms, the total number of bedrooms, the population, the total number of households, the median income, the median house value, and location information.
Data Understanding
To check for collinearities, I created a correlation matrix of the dataset’s features. The matrix showed that total rooms, households, and population were all highly correlated with one another, but I decided that these features carried information I didn’t want to lose, so I kept them. The matrix also showed that the location features, longitude and latitude, were highly correlated with each other. I decided to drop these features because I felt the location of the district would interfere with any insights. If location were included, the districts would largely be placed into clusters based on their physical distance from each other.
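A correlation check of this kind might look like the sketch below. It runs on a small synthetic table rather than the Kaggle file, so the numbers it produces say nothing about the real data; the column names and the 0.8 cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: rooms and population are built from households,
# so those three columns are correlated by construction.
rng = np.random.default_rng(0)
n = 200
households = rng.integers(100, 1000, size=n).astype(float)
df = pd.DataFrame({
    "households": households,
    "total_rooms": households * 5 + rng.normal(0, 50, n),
    "population": households * 3 + rng.normal(0, 50, n),
    "median_income": rng.normal(4, 1, n),
})

# Pairwise Pearson correlations (the pandas default).
corr = df.corr()

# List each feature pair whose absolute correlation exceeds the cutoff.
threshold = 0.8  # hypothetical cutoff for "high" correlation
pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```

Listing the pairs this way makes it easier to decide, feature by feature, whether a high correlation justifies dropping a column or whether the column is worth keeping anyway.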
Data Preprocessing and Modeling
First, I dropped the longitude, latitude, and ocean proximity from the dataset. I did this to remove the physical distance between districts as a factor, and because ocean proximity was not a quantitative feature. Next, I computed the average number of rooms and bedrooms per household in each district by dividing the total rooms and total bedrooms by the total number of households, and added these averages to the dataset. I did this so that the dataset would have more information on individual houses within the district. Then, I normalized the dataset so that features with large ranges and outlying districts wouldn’t dominate the model. Afterwards, I checked the data for null values and dropped the bedroom features because they contained nulls. I dropped these features instead of replacing the null values because I felt that the room features gave enough information about the rooms within a house.
Finally, I applied PCA to summarize the features and to make the modeling more efficient. To choose the number of components for PCA, I plotted the explained variance against the number of components. This led me to choose 2 components because it was the lowest number of components with a sufficient amount of explained variance.
After the preprocessing, I ran the k-means model with different numbers of clusters and used the elbow method to choose k. I also compared the silhouette score for each candidate k to make sure I chose a sufficient value. Using the elbow method and silhouette scores, I chose 2 clusters for the model.
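The steps above could be wired together roughly as follows with pandas and scikit-learn. This is a sketch under several assumptions, not the project's actual code: the data is synthetic, the column names are guesses at the Kaggle CSV's names, `StandardScaler` stands in for whichever normalization was used, and the range of k values tried is arbitrary. On real data, the location and bedroom columns would be dropped before this point.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the district table (values are random, not real).
rng = np.random.default_rng(0)
n = 300
households = rng.integers(100, 1500, size=n).astype(float)
df = pd.DataFrame({
    "housing_median_age": rng.integers(1, 52, size=n).astype(float),
    "total_rooms": households * rng.uniform(4, 7, size=n),
    "population": households * rng.uniform(2, 4, size=n),
    "households": households,
    "median_income": np.clip(rng.normal(4, 1.5, size=n), 0.5, None),
    "median_house_value": np.clip(rng.normal(200_000, 80_000, size=n), 15_000, None),
})

# Average rooms per household, derived as described in the text.
df["rooms_per_household"] = df["total_rooms"] / df["households"]

# Put every feature on a comparable scale (zero mean, unit variance).
X = StandardScaler().fit_transform(df)

# Cumulative explained variance for each possible number of components;
# plotting this curve is one way to pick the number of components.
cum_var = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# Project onto 2 components, the number chosen in the text.
X_2d = PCA(n_components=2).fit_transform(X)

# Inertia (for the elbow plot) and silhouette score for each candidate k.
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_2d, km.labels_)

# The k with the highest silhouette score is one reasonable candidate.
best_k = max(silhouettes, key=silhouettes.get)
print(cum_var.round(2), best_k)
```

The elbow method looks for the k after which inertia stops dropping sharply, while the silhouette score rewards clusters that are compact and well separated, so checking both gives two independent views of the same choice.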
Model Insights
Once the model assigned each district to a cluster, I created boxplots for each of the features, separated by cluster. The boxplots showed that the first cluster had districts with older houses, as well as lower populations and fewer households. Conversely, the second cluster had newer houses and a higher population with more households. There were only very slight differences between the clusters’ income and house value.
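The same comparison can be made numerically by summarizing each feature per cluster, which gives the quartiles a boxplot would draw. The sketch below uses a handful of made-up rows purely to show the mechanics; the values are not taken from the dataset or from the model's output, and the `cluster` column name is an assumption.

```python
import pandas as pd

# Toy rows invented for illustration only (not real districts or results).
df = pd.DataFrame({
    "housing_median_age": [45, 50, 38, 12, 8, 15],
    "population": [600, 450, 700, 2500, 3100, 2800],
    "cluster": [0, 0, 0, 1, 1, 1],  # labels as k-means would assign them
})

# Quartiles per cluster: the box edges and median line of a boxplot.
summary = df.groupby("cluster").quantile([0.25, 0.5, 0.75])

# Medians alone give a quick side-by-side view of the clusters.
medians = df.groupby("cluster").median()
print(summary)
print(medians)
```

With matplotlib installed, `df.boxplot(column="population", by="cluster")` should draw the corresponding plot for a single feature.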
Impact
Using this model, we could more easily identify the housing needs of each district within California. Districts that fall within the first cluster could be classified as districts that need housing built. Building more houses within these districts could result in the population being distributed more evenly across districts, but this model cannot determine whether this is actually plausible. These districts could have conditions that restrict the building of houses, and the model is unable to detect those conditions. The upside of this model is that it allows us to narrow down the number of districts that need new housing.


