In this post, I will compare the results of applying linear regression and k-nearest neighbors to two different datasets.
Boston Housing Prices
You can download this data from the UCI Machine Learning Repository, at https://archive.ics.uci.edu/ml/datasets/Housing. Alternatively, you can find the Boston data in CSV format at http://github.com/shivathudi/machine-learning/linear_regression_vs_kNN/boston.csv. There are 14 columns and 506 rows.
The last column of data is the target variable that we wish to predict, which is the median value of owner-occupied homes in $1000’s. Some of the predictor variables are per capita crime rates by town, the average number of rooms per dwelling, the weighted distances to five Boston employment centres, and the proportion of owner-occupied units built prior to 1940.
Our goal is to compare the performance of linear regression and k-nearest neighbors on this data. I train on all rows except the last 50, which I reserve for testing the models' performance. These are the results on the testing data:
We find that linear regression outperforms k-nearest neighbors, with a testing MSE of 10.96. With k-nearest neighbors regression, the lowest MSE was 18.063, achieved at k = 114 neighbors. When the search is restricted to k between 1 and 50, however, the lowest MSE was 22.003, at k = 3.
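The comparison above can be sketched with scikit-learn. The helper below holds out the last rows as a test set and sweeps k for the k-NN regressor; the synthetic data at the bottom is only a stand-in so the snippet runs on its own. In the actual setup you would load boston.csv and use its last column (median home value) as y.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def compare_models(X, y, n_test=50, k_range=range(1, 51)):
    """Hold out the last n_test rows, then compare linear regression
    against k-NN regression over a range of k values."""
    X_train, X_test = X[:-n_test], X[-n_test:]
    y_train, y_test = y[:-n_test], y[-n_test:]

    # Linear regression baseline.
    lr = LinearRegression().fit(X_train, y_train)
    lr_mse = mean_squared_error(y_test, lr.predict(X_test))

    # k-NN regression: keep the k with the lowest test MSE.
    best_k, best_mse = None, float("inf")
    for k in k_range:
        knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
        mse = mean_squared_error(y_test, knn.predict(X_test))
        if mse < best_mse:
            best_k, best_mse = k, mse
    return lr_mse, best_mse, best_k

# Synthetic stand-in for the Boston data: 500 rows, 13 predictors,
# a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=500)
lr_mse, knn_mse, best_k = compare_models(X, y)
print(lr_mse, knn_mse, best_k)
```

Because the last 50 rows of the Boston data are not a random sample, this split is a deliberate choice to match the setup described above rather than a general recommendation.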
U.S. Monthly Climate Normals Data Set
You can find the U.S. Monthly Climate Normals data through data.gov, at ftp://ftp.ncdc.noaa.gov/pub/data/normals/1981-2010/products/hourly/
The file hly-temp-normal.txt contains the 30-year (1981-2010) mean hourly temperatures for each of the 457 weather stations in the USA. Each line of the file is space-separated, formatted as shown in Table 1.
There are a few things to note:
- Most temperature readings are followed by a flag character (e.g., "C" for "Confirmed"). I ignore this trailing character.
- Missing or unavailable temperature readings are flagged as "-9999" or similar. I imputed these values using the mean temperature across all stations for that hour. A better strategy would be to fall back to the same station's reading from the previous hour.
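The parsing and imputation steps above can be sketched as follows. The token layout, the tiny sample data, and the helper names (`parse_reading`, `impute_hourly_means`) are illustrative assumptions, not the actual code.

```python
import numpy as np

MISSING = -9999

def parse_reading(token):
    """Strip a trailing flag character (e.g. 'C' for 'Confirmed')
    and return the reading as an integer."""
    if token[-1].isalpha():
        token = token[:-1]
    return int(token)

def impute_hourly_means(values):
    """Replace missing readings (-9999) in each column (hour) with the
    mean of the non-missing readings for that hour across stations."""
    values = np.asarray(values, dtype=float)
    for hour in range(values.shape[1]):
        col = values[:, hour]
        missing = col == MISSING
        if missing.any():
            col[missing] = col[~missing].mean()
    return values

# Hypothetical example: three stations, two hourly readings each,
# in the value-plus-flag token format described above.
raw = [["744C", "731C"], ["-9999", "702C"], ["768C", "-9999"]]
parsed = [[parse_reading(t) for t in row] for row in raw]
imputed = impute_hourly_means(parsed)
print(imputed)
```

Each missing value is filled with the mean of the remaining stations for that hour, which is exactly the cross-station imputation described above.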
For my testing set, I held out the stations USW00023234, USW00014918, USW00012919, USW00013743, and USW00025309. I get the following results:
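A station-based split like this can be sketched with pandas. The tiny frame below and its column names are hypothetical; the point is that the split is by station ID, so all rows from the five held-out stations land in the test set.

```python
import pandas as pd

TEST_STATIONS = {"USW00023234", "USW00014918", "USW00012919",
                 "USW00013743", "USW00025309"}

# Hypothetical frame: one row per station-hour, with a 'station' column.
df = pd.DataFrame({
    "station": ["USW00023234", "USW00094728", "USW00025309", "USW00003927"],
    "temp": [74.4, 70.2, 36.8, 81.0],
})

# Split by station ID so the held-out stations form the test set.
is_test = df["station"].isin(TEST_STATIONS)
train_df, test_df = df[~is_test], df[is_test]
print(len(train_df), len(test_df))
```

Splitting by station (rather than by row) keeps every reading from a test station out of training, which avoids leaking a station's own history into the model.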
We see that for this particular dataset, linear regression once again outperforms k-nearest neighbors.
You can find my code here.