
linear regression datasets csv python



How does regression, particularly linear regression, play a role in machine learning?
Given a set of data points, the objective is to find the line that fits them best. Once that line is known, it can be used to predict values for new inputs.
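To make "best fit" concrete, here is a minimal sketch of ordinary least squares for a single input variable, which is the standard way such a line is chosen. The data points and the helper name fit_line are invented for illustration and are not part of the housing dataset.

```python
import numpy as np

# Toy data invented for illustration (not taken from the housing dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = covariance of x and y divided by the variance of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The least-squares line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")

# Predicting for a new input is just evaluating the line
print(intercept + slope * 6.0)
```

Library routines such as scikit-learn's LinearRegression solve the same minimization, so in practice you rarely write this by hand.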
Let’s use a practical example: housing data. This dataset includes variables such as price, lot size, and whether the house has a driveway. The dataset relevant to this article can be downloaded here.
Any data exported from Excel and saved in CSV format can be processed the same way. For our purposes, we’ll use Python’s Pandas library to import the dataset.


Prerequisites:
For this exercise, ensure you have the following Python modules installed:

pip install scikit-learn
pip install scipy
pip install pandas matplotlib

Visualizing the Dataset:
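Matplotlib normally selects a suitable graphical backend on its own. If you need to force a specific one, you can add the following optional line before importing pyplot. TkAgg is shown here because the GTKAgg backend is no longer available in current Matplotlib releases:

matplotlib.use('TkAgg')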

The first step is to import the necessary modules and load the dataset. Good predictions rely heavily on a robust dataset.
We’ll use Pandas, a data analysis toolkit for Python, to read the CSV file into a data structure known as a DataFrame. This structure supports manipulation at both the row and the column level.
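As a rough illustration of row-level and column-level access, the sketch below loads a few rows from an in-memory CSV string. The column names follow the ones used in this article (price, lotsize, driveway), but the row values are invented stand-ins for Housing.csv.

```python
import io

import pandas as pd

# Hypothetical rows standing in for Housing.csv; the values are invented
csv_text = """price,lotsize,driveway
42000,5850,yes
38500,4000,yes
49500,3060,yes
60500,6650,no
"""

df = pd.read_csv(io.StringIO(csv_text))

# Column-level access: select one column as a Series
sizes = df["lotsize"]

# Row-level access: keep only the rows matching a condition
with_driveway = df[df["driveway"] == "yes"]

# scikit-learn expects X as a 2-D array of shape (n_samples, n_features),
# so a single column is converted to NumPy and reshaped to (n_samples, 1)
X = df["lotsize"].to_numpy().reshape(-1, 1)
Y = df["price"].to_numpy().reshape(-1, 1)

print(df.shape, X.shape, len(with_driveway))
```

With a real file, pd.read_csv("Housing.csv") replaces the io.StringIO call; everything after that stays the same.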
For our exercise, we’ll construct two arrays: X (representing size) and Y (representing price). Naturally, one would hypothesize a correlation between house sizes and their prices.
The dataset is divided into training and test subsets. The training data is used to determine the best fit line, and the test data lets us check its predictions against values the model has not seen.

import matplotlib
# matplotlib.use('TkAgg')  # optional: force a specific GUI backend
import matplotlib.pyplot as plt
from sklearn import linear_model
import pandas as pd

# Importing CSV and defining columns
df = pd.read_csv("Housing.csv")

# Convert each column to a NumPy array of shape (n_samples, 1);
# a Pandas Series has no reshape() method in current versions
Y = df['price'].to_numpy().reshape(-1, 1)
X = df['lotsize'].to_numpy().reshape(-1, 1)

# Segmenting the data: the last 250 rows form the test subset
X_train = X[:-250]
X_test = X[-250:]

Y_train = Y[:-250]
Y_test = Y[-250:]

# Visualizing the test data
plt.scatter(X_test, Y_test, color='black')
plt.title('Housing Test Data')
plt.xlabel('House Size')
plt.ylabel('House Price')
# Hide the tick marks on both axes
plt.xticks(())
plt.yticks(())

plt.show()
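The slice-based split above takes the last 250 rows as the test subset, which can be skewed if the file happens to be sorted. One common alternative, shown here as a sketch and not as this tutorial's method, is scikit-learn's train_test_split, which shuffles before splitting. The arrays below are synthetic stand-ins with invented values, since the real file is not bundled here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing arrays (invented values)
rng = np.random.default_rng(0)
X = rng.uniform(2000, 10000, size=(500, 1))
Y = 30000 + 6.0 * X + rng.normal(0, 5000, size=(500, 1))

# Hold out 250 rows for testing, but shuffle the rows first.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=250, random_state=42
)

print(X_train.shape, X_test.shape)
```

Passing an integer as test_size requests an absolute number of rows; a float between 0 and 1 requests a fraction instead.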

With the test data visualized, we’ll proceed to derive the best fit line.

# Initializing the linear regression model
regr = linear_model.LinearRegression()

# Training the model using the training subset
regr.fit(X_train, Y_train)

# Plotting the test points and the fitted line over them
plt.scatter(X_test, Y_test, color='black')
plt.plot(X_test, regr.predict(X_test), color='red', linewidth=3)
plt.show()

This plot shows the best fit line, which was fitted on the training subset, drawn across the range of the test data.
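Beyond looking at the plot, it can help to inspect the fitted line numerically. The sketch below reads the slope and intercept from the trained model and scores its predictions on the test subset. It runs on synthetic data with invented values (a known slope of 6.0 plus noise), so the printed numbers say nothing about the real housing data.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the housing arrays (invented values)
rng = np.random.default_rng(0)
X = rng.uniform(2000, 10000, size=(500, 1))
Y = 30000 + 6.0 * X + rng.normal(0, 5000, size=(500, 1))

# Same slice-based split as in the article: last 250 rows are the test subset
X_train, X_test = X[:-250], X[-250:]
Y_train, Y_test = Y[:-250], Y[-250:]

regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)

# With a 2-D Y of shape (n, 1), coef_ has shape (1, 1) and intercept_ shape (1,)
slope = regr.coef_[0][0]
intercept = regr.intercept_[0]

Y_pred = regr.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print("slope:", slope)
print("intercept:", intercept)
print("mean squared error:", mse)
print("R^2 on test data:", r2)
```

An R^2 close to 1 means the line explains most of the variation in the test prices; a value near 0 means it does little better than predicting the mean.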

To make a single prediction with the linear regression model, pass the input as a 2-D array with one row and one feature:

print(round(regr.predict([[5000]]).item()))
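The following sketch shows why the input is wrapped in a nested list and confirms that a single prediction is just the fitted line evaluated at that point. The four data points are invented and lie exactly on a known line, so the result can be checked by hand.

```python
import numpy as np
from sklearn import linear_model

# Invented data lying exactly on: price = 10000 + 5 * size
X = np.array([[1000.0], [2000.0], [3000.0], [4000.0]])
Y = 10000 + 5 * X

regr = linear_model.LinearRegression().fit(X, Y)

# predict() expects shape (n_samples, n_features),
# so one value for one feature is written as [[5000]]
single = regr.predict([[5000]])

# The same number computed directly from the line's parameters
manual = regr.intercept_[0] + regr.coef_[0][0] * 5000

print(single.shape)
print(round(float(single[0][0])), round(float(manual)))
```

Passing a bare scalar such as 5000 raises an error in current scikit-learn versions, because the model cannot tell whether it is one sample or one feature.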
