Linear Regression using Pandas #16540

lumylovepandas · 2017-05-30T17:41:03Z

Code Sample from Stamford

http://stamfordresearch.com/linear-regression-using-pandas-python/ (original code)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'length': [94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82, 86, 88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78],
    'weight': [130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80, 83, 70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57]
})
 
# create another data frame of log values
data_log = np.log(data)


# ========================
# Model for Original Data
# ========================
 
lm_original = np.polyfit(data.length, data.weight, 1)
polynomial = np.poly1d(lm_original)
y = polynomial(data.length)

lm_original_plot = pd.DataFrame({
    'length': data.length,
    'weight': y
})

# ========================
# Model for Log Data
# ========================
 
# Get the linear models
lm_log = np.polyfit(data_log.length, data_log.weight, 1)
 
# calculate the y values based on the co-efficients from the model
r_x, r_y = zip(*((i, i*lm_log[0] + lm_log[1]) for i in data_log.length))
 
# Put into a data frame, to keep it all nice
lm_log_plot = pd.DataFrame({
    'length': r_x,
    'weight': r_y
})


# ========================
# Plot the data
# ========================
fig, axes = plt.subplots(nrows=1, ncols=2)
 
# Plot the original data and model
data.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[0],title='Original Values')
lm_original_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[0])
 
# Plot the log transformed data and model
data_log.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[1], title='Log Values')
lm_log_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[1])

plt.show()

Problem description

pandas does not produce the plot shown on the website.
The bug is in these lines:
data.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[0],title='Original Values')
lm_original_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[0])
The x-axis values come out flipped when pandas plots them.
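A quick check, offered only as a sketch: the `length` column handed to the line plot is not in increasing order, which may be relevant to the reversed axis described above. The name `fitted` is hypothetical and stands in for `lm_original_plot`. Sorting by `length` is shown as one possible way to put the x values in increasing order; the thread does not confirm it as the fix.

```python
import numpy as np
import pandas as pd

# Same data as in the code sample above.
data = pd.DataFrame({
    'length': [94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82, 86,
               88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78],
    'weight': [130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80, 83,
               70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57],
})

# Fit a straight line, as the original sample does with np.polyfit.
slope, intercept = np.polyfit(data['length'], data['weight'], 1)

# 'fitted' is a hypothetical name standing in for lm_original_plot.
fitted = pd.DataFrame({
    'length': data['length'],
    'weight': slope * data['length'] + intercept,
})

# The x values are in the original row order, not sorted.
print(fitted['length'].is_monotonic_increasing)  # False

# Assumption: sorting by x before plotting gives an increasing x sequence.
fitted_sorted = fitted.sort_values('length')
print(fitted_sorted['length'].is_monotonic_increasing)  # True
```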

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.5
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger · 2017-05-30T17:43:52Z

Can you simplify the problem, and narrow down what the exact bug is?

lumylovepandas · 2017-05-30T17:48:23Z

The bug is at this line:
data.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[0],title='Original Values')
lm_original_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[0])

Please look at the x-axis: the values read 90, 85, 80, so the sequence is reversed. It is supposed to be in increasing order.

TomAugspurger · 2017-05-30T17:53:41Z

Could you simplify the example output then? You should be able to construct a simple dataframe or series that shows the unexpected output. No need for any of the regression stuff.

I'd like the example to be as simple as possible, so we can clarify what the point of confusion is. I think pandas may be behaving as intended here, but I'd like to make sure first.

lumylovepandas · 2017-05-30T17:55:51Z

I can't simplify the output as you wish; you need to click the link and study the example.

TomAugspurger · 2017-05-30T19:40:51Z

@lumylovepandas here's a minimal example that demonstrates the problem

import pandas as pd
df = pd.DataFrame({"x": [90, 80, 85], "y": [10, 20, 30]})
ax = df.plot(x='x', y='y')

Since this doesn't have any extraneous information (like data generation, regression, additional plots) it's easier to see that it's a duplicate of #10118. Could you post there if you have feedback? It'd be valuable to have additional voices there. I can see why the current output is surprising, and it's probably just an implementation detail that's exposed to the user.
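For anyone landing here, a possible workaround sketch, assuming the surprise comes from the x values being unsorted: sort the frame by `x` before plotting so the line is drawn left to right. The variable name `ordered` and the `Agg` backend line are illustrative assumptions, not part of the report, and this is not the resolution of #10118.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import pandas as pd

# The minimal example from above.
df = pd.DataFrame({"x": [90, 80, 85], "y": [10, 20, 30]})

# Sort rows by x so the plotted sequence is increasing.
ordered = df.sort_values("x")
ax = ordered.plot(x="x", y="y")

print(ordered["x"].tolist())  # [80, 85, 90]
print(ordered["y"].tolist())  # [20, 30, 10]
```

Sorting changes the order in which points are connected, so it suits data like a fitted line where the row order carries no meaning.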

TomAugspurger closed this as completed May 30, 2017

TomAugspurger added the Duplicate Report label May 30, 2017

TomAugspurger added this to the No action milestone May 30, 2017
