Linear Regression using Pandas #16540

Closed
lumylovepandas opened this issue May 30, 2017 · 5 comments
Labels: Duplicate Report

Comments

@lumylovepandas
lumylovepandas commented May 30, 2017

Code Sample from Stamford

http://stamfordresearch.com/linear-regression-using-pandas-python/ (original code)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({
'length' : [94,74,147,58,86,94,63,86,69,72,128,85,82,86,88,72,74,61,90,89,68,76,114,90,78],
'weight' : [130,51,640,28,80,110,33,90,36,38,366,84,80,83,70,61,54,44,106,84,39,42,197,102,57]
})
 
# create another data frame of log values
data_log = np.log(data)


# ========================
# Model for Original Data
# ========================
 
lm_original = np.polyfit(data.length, data.weight, 1)
polynomial = np.poly1d(lm_original)
y = polynomial(data.length)

lm_original_plot = pd.DataFrame({
'length' : data.length,
'weight' : y
})

# ========================
# Model for Log Data
# ========================
 
# Get the linear models
lm_log = np.polyfit(data_log.length, data_log.weight, 1)
 
# calculate the y values based on the co-efficients from the model
r_x, r_y = zip(*((i, i*lm_log[0] + lm_log[1]) for i in data_log.length))
 
# Put into a data frame, to keep it all tidy
lm_log_plot = pd.DataFrame({
'length' : r_x,
'weight' : r_y
})


# ========================
# Plot the data
# ========================
fig, axes = plt.subplots(nrows=1, ncols=2)
 
# Plot the original data and model
data.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[0],title='Original Values')
lm_original_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[0])
 
# Plot the log transformed data and model
data_log.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[1], title='Log Values')
lm_log_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[1])

plt.show()

Problem description

pandas does not produce the plot shown on the website.
The bug is at these lines:
data.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[0],title='Original Values')
lm_original_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[0])
The x-axis values come out in reversed order when pandas plots them.

image

Expected Output

image

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.5
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Can you simplify the problem, and narrow down what the exact bug is?

@lumylovepandas
Author

The bug is at these lines:
data.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[0],title='Original Values')
lm_original_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[0])

Please look at the x-axis: the values run 90, 85, 80, so the sequence is reversed. It is supposed to be in increasing order.
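
One way to check the ordering described above without drawing anything is to inspect the x column directly. The sketch below takes the first six rows of the `data` frame from the original post (shortened for illustration); the names `sample`, `is_sorted`, and `sample_sorted` are hypothetical. It only shows that the rows are not sorted by `length`. It does not show how pandas places them on the axis.

```python
import pandas as pd

# First six rows of the data from the original post (shortened for illustration).
sample = pd.DataFrame({
    "length": [94, 74, 147, 58, 86, 94],
    "weight": [130, 51, 640, 28, 80, 110],
})

# The x column is not in increasing order, so anything drawn in row order
# will not move strictly left to right.
is_sorted = sample["length"].is_monotonic_increasing

# Sorting by the x column gives a frame whose rows run in increasing x.
sample_sorted = sample.sort_values("length").reset_index(drop=True)

print(is_sorted)
print(sample_sorted["length"].tolist())
```

If `is_sorted` comes back false for the full data set as well, the row order is a plausible place to look before suspecting the regression code.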

@TomAugspurger
Contributor

Could you simplify the example output then? You should be able to construct a simple dataframe or series that shows the unexpected output. No need for any of the regression stuff.

I'd like the example to be as simple as possible, so we can clarify what the point of confusion is. I think pandas may be behaving as intended here, but I'd like to make sure first.

@lumylovepandas
Author

I can't simplify the output as you wish; you need to click the link and study the example.

@TomAugspurger
Contributor

TomAugspurger commented May 30, 2017

@lumylovepandas here's a minimal example that demonstrates the problem

import pandas as pd
df = pd.DataFrame({"x": [90, 80, 85], "y": [10, 20, 30]})
ax = df.plot(x='x', y='y')

Since this doesn't have any extraneous information (like data generation, regression, additional plots) it's easier to see that it's a duplicate of #10118. Could you post there if you have feedback? It'd be valuable to have additional voices there. I can see why the current output is surprising, and it's probably just an implementation detail that's exposed to the user.
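
A possible workaround, not proposed in this thread and offered only as a sketch: sort the frame by the x column before calling `plot`, or pass the columns to matplotlib's `Axes.plot` directly, which places each point at its numeric x value. The `Agg` backend and the names `df_sorted` and `mpl_xdata` are assumptions made so the example runs headless. Whether either option is appropriate depends on how #10118 is resolved.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Same frame as the minimal example above.
df = pd.DataFrame({"x": [90, 80, 85], "y": [10, 20, 30]})

fig, axes = plt.subplots(nrows=1, ncols=2)

# Option 1: sort by the x column first, then let pandas draw the line.
df_sorted = df.sort_values("x")
df_sorted.plot(x="x", y="y", ax=axes[0], title="sorted, pandas")

# Option 2: hand the columns to matplotlib, which positions each point
# at its numeric x value regardless of row order.
axes[1].plot(df["x"], df["y"])
axes[1].set_title("unsorted, matplotlib")

# x data of the line matplotlib drew in the second panel, as plain ints.
mpl_xdata = [int(v) for v in axes[1].lines[0].get_xdata()]
```

With Option 2 the points are still connected in row order, so a line through unsorted points can double back on itself. For a fitted straight line that makes no visible difference, since every point lies on the same line.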

TomAugspurger added the Duplicate Report label May 30, 2017
TomAugspurger added this to the No action milestone May 30, 2017