possible error in documentation #26865

Closed
karpanGit opened this issue Jun 15, 2019 · 12 comments · Fixed by #26891

@karpanGit

karpanGit commented Jun 15, 2019

The page

https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#string-regular-expression-replacement

seems to have an example that can be improved. The page lists

df.replace([r'\.', r'(a)'], ['dot', '\1stuff'], regex=True)

as an example. However, '\1' is not treated as a backreference because the replacement string is not a raw string: Python reads '\1' as the control character '\x01'. I think what you meant is likely

df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)

If my understanding is correct, please consider updating the page.

Regards,

Panos Karamertzanis
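
A minimal sketch of why the raw-string prefix matters here, using only Python's standard `re` module rather than pandas. The variable names are illustrative and not taken from the documentation:

```python
import re

# In a normal string literal, \1 is an octal escape for one control character.
non_raw = '\1stuff'
# With the r prefix, the backslash and the digit are kept as two characters.
raw = r'\1stuff'

print(len(non_raw), len(raw))   # 6 7
print(non_raw == '\x01stuff')   # True

# The regex engine only sees a group reference in the raw version.
print(repr(re.sub(r'(a)', non_raw, 'a')))   # '\x01stuff'
print(repr(re.sub(r'(a)', raw, 'a')))       # 'astuff'
```

Without the prefix the regex engine never receives a backslash, so it inserts the control character literally instead of the captured group.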

@topper-123
Contributor

Thanks. You are probably right (I don't have a computer at hand to double-check). A contribution on this would be welcome.

@Kischy
Contributor

Kischy commented Jun 16, 2019

From what I tested, the line should be

df.replace([r'\.', r'(a)'], ['dot', 'stuff'], regex=True)

Only then do I get the output the documentation requires:

import numpy as np
import pandas as pd


#pd.show_versions()


d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', np.nan, 'd']}
df = pd.DataFrame(d)

#print(df)
#print("-------------")
print(df.replace([r'\.', r'(a)'], ['dot', 'stuff'], regex=True))

Output:

   a      b      c
0  0  stuff  stuff
1  1      b      b
2  2    dot    NaN
3  3    dot      d

@topper-123
Contributor

Both versions work, but the original example was meant to show a regex -> regex grouped replacement, so if you just make the string a raw string as you originally suggested, that will fix the error.

@Kischy
Contributor

Kischy commented Jun 16, 2019

@topper-123
If I do it as originally suggested, then the output is

   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d

Is it intended that the character 'a' is kept in the output, i.e. 'astuff' in columns b and c of the first row?

Code:

import numpy as np
import pandas as pd

d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', np.nan, 'd']}
df = pd.DataFrame(d)

print(df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True))

@karpanGit
Author

karpanGit commented Jun 16, 2019

In my view the example intends to demonstrate the regex -> regex transformation and at the same time show how to use capturing brackets in the regular expression. The original dataframe is

d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', np.nan, 'd']}
df = pd.DataFrame(d)

i.e. the original data frame is

   a  b    c
0  0  a    a
1  1  b    b
2  2  .  NaN
3  3  .    d

with the intended example

df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)

we would like to replace '.' with 'dot' and also replace 'a' with 'astuff'. Indeed, the above code does exactly this and yields:

   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d

This is what the example intends to show.

@topper-123
Contributor

Agree with @karpanGit on the example's intention.

Having said that, the strings used have no meaning. If you want to find an example where the strings and the regex operation are more meaningful, that could help people understand the example better.
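
One possible direction, offered as a hypothetical sketch and not as the example adopted in the documentation: the column name `weight` and its values are invented for illustration. It keeps the same shape as the original call (a list of patterns, a list of replacements, one of them using a captured group) but gives the strings a purpose.

```python
import pandas as pd

# Hypothetical data: weights recorded without a space, plus a placeholder value.
df = pd.DataFrame({'weight': ['10kg', '250kg', 'n/a']})

# First pair: replace the placeholder with a plain word.
# Second pair: capture the digits and reinsert them with a space before the unit.
# The replacement must be a raw string so that \1 reaches the regex engine.
result = df.replace([r'^n/a$', r'^(\d+)kg$'], ['unknown', r'\1 kg'], regex=True)

print(result)
```

Here the captured group carries information that has to survive the replacement, which makes the role of `\1` easier to see than in the `astuff` case.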

@karpanGit
Author

I agree with you @topper-123.

One naive question: how does improving the documentation work? Are users like myself supposed to make concrete proposals, or only report observations?

Many thanks and apologies for my ignorance on how things work.

@topper-123
Contributor

topper-123 commented Jun 16, 2019

If you are asking specifically about how an issue gets resolved in pandas, it's all on a volunteer basis, and no one is obliged to fix a bug that you've reported. So in practice the best way to get things fixed is to submit a pull request yourself, including for the documentation :-). Contributions are always welcome, as I mentioned.

@Kischy
Contributor

Kischy commented Jun 16, 2019

Okay, if that is intended, then the correct line is

df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)

@topper-123
Contributor

Yes, that's right.

@Kischy
Contributor

Kischy commented Jun 16, 2019

Perfect. Is the pull request correct the way I did it?

@topper-123
Contributor

Yes.
