Docstring validation script issues #20318

Closed
m-dz opened this issue Mar 13, 2018 · 5 comments

@m-dz

m-dz commented Mar 13, 2018

Problem description

I am currently working on my PR for the documentation sprint (for pd.DataFrame.to_csv) and have a few issues with the docstring validation script; please see below. I merged master about 5 minutes ago, so everything should be up to date.

"Code"

My "Parameters" section starts with:

        Parameters
        ----------
        path_or_buf : str or file handle, default None
            File path or object, if None is provided the result is returned as
            a string.
        sep : str (length 1), default ','
            Field delimiter for the output file.
        na_rep : str, default ''
            Missing data representation.

Validation script outputs:

Two runs, one right after the other, without any change to the docstring:

Errors found:
        Docstring text (summary) should start in the line immediately after the opening quotes (not in the same line, or leaving a blank line in between)
        Errors in parameters section
                Parameters {'decimal', 'sep', 'compression', 'date_format', 'doublequote', 'header', 'na_rep', 'quoting', 'line_terminator', 'quotechar', 'encoding', 'tupleize_cols', 'mode', 'chunksize', 'escapechar', 'index_label', 'float_format', 'columns', 'index'} not documented
                Unknown parameters {"'``"}
                Parameter "path_or_buf" description should start with capital letter
                Parameter "path_or_buf" description should finish with "."
                Parameter "'``" has no type
                Parameter "'``" description should start with capital letter
                Parameter "'``" description should finish with "."
        No returns section found
Errors found:
        Docstring text (summary) should start in the line immediately after the opening quotes (not in the same line, or leaving a blank line in between)
        Errors in parameters section
                Parameters {'index_label', 'float_format', 'decimal', 'header', 'encoding', 'chunksize', 'na_rep', 'doublequote', 'quotechar', 'sep', 'compression', 'tupleize_cols', 'quoting', 'line_terminator', 'date_format', 'mode', 'columns', 'escapechar', 'index'} not documented
                Unknown parameters {"'``"}
                Parameter "path_or_buf" description should start with capital letter
                Parameter "path_or_buf" description should finish with "."
                Parameter "'``" has no type
                Parameter "'``" description should start with capital letter
                Parameter "'``" description should finish with "."
        No returns section found
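
A quick way to check whether the two "not documented" lists above really differ is to compare them as sets. The sketch below copies both lists from the output; the variable names are only illustrative. Python sets are unordered, and string hashing is randomized per process by default, so the same set can print in a different order on each run. That is a likely explanation for the changing order, not a confirmed diagnosis of the script.

```python
# Parameter names copied from the two validation runs shown above.
first_run = {'decimal', 'sep', 'compression', 'date_format', 'doublequote',
             'header', 'na_rep', 'quoting', 'line_terminator', 'quotechar',
             'encoding', 'tupleize_cols', 'mode', 'chunksize', 'escapechar',
             'index_label', 'float_format', 'columns', 'index'}
second_run = {'index_label', 'float_format', 'decimal', 'header', 'encoding',
              'chunksize', 'na_rep', 'doublequote', 'quotechar', 'sep',
              'compression', 'tupleize_cols', 'quoting', 'line_terminator',
              'date_format', 'mode', 'columns', 'escapechar', 'index'}

# Set equality ignores order, so this tells us whether the content changed.
same_content = first_run == second_run
print(same_content)

# Sorting before printing gives a stable, run-independent message.
print(sorted(first_run))
```

If the comparison is true, only the display order differs between runs, and sorting the names before reporting them would make the output reproducible.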

Expected Output

Issues found:

  1. The docstring summary cannot start on a new line: when it does, I get the following error:
     Traceback (most recent call last):
       File "scripts/validate_docstrings.py", line 499, in <module>
         sys.exit(main(args.function))
       File "scripts/validate_docstrings.py", line 485, in main
         return validate_one(function)
       File "scripts/validate_docstrings.py", line 411, in validate_one
         doc.summary.split(' ')[0][-1] == 's'):
     IndexError: string index out of range
  2. sep is clearly documented, as far as I can tell, following section-3-parameters here;
  3. The "not documented" parameters list changes with each script run;
  4. Unknown parameters {"'``"} is caused by line_terminator : str, default ``'\n'``, kind of fixed with line_terminator : str, default '\n', but this also doesn't seem to be correct;
  5. Some other description issues, like those reported for path_or_buf, whose description seems to be correct (?);
  6. The Returns section seems to be optional ("If the method returns a value, it will be documented in this section." here), but the script lists this as the error "No returns section found".
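
The IndexError in the traceback above can be reproduced in isolation. The sketch below assumes the summary seen by the script was an empty string, which is consistent with the traceback but not confirmed; both function names are hypothetical and are not part of validate_docstrings.py.

```python
def first_word_ends_with_s(summary):
    """Unguarded check, modelled on the expression shown in the traceback."""
    return summary.split(' ')[0][-1] == 's'


def first_word_ends_with_s_safe(summary):
    """Same check, but str.endswith returns False for an empty first word."""
    return summary.split(' ')[0].endswith('s')


# ''.split(' ') is [''], so indexing the empty first word with [-1] fails.
try:
    first_word_ends_with_s('')
except IndexError as exc:
    error_message = str(exc)

print(error_message)
print(first_word_ends_with_s_safe(''))
print(first_word_ends_with_s_safe('Writes a DataFrame to a CSV file.'))
```

Using `endswith` avoids indexing into a possibly empty string, so an empty summary can be reported as a normal validation error instead of crashing the script.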

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 2ec022f
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0.dev0+539.g2ec022f0d.dirty

@TomAugspurger
Contributor

We're collecting issues on the validation script in #20298

Do you have a PR open? If not, could you make a PR and we'll work through issues there? Easier to comment on the code.

@m-dz
Author

m-dz commented Mar 13, 2018

Ekhm... After some fiddling with the file, changing the line line_terminator : str, default ``'\n'`` to line_terminator : str, default '\\n' "fixed" most of the issues:

Errors found:
        Errors in parameters section
                Parameters {'decimal'} not documented
                Unknown parameters {"decimal: str, default '.'"}
                Parameter "line_terminator" description should finish with "."
                Parameter "quoting" description should start with capital letter
                Parameter "quoting" description should finish with "."
                Parameter "quotechar" description should start with capital letter
                Parameter "quotechar" description should finish with "."
                Parameter "doublequote" description should finish with "."
                Parameter "escapechar" description should start with capital letter
                Parameter "escapechar" description should finish with "."
                Parameter "chunksize" description should start with capital letter
                Parameter "chunksize" description should finish with "."
                Parameter "tupleize_cols" description should start with capital letter
                Parameter "date_format" description should finish with "."
                Parameter "decimal: str, default '.'" has no type
                Parameter "decimal: str, default '.'" description should finish with "."
        No returns section found
        Missing description for See Also "file." reference
        Missing description for See Also "a" reference

My whole docstring (with the issues above NOT yet fixed) is below:

    """
    Write a DataFrame to a comma-separated values (CSV) file.

    Write a DataFrame to a comma-separated values (CSV) file with an user
    specified format (e.g. a separator, missing values representation,
    quoting, header and index specification etc.) and possible compression.

    Parameters
    ----------
    path_or_buf : str or file handle, default None
        File path or object, if None is provided the result is returned as
        a string.
    sep : str (length 1), default ','
        Field delimiter for the output file.
    na_rep : str, default ''
        Missing data representation.
    float_format : str, default None
        Format string for floating point numbers.
    columns : sequence, optional
        Columns to write.
    header : bool or list of str, default True
        Write out the column names. If a list of strings is given it is
        assumed to be aliases for the column names.
    index : bool, default True
        Write row names (index).
    index_label : str or sequence, or False, default None
        Column label for index column(s) if desired. If None is given, and
        `header` and `index` are `True`, then the index names are used. A
        sequence should be given if the DataFrame uses MultiIndex. If
        `False` do not print fields for index names.
        Use `index_label=False` for easier importing in R.
    mode : str, default 'w'
        Python write mode.
    encoding : str, optional
        A string representing the encoding to use in the output file,
        defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
    compression : str, optional
        A string representing the compression to use in the output file,
        allowed values are 'gzip', 'bz2', 'xz', only used when the first
        argument is a filename.
    line_terminator : str, default '\\n'
        The newline character or character sequence to use in the output
        file
    quoting : optional constant from csv module
        defaults to csv.QUOTE_MINIMAL. If you have set a `float_format`
        then floats are converted to strings and thus csv.QUOTE_NONNUMERIC
        will treat them as non-numeric
    quotechar : str (length 1), default '\"'
        character used to quote fields
    doublequote : bool, default True
        Control quoting of `quotechar` inside a field
    escapechar : str (length 1), default None
        character used to escape `sep` and `quotechar` when appropriate
    chunksize : int or None
        rows to write at a time
    tupleize_cols : bool, default False
        .. deprecated:: 0.21.0
           This argument will be removed and will always write each row
           of the multi-index as a separate row in the CSV file.

        Write MultiIndex columns as a list of tuples (if `True`) or in
        the new, expanded format, where each MultiIndex column is a row
        in the CSV (if `False`).
    date_format : str, default None
        Format string for datetime objects
    decimal: str, default '.'
        Character recognized as decimal separator. E.g. use ',' for
        European data

    See Also
    --------
    pandas.Series.to_csv : Write a Series to a comma-separated values (CSV)
    file.
    pandas.read_csv : Read a comma-separated values (CSV) file into
    a DataFrame.

    Examples
    --------
    Setup:

    >>> from csv import reader
    >>> from tempfile import TemporaryFile
    >>> def print_helper(temp):
    ...     # Read and print a "raw" version of the input file
    ...     # "Rewind" to the begining of the file
    ...     _ = temp.seek(0)
    ...     r = reader(temp, delimiter='X')
    ...     for row in r:
    ...         print(''.join(row))

    A simple example of writing (and reading) a CSV file:

    >>> df = pd.DataFrame({'col_a': [1, 2], 'col_b': [9, 8]},
    ...                   index=['a','b'])
    >>> df
       col_a  col_b
    a      1      9
    b      2      8
    >>> with TemporaryFile('w+') as temp:
    ...     df.to_csv(temp)
    ...     _ = temp.seek(0)
    ...     df_out = pd.read_csv(temp, sep=',', index_col=0)
    ...     print_helper(temp)
    ,col_a,col_b
    a,1,9
    b,2,8
    >>> df_out
       col_a  col_b
    a      1      9
    b      2      8

    Assert equality ignoring `dtype`

    >>> pd.testing.assert_frame_equal(df, df_out, check_dtype=False)

    **Custom formatting**

    Write a CSV file with a custom separator, missing value representation,
    and float and dates formatting:

    >>> df = pd.DataFrame({
    ...     'col_a': [1.0, 2.0],
    ...     'col_b': [0.0001, 0.01],
    ...     'date_col': pd.date_range('2018-03-10', '2018-03-11')
    ... })
    >>> df.iloc[0,0] = np.nan
    >>> df
       col_a   col_b   date_col
    0    NaN  0.0001 2018-03-10
    1    2.0  0.0100 2018-03-11
    >>> with TemporaryFile('w+') as temp:
    ...     df.to_csv(temp, sep=':', na_rep='NaNa', float_format='%.2f',
    ...               date_format='%Y/%m/%d')
    ...     _ = temp.seek(0)
    ...     df_out = pd.read_csv(temp, sep=':', na_values='NaNa',
    ...                          index_col=0, parse_dates=['date_col'])
    ...     print_helper(temp)
    :col_a:col_b:date_col
    0:NaNa:0.00:2018/03/10
    1:2.00:0.01:2018/03/11

    Note the "standard" Python NaN representation "NaN"

    >>> df_out
       col_a  col_b   date_col
    0    NaN   0.00 2018-03-10
    1    2.0   0.01 2018-03-11

    Assert equality with a rounded column to match the format used

    >>> df['col_b'] = np.round(df.col_b, 2)
    >>> pd.testing.assert_frame_equal(df, df_out, check_dtype=False)
    """

@m-dz
Author

m-dz commented Mar 13, 2018

Oh, hi @TomAugspurger, I didn't see your comment. I saw #20298, but thought it was for actual improvements rather than for fixing the current state. After my minor change described above the output is much cleaner; I'll fix those issues and submit a PR. In the meantime, my docstring is also above.

@m-dz changed the title from "Doctring validation script issues" to "Docstring validation script issues" Mar 13, 2018
@jorisvandenbossche
Member

@m-dz Ah, so the problem was the newline character \n in the docstring messing things up. I added that to the list of issues in #20298, so I'm closing this. But thanks for reporting!
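
To illustrate the newline problem, here is a minimal sketch (the three functions and the `doc_lines` helper are hypothetical, not pandas code). In a regular, non-raw docstring, `\n` is turned into a real line break, so the parameter line is split in two and the second half, `'``, looks like a separate entry. Escaping the backslash or using a raw docstring keeps the line intact.

```python
def plain():
    """
    line_terminator : str, default ``'\n'``
        The newline character to use.
    """


def escaped():
    """
    line_terminator : str, default ``'\\n'``
        The newline character to use.
    """


def raw():
    r"""
    line_terminator : str, default ``'\n'``
        The newline character to use.
    """


def doc_lines(func):
    """Return the non-blank docstring lines with surrounding whitespace removed."""
    return [line.strip() for line in func.__doc__.splitlines() if line.strip()]


# plain() yields three lines, because the escape became a real line break;
# escaped() and raw() both yield two lines with a literal backslash-n.
for func in (plain, escaped, raw):
    print(func.__name__, doc_lines(func))
```

The stray second line produced by `plain` matches the `Unknown parameters {"'``"}` message reported earlier in this thread, which suggests (but does not prove) that this is how the validation script ended up with that "parameter".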

For the to_csv docstring, as @TomAugspurger said, it's easier to give feedback if you open a PR, but one point: if you don't specify a path, the "csv file" is returned as a string, and can then simply be printed. That might be an easier alternative to the temporary file approach you're using now.
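
A minimal sketch of that suggestion, reusing the small DataFrame from the docstring example above (the variable name `csv_text` is only illustrative, and pandas must be installed):

```python
import pandas as pd

df = pd.DataFrame({'col_a': [1, 2], 'col_b': [9, 8]}, index=['a', 'b'])

# With no path (path_or_buf=None), to_csv returns the CSV content as a string.
csv_text = df.to_csv()
print(csv_text)
```

This removes the need for the TemporaryFile setup and the print_helper function in the Examples section, at the cost of not demonstrating a round trip through read_csv.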

@m-dz
Author

m-dz commented Mar 13, 2018

Looks like that was it, or maybe the double backticks; something there. I still need to check the HTML doc, but I'm having some trouble building it (I will dig more or create an issue later, but in short, the --single parameter doesn't seem to work and I am getting some errors while building).
