Docstring validation script issues #20318

m-dz · 2018-03-13T01:22:02Z

Problem description

I am currently working on my PR for the documentation sprint (for pd.DataFrame.to_csv) and have a few issues with the docstring validation script; please see below. Master was merged ~5 minutes ago, so everything should be up to date.

"Code"

My "Parameters" section starts with:

        Parameters
        ----------
        path_or_buf : str or file handle, default None
            File path or object, if None is provided the result is returned as
            a string.
        sep : str (length 1), default ','
            Field delimiter for the output file.
        na_rep : str, default ''
            Missing data representation.

Validation script outputs:

Two runs, one right after the other, without any change to the docstring:

Errors found:
        Docstring text (summary) should start in the line immediately after the opening quotes (not in the same line, or leaving a blank line in between)
        Errors in parameters section
                Parameters {'decimal', 'sep', 'compression', 'date_format', 'doublequote', 'header', 'na_rep', 'quoting', 'line_terminator', 'quotechar', 'encoding', 'tupleize_cols', 'mode', 'chunksize', 'escapechar', 'index_label', 'float_format', 'columns', 'index'} not documented
                Unknown parameters {"'``"}
                Parameter "path_or_buf" description should start with capital letter
                Parameter "path_or_buf" description should finish with "."
                Parameter "'``" has no type
                Parameter "'``" description should start with capital letter
                Parameter "'``" description should finish with "."
        No returns section found

Errors found:
        Docstring text (summary) should start in the line immediately after the opening quotes (not in the same line, or leaving a blank line in between)
        Errors in parameters section
                Parameters {'index_label', 'float_format', 'decimal', 'header', 'encoding', 'chunksize', 'na_rep', 'doublequote', 'quotechar', 'sep', 'compression', 'tupleize_cols', 'quoting', 'line_terminator', 'date_format', 'mode', 'columns', 'escapechar', 'index'} not documented
                Unknown parameters {"'``"}
                Parameter "path_or_buf" description should start with capital letter
                Parameter "path_or_buf" description should finish with "."
                Parameter "'``" has no type
                Parameter "'``" description should start with capital letter
                Parameter "'``" description should finish with "."
        No returns section found
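The two runs above list the same "not documented" parameters in a different order. A likely explanation, offered here as an assumption rather than something confirmed in the script, is that the names are held in a Python set of strings: set iteration order depends on string hashes, which are randomized per interpreter process unless PYTHONHASHSEED is fixed. A minimal sketch of a deterministic report, using a hypothetical helper name (`format_missing` is not part of validate_docstrings.py):

```python
def format_missing(documented, signature):
    """Return a stable message for parameters missing from the docstring.

    Hypothetical helper for illustration only.
    """
    missing = set(signature) - set(documented)
    # sorted() yields the same order on every run, unlike iterating a set
    return "Parameters {} not documented".format(sorted(missing))


message = format_missing(
    documented=["path_or_buf"],
    signature=["path_or_buf", "sep", "na_rep", "decimal"],
)
print(message)
```

Sorting only changes how the result is displayed; the set difference itself is the same on every run.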

Expected Output

Issues found:

The docstring summary cannot start on a new line: when it does, I get the following error:

Traceback (most recent call last):
  File "scripts/validate_docstrings.py", line 499, in <module>
    sys.exit(main(args.function))
  File "scripts/validate_docstrings.py", line 485, in main
    return validate_one(function)
  File "scripts/validate_docstrings.py", line 411, in validate_one
    doc.summary.split(' ')[0][-1] == 's'):
IndexError: string index out of range

sep is clearly documented, as far as I can tell, following section-3-parameters here;
The list of "not documented" parameters changes order with each script run;
Unknown parameters {"'``"} is caused by line_terminator : str, default ``'\n'``; it is kind of fixed with line_terminator : str, default '\n', but this also doesn't seem to be correct;
Some other description errors, like those for path_or_buf, are reported even though the descriptions seem to be correct (?);
The Returns section seems to be optional ("If the method returns a value, it will be documented in this section." here), but the script reports the error "No returns section found".
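The traceback above is consistent with the summary being an empty string: `''.split(' ')` returns `['']`, and indexing `[-1]` on that empty first word raises IndexError. The sketch below reproduces the failure and shows one possible guard. The name `first_word_ends_with_s` is hypothetical, and this is not necessarily how the script was or should be fixed:

```python
def first_word_ends_with_s(summary):
    """Check whether the first word of a summary ends in 's'.

    Hypothetical guard: str.endswith is safe on an empty string.
    """
    first_word = summary.split(' ')[0]
    return first_word.endswith('s')


# The unguarded expression from the traceback fails on an empty summary.
try:
    ''.split(' ')[0][-1] == 's'
    raised = False
except IndexError:
    raised = True

print(raised)
print(first_word_ends_with_s(''))
print(first_word_ends_with_s('Writes a file.'))
```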

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: 2ec022f
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0.dev0+539.g2ec022f0d.dirty


TomAugspurger · 2018-03-13T01:31:34Z

We're collecting issues on the validation script in #20298

Do you have a PR open? If not, could you make a PR and we'll work through issues there? Easier to comment on the code.

m-dz · 2018-03-13T01:35:27Z

Ahem... After some fiddling with the file, changing the line line_terminator : str, default ``'\n'`` to line_terminator : str, default '\\n' "fixed" most of the issues:

Errors found:
        Errors in parameters section
                Parameters {'decimal'} not documented
                Unknown parameters {"decimal: str, default '.'"}
                Parameter "line_terminator" description should finish with "."
                Parameter "quoting" description should start with capital letter
                Parameter "quoting" description should finish with "."
                Parameter "quotechar" description should start with capital letter
                Parameter "quotechar" description should finish with "."
                Parameter "doublequote" description should finish with "."
                Parameter "escapechar" description should start with capital letter
                Parameter "escapechar" description should finish with "."
                Parameter "chunksize" description should start with capital letter
                Parameter "chunksize" description should finish with "."
                Parameter "tupleize_cols" description should start with capital letter
                Parameter "date_format" description should finish with "."
                Parameter "decimal: str, default '.'" has no type
                Parameter "decimal: str, default '.'" description should finish with "."
        No returns section found
        Missing description for See Also "file." reference
        Missing description for See Also "a" reference
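The "decimal" errors in this output show the whole header line, decimal: str, default '.', being treated as a parameter name with no type. That is what one would expect if the parser only splits a header on the exact separator " : " (space, colon, space), which is an assumption about numpydoc-style parsing and not a quote of the real code. A simplified, hypothetical stand-in:

```python
def split_param_header(header):
    """Split a 'name : type' header line into (name, type).

    Simplified, hypothetical stand-in for a numpydoc-style parser.
    """
    if ' : ' in header:
        name, param_type = header.split(' : ', 1)
        return name.strip(), param_type.strip()
    # No ' : ' separator: the whole line is taken as the name, with no type
    return header.strip(), ''


with_space = split_param_header("date_format : str, default None")
without_space = split_param_header("decimal: str, default '.'")
print(with_space)
print(without_space)
```

Under this assumption, writing the header as decimal : str, default '.' would let the name and type be recognized separately.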

My whole docstring (with the issues above NOT yet fixed) below:

    """
    Write a DataFrame to a comma-separated values (CSV) file.

    Write a DataFrame to a comma-separated values (CSV) file with a user
    specified format (e.g. a separator, missing values representation,
    quoting, header and index specification etc.) and possible compression.

    Parameters
    ----------
    path_or_buf : str or file handle, default None
        File path or object, if None is provided the result is returned as
        a string.
    sep : str (length 1), default ','
        Field delimiter for the output file.
    na_rep : str, default ''
        Missing data representation.
    float_format : str, default None
        Format string for floating point numbers.
    columns : sequence, optional
        Columns to write.
    header : bool or list of str, default True
        Write out the column names. If a list of strings is given it is
        assumed to be aliases for the column names.
    index : bool, default True
        Write row names (index).
    index_label : str or sequence, or False, default None
        Column label for index column(s) if desired. If None is given, and
        `header` and `index` are `True`, then the index names are used. A
        sequence should be given if the DataFrame uses MultiIndex. If
        `False` do not print fields for index names.
        Use `index_label=False` for easier importing in R.
    mode : str, default 'w'
        Python write mode.
    encoding : str, optional
        A string representing the encoding to use in the output file,
        defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
    compression : str, optional
        A string representing the compression to use in the output file,
        allowed values are 'gzip', 'bz2', 'xz', only used when the first
        argument is a filename.
    line_terminator : str, default '\\n'
        The newline character or character sequence to use in the output
        file
    quoting : optional constant from csv module
        defaults to csv.QUOTE_MINIMAL. If you have set a `float_format`
        then floats are converted to strings and thus csv.QUOTE_NONNUMERIC
        will treat them as non-numeric
    quotechar : str (length 1), default '\"'
        character used to quote fields
    doublequote : bool, default True
        Control quoting of `quotechar` inside a field
    escapechar : str (length 1), default None
        character used to escape `sep` and `quotechar` when appropriate
    chunksize : int or None
        rows to write at a time
    tupleize_cols : bool, default False
        .. deprecated:: 0.21.0
           This argument will be removed and will always write each row
           of the multi-index as a separate row in the CSV file.

        Write MultiIndex columns as a list of tuples (if `True`) or in
        the new, expanded format, where each MultiIndex column is a row
        in the CSV (if `False`).
    date_format : str, default None
        Format string for datetime objects
    decimal: str, default '.'
        Character recognized as decimal separator. E.g. use ',' for
        European data

    See Also
    --------
    pandas.Series.to_csv : Write a Series to a comma-separated values (CSV)
    file.
    pandas.read_csv : Read a comma-separated values (CSV) file into
    a DataFrame.

    Examples
    --------
    Setup:

    >>> from csv import reader
    >>> from tempfile import TemporaryFile
    >>> def print_helper(temp):
    ...     # Read and print a "raw" version of the input file
    ...     # "Rewind" to the beginning of the file
    ...     _ = temp.seek(0)
    ...     r = reader(temp, delimiter='X')
    ...     for row in r:
    ...         print(''.join(row))

    A simple example of writing (and reading) a CSV file:

    >>> df = pd.DataFrame({'col_a': [1, 2], 'col_b': [9, 8]},
    ...                   index=['a','b'])
    >>> df
       col_a  col_b
    a      1      9
    b      2      8
    >>> with TemporaryFile('w+') as temp:
    ...     df.to_csv(temp)
    ...     _ = temp.seek(0)
    ...     df_out = pd.read_csv(temp, sep=',', index_col=0)
    ...     print_helper(temp)
    ,col_a,col_b
    a,1,9
    b,2,8
    >>> df_out
       col_a  col_b
    a      1      9
    b      2      8

    Assert equality ignoring `dtype`

    >>> pd.testing.assert_frame_equal(df, df_out, check_dtype=False)

    **Custom formatting**

    Write a CSV file with a custom separator, missing value representation,
    and float and dates formatting:

    >>> df = pd.DataFrame({
    ...     'col_a': [1.0, 2.0],
    ...     'col_b': [0.0001, 0.01],
    ...     'date_col': pd.date_range('2018-03-10', '2018-03-11')
    ... })
    >>> df.iloc[0,0] = np.nan
    >>> df
       col_a   col_b   date_col
    0    NaN  0.0001 2018-03-10
    1    2.0  0.0100 2018-03-11
    >>> with TemporaryFile('w+') as temp:
    ...     df.to_csv(temp, sep=':', na_rep='NaNa', float_format='%.2f',
    ...               date_format='%Y/%m/%d')
    ...     _ = temp.seek(0)
    ...     df_out = pd.read_csv(temp, sep=':', na_values='NaNa',
    ...                          index_col=0, parse_dates=['date_col'])
    ...     print_helper(temp)
    :col_a:col_b:date_col
    0:NaNa:0.00:2018/03/10
    1:2.00:0.01:2018/03/11

    Note the "standard" Python NaN representation "NaN"

    >>> df_out
       col_a  col_b   date_col
    0    NaN   0.00 2018-03-10
    1    2.0   0.01 2018-03-11

    Assert equality with a rounded column to match the format used

    >>> df['col_b'] = np.round(df.col_b, 2)
    >>> pd.testing.assert_frame_equal(df, df_out, check_dtype=False)
    """

m-dz · 2018-03-13T01:37:02Z

Oh, hi @TomAugspurger , didn't see your comment. I saw #20298, but thought it's for some actual improvements, not fixing the current state. After my minor change described above the output is much cleaner, I'll fix those issues and submit a PR. In the meantime, my docstring is also above.

jorisvandenbossche · 2018-03-13T09:10:12Z

@m-dz Ah, so the problem was the newline character \n in the docstring messing up. I added that to the list of issues in #20298, therefore closing this. But thanks for reporting!

For the to_csv docstring, as @TomAugspurger said, it's easier to give feedback if you open a PR, but one point: if you don't specify a path, the "csv file" is returned as a string, and can that way be printed to the output. That might be an easier alternative to the temporary file approach you're using now.
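One plausible reading of why the newline character caused trouble (an assumption about the mechanism, not a confirmed diagnosis of the script): in a regular, non-raw Python string literal, `\n` becomes a real newline, so the parameter line is split in two before any docstring parser sees it. Writing `\\n`, or using a raw docstring, keeps the backslash and the letter as two characters. A minimal sketch with made-up function names:

```python
def plain():
    """
    line_terminator : str, default '\n'
    """


def escaped():
    """
    line_terminator : str, default '\\n'
    """


def raw():
    r"""
    line_terminator : str, default '\n'
    """


def param_lines(func):
    # Non-empty, stripped lines of the function's docstring
    return [line.strip() for line in func.__doc__.splitlines() if line.strip()]


print(param_lines(plain))    # the literal newline splits the entry in two
print(param_lines(escaped))  # a single line with a backslash and an 'n'
print(param_lines(raw))      # same single line as the escaped version
```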

m-dz · 2018-03-13T09:26:32Z

Looks like that was it, or maybe the double backticks, something there. I still need to check the HTML doc, but I am having some trouble building it (I will dig more or create an issue later; in short, the --single parameter seems not to work and I am getting some errors while building).

m-dz changed the title ~~Doctring validation script issues~~ Docstring validation script issues Mar 13, 2018

jorisvandenbossche closed this as completed Mar 13, 2018

jorisvandenbossche mentioned this issue Mar 13, 2018

DOC: docstring validation script improvements #20298

Open

19 tasks
