
Commit b8740a1

itholic authored and viirya committed
[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes
### What changes were proposed in this pull request?

This PR proposes applying `black` to the pandas API on Spark codes, to improve static analysis. By executing `./dev/reformat-python` in the Spark home directory, all of the pandas API on Spark code is reformatted to satisfy the static analysis rules.

### Why are the changes needed?

This reduces the cost of static analysis during development. It has been used continuously for about a year in the Koalas project, and its convenience has been proven.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

Manually reformatted the pandas API on Spark codes by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes.

Closes #32779 from itholic/SPARK-35499.

Authored-by: itholic <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
Parent: d4e32c8


54 files changed: 428 additions, 228 deletions.
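
For context on the diffs below: `black` is an opinionated Python code formatter, and every hunk in this commit is a mechanical rewrite it produced at `--line-length 100`. A small before/after sketch of the kinds of edits involved (illustrative only, not taken from the commit):

```python
# Before formatting: single quotes, a padded one-line docstring, and a
# style that black will normalize.
class Example(object):
    """ pandas-on-Spark style example. """

    def pretty_name(self) -> str:
        return 'example'


# After `black --line-length 100`: double quotes and the docstring
# padding stripped; short wrapped constructs are also joined onto one line.
class ExampleFormatted(object):
    """pandas-on-Spark style example."""

    def pretty_name(self) -> str:
        return "example"
```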

.github/workflows/build_and_test.yml (+1, -1)

```diff
@@ -366,7 +366,7 @@ jobs:
         # See also https://github.com/sphinx-doc/sphinx/issues/7551.
         # Jinja2 3.0.0+ causes error when building with Sphinx.
         # See also https://issues.apache.org/jira/browse/SPARK-35375.
-        python3.6 -m pip install flake8 pydata_sphinx_theme mypy numpydoc 'jinja2<3.0.0'
+        python3.6 -m pip install flake8 pydata_sphinx_theme mypy numpydoc 'jinja2<3.0.0' 'black==21.5b2'
     - name: Install R linter dependencies and SparkR
       run: |
         apt-get install -y libcurl4-openssl-dev libgit2-dev libssl-dev libxml2-dev
```

dev/lint-python (+32)

```diff
@@ -25,6 +25,8 @@ MINIMUM_PYCODESTYLE="2.7.0"
 
 PYTHON_EXECUTABLE="python3"
 
+BLACK_BUILD="$PYTHON_EXECUTABLE -m black"
+
 function satisfies_min_version {
     local provided_version="$1"
     local expected_version="$2"
@@ -185,6 +187,35 @@ flake8 checks failed."
     fi
 }
 
+function black_test {
+    local BLACK_REPORT=
+    local BLACK_STATUS=
+
+    # Skip check if black is not installed.
+    $BLACK_BUILD 2> /dev/null
+    if [ $? -ne 0 ]; then
+        echo "The $BLACK_BUILD command was not found. Skipping black checks for now."
+        echo
+        return
+    fi
+
+    echo "starting black test..."
+    # Black is only applied for pandas API on Spark for now.
+    BLACK_REPORT=$( ($BLACK_BUILD python/pyspark/pandas --line-length 100 --check ) 2>&1)
+    BLACK_STATUS=$?
+
+    if [ "$BLACK_STATUS" -ne 0 ]; then
+        echo "black checks failed:"
+        echo "$BLACK_REPORT"
+        echo "Please run 'dev/reformat-python' script."
+        echo "$BLACK_STATUS"
+        exit "$BLACK_STATUS"
+    else
+        echo "black checks passed."
+        echo
+    fi
+}
+
 SCRIPT_DIR="$( cd "$( dirname "$0" )" && pwd )"
 SPARK_ROOT_DIR="$(dirname "${SCRIPT_DIR}")"
 
@@ -194,6 +225,7 @@ pushd "$SPARK_ROOT_DIR" &> /dev/null
 PYTHON_SOURCE="$(find . -path ./docs/.local_ruby_bundle -prune -false -o -name "*.py")"
 
 compile_python_test "$PYTHON_SOURCE"
+black_test
 pycodestyle_test "$PYTHON_SOURCE"
 flake8_test
 mypy_test
```
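
For readers who prefer Python to shell, here is a minimal sketch of what the new `black_test` function does, using `subprocess` instead of shell plumbing. The target path and line length mirror the diff above; the wrapper function itself is illustrative, not part of the commit:

```python
import subprocess
import sys


def black_check(target="python/pyspark/pandas", line_length=100):
    """Mirror dev/lint-python's black_test: check formatting without rewriting files."""
    try:
        import black  # noqa: F401  # skip the check entirely if black is missing
    except ImportError:
        print("The black command was not found. Skipping black checks for now.")
        return 0
    # --check makes black exit non-zero when any file would be reformatted.
    result = subprocess.run(
        [sys.executable, "-m", "black", target, "--line-length", str(line_length), "--check"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print("black checks failed:")
        print(result.stdout + result.stderr)
        print("Please run 'dev/reformat-python' script.")
    else:
        print("black checks passed.")
    return result.returncode


if __name__ == "__main__":
    sys.exit(black_check())
```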

dev/reformat-python (+32, new file)

```diff
@@ -0,0 +1,32 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# The current directory of the script.
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+FWDIR="$( cd "$DIR"/.. && pwd )"
+cd "$FWDIR"
+
+BLACK_BUILD="python -m black"
+BLACK_VERSION="21.5b2"
+$BLACK_BUILD 2> /dev/null
+if [ $? -ne 0 ]; then
+  echo "The '$BLACK_BUILD' command was not found. Please install Black, for example, via 'pip install black==$BLACK_VERSION'."
+  exit 1
+fi
+
+# This script is only applied for pandas API on Spark for now.
+$BLACK_BUILD python/pyspark/pandas --line-length 100
```
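
Run from the repository root, this script rewrites files in place. The same operation can be sketched through black's Python API — `black.format_str` and `black.Mode` exist in `black==21.5b2`, though the command line is the supported entry point, so treat this as an illustration:

```python
import black

source = 'def f( x ):\n    return { "a":x }\n'
# Mode(line_length=100) matches the --line-length 100 flag used by the script.
formatted = black.format_str(source, mode=black.Mode(line_length=100))
print(formatted)
# def f(x):
#     return {"a": x}
```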

dev/requirements.txt (+3)

```diff
@@ -32,3 +32,6 @@ sphinx-plotly-directive
 # Development scripts
 jira
 PyGithub
+
+# pandas API on Spark Code formatter.
+black
```

dev/tox.ini (+3, -1)

```diff
@@ -14,11 +14,13 @@
 # limitations under the License.
 
 [pycodestyle]
-ignore=E226,E241,E305,E402,E722,E731,E741,W503,W504
+ignore=E203,E226,E241,E305,E402,E722,E731,E741,W503,W504
 max-line-length=100
 exclude=*/target/*,python/pyspark/cloudpickle/*.py,shared.py,python/docs/source/conf.py,work/*/*.py,python/.eggs/*,dist/*,.git/*
 
 [flake8]
 select = E901,E999,F821,F822,F823,F401,F405,B006
+# Ignore F821 for plot documents in pandas API on Spark.
+ignore = F821
 exclude = python/docs/build/html/*,*/target/*,python/pyspark/cloudpickle/*.py,shared.py*,python/docs/source/conf.py,work/*/*.py,python/.eggs/*,dist/*,.git/*,python/out,python/pyspark/sql/pandas/functions.pyi,python/pyspark/sql/column.pyi,python/pyspark/worker.pyi,python/pyspark/java_gateway.pyi
 max-line-length = 100
```
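
E203 ("whitespace before punctuation") is newly ignored because it is the one pycodestyle rule black knowingly violates: black treats the colon in a slice as a binary operator and pads it with spaces when a bound is a complex expression. A hypothetical example of black-formatted code that E203 would otherwise flag:

```python
items = list(range(10))
offset = 2
# black formats complex slice bounds with spaces around the colon,
# which pycodestyle reports as E203 unless the rule is ignored.
middle = items[offset + 1 : -1]
print(middle)  # [3, 4, 5, 6, 7, 8]
```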

python/pyspark/pandas/accessors.py (+2, -2)

```diff
@@ -48,7 +48,7 @@
 
 
 class PandasOnSparkFrameMethods(object):
-    """ pandas-on-Spark specific features for DataFrame. """
+    """pandas-on-Spark specific features for DataFrame."""
 
     def __init__(self, frame: "DataFrame"):
         self._psdf = frame
@@ -696,7 +696,7 @@ def pandas_frame_func(f, field_name):
 
 
 class PandasOnSparkSeriesMethods(object):
-    """ pandas-on-Spark specific features for Series. """
+    """pandas-on-Spark specific features for Series."""
 
     def __init__(self, series: "Series"):
         self._psser = series
```
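
Both hunks in this file are black's docstring normalization: the padding spaces just inside the triple quotes of a one-line docstring are stripped, while the text itself is untouched. A standalone illustration with hypothetical class names:

```python
class BeforeBlack(object):
    """ pandas-on-Spark specific features, with padding spaces. """


class AfterBlack(object):
    """pandas-on-Spark specific features, padding stripped by black."""
```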

python/pyspark/pandas/base.py (+1, -3)

```diff
@@ -1068,9 +1068,7 @@ def notnull(self) -> Union["Series", "Index"]:
 
         if isinstance(self, MultiIndex):
             raise NotImplementedError("notna is not defined for MultiIndex")
-        return (~self.isnull()).rename(
-            self.name  # type: ignore
-        )
+        return (~self.isnull()).rename(self.name)  # type: ignore
 
     notna = notnull
```

python/pyspark/pandas/config.py (+1, -1)

```diff
@@ -381,7 +381,7 @@ def _check_option(key: str) -> None:
 
 
 class DictWrapper:
-    """ provide attribute-style access to a nested dict"""
+    """provide attribute-style access to a nested dict"""
 
     def __init__(self, d: Dict[str, Option], prefix: str = ""):
         object.__setattr__(self, "d", d)
```

python/pyspark/pandas/data_type_ops/base.py (+3, -6)

```diff
@@ -45,11 +45,7 @@
     from pyspark.pandas.series import Series  # noqa: F401 (SPARK-34943)
 
 
-def is_valid_operand_for_numeric_arithmetic(
-    operand: Any,
-    *,
-    allow_bool: bool = True
-) -> bool:
+def is_valid_operand_for_numeric_arithmetic(operand: Any, *, allow_bool: bool = True) -> bool:
     """Check whether the operand is valid for arithmetic operations against numerics."""
     if isinstance(operand, numbers.Number) and not isinstance(operand, bool):
         return True
@@ -58,7 +54,8 @@ def is_valid_operand_for_numeric_arithmetic(
             return False
         else:
             return isinstance(operand.spark.data_type, NumericType) or (
-                allow_bool and isinstance(operand.spark.data_type, BooleanType))
+                allow_bool and isinstance(operand.spark.data_type, BooleanType)
+            )
     else:
         return False
```
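
These two hunks show black's complementary wrapping rules: a construct that fits within the 100-character limit is collapsed onto a single line (the signature), while one that does not fit keeps the wrap with the closing bracket dedented onto its own line (the boolean expression). A hypothetical sketch of both outcomes:

```python
# Collapsed: this signature fits within 100 characters, so black joins the
# previously wrapped parameter list onto one line.
def is_valid_operand(operand: object, *, allow_bool: bool = True) -> bool:
    # Exploded: a wrapped expression that cannot be inlined keeps its wrap,
    # but black dedents the closing parenthesis onto its own line.
    return isinstance(operand, (int, float)) and not isinstance(operand, bool) or (
        allow_bool and isinstance(operand, bool)
    )
```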

python/pyspark/pandas/data_type_ops/binary_ops.py (+5, -3)

```diff
@@ -34,7 +34,7 @@ class BinaryOps(DataTypeOps):
 
     @property
     def pretty_name(self) -> str:
-        return 'binaries'
+        return "binaries"
 
     def add(self, left, right) -> Union["Series", "Index"]:
         if isinstance(right, IndexOpsMixin) and isinstance(right.spark.data_type, BinaryType):
@@ -43,11 +43,13 @@ def add(self, left, right) -> Union["Series", "Index"]:
             return column_op(F.concat)(left, F.lit(right))
         else:
             raise TypeError(
-                "Concatenation can not be applied to %s and the given type." % self.pretty_name)
+                "Concatenation can not be applied to %s and the given type." % self.pretty_name
+            )
 
     def radd(self, left, right) -> Union["Series", "Index"]:
         if isinstance(right, bytes):
             return left._with_new_scol(F.concat(F.lit(right), left.spark.column))
         else:
             raise TypeError(
-                "Concatenation can not be applied to %s and the given type." % self.pretty_name)
+                "Concatenation can not be applied to %s and the given type." % self.pretty_name
+            )
```

python/pyspark/pandas/data_type_ops/boolean_ops.py (+29, -15)

```diff
@@ -38,12 +38,13 @@ class BooleanOps(DataTypeOps):
 
     @property
     def pretty_name(self) -> str:
-        return 'booleans'
+        return "booleans"
 
     def add(self, left, right) -> Union["Series", "Index"]:
         if not is_valid_operand_for_numeric_arithmetic(right, allow_bool=False):
             raise TypeError(
-                "Addition can not be applied to %s and the given type." % self.pretty_name)
+                "Addition can not be applied to %s and the given type." % self.pretty_name
+            )
 
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
@@ -56,7 +57,8 @@ def add(self, left, right) -> Union["Series", "Index"]:
     def sub(self, left, right) -> Union["Series", "Index"]:
         if not is_valid_operand_for_numeric_arithmetic(right, allow_bool=False):
             raise TypeError(
-                "Subtraction can not be applied to %s and the given type." % self.pretty_name)
+                "Subtraction can not be applied to %s and the given type." % self.pretty_name
+            )
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return left - right
@@ -68,7 +70,8 @@ def sub(self, left, right) -> Union["Series", "Index"]:
     def mul(self, left, right) -> Union["Series", "Index"]:
         if not is_valid_operand_for_numeric_arithmetic(right, allow_bool=False):
             raise TypeError(
-                "Multiplication can not be applied to %s and the given type." % self.pretty_name)
+                "Multiplication can not be applied to %s and the given type." % self.pretty_name
+            )
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return left * right
@@ -80,7 +83,8 @@ def mul(self, left, right) -> Union["Series", "Index"]:
     def truediv(self, left, right) -> Union["Series", "Index"]:
         if not is_valid_operand_for_numeric_arithmetic(right, allow_bool=False):
             raise TypeError(
-                "True division can not be applied to %s and the given type." % self.pretty_name)
+                "True division can not be applied to %s and the given type." % self.pretty_name
+            )
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return left / right
@@ -92,7 +96,8 @@ def truediv(self, left, right) -> Union["Series", "Index"]:
     def floordiv(self, left, right) -> Union["Series", "Index"]:
         if not is_valid_operand_for_numeric_arithmetic(right, allow_bool=False):
             raise TypeError(
-                "Floor division can not be applied to %s and the given type." % self.pretty_name)
+                "Floor division can not be applied to %s and the given type." % self.pretty_name
+            )
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return left // right
@@ -104,7 +109,8 @@ def floordiv(self, left, right) -> Union["Series", "Index"]:
     def mod(self, left, right) -> Union["Series", "Index"]:
         if not is_valid_operand_for_numeric_arithmetic(right, allow_bool=False):
             raise TypeError(
-                "Modulo can not be applied to %s and the given type." % self.pretty_name)
+                "Modulo can not be applied to %s and the given type." % self.pretty_name
+            )
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return left % right
@@ -116,7 +122,8 @@ def mod(self, left, right) -> Union["Series", "Index"]:
     def pow(self, left, right) -> Union["Series", "Index"]:
         if not is_valid_operand_for_numeric_arithmetic(right, allow_bool=False):
             raise TypeError(
-                "Exponentiation can not be applied to %s and the given type." % self.pretty_name)
+                "Exponentiation can not be applied to %s and the given type." % self.pretty_name
+            )
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return left ** right
@@ -131,52 +138,59 @@ def radd(self, left, right) -> Union["Series", "Index"]:
             return right + left
         else:
             raise TypeError(
-                "Addition can not be applied to %s and the given type." % self.pretty_name)
+                "Addition can not be applied to %s and the given type." % self.pretty_name
+            )
 
     def rsub(self, left, right) -> Union["Series", "Index"]:
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return right - left
         else:
             raise TypeError(
-                "Subtraction can not be applied to %s and the given type." % self.pretty_name)
+                "Subtraction can not be applied to %s and the given type." % self.pretty_name
+            )
 
     def rmul(self, left, right) -> Union["Series", "Index"]:
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return right * left
         else:
             raise TypeError(
-                "Multiplication can not be applied to %s and the given type." % self.pretty_name)
+                "Multiplication can not be applied to %s and the given type." % self.pretty_name
+            )
 
     def rtruediv(self, left, right) -> Union["Series", "Index"]:
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return right / left
         else:
             raise TypeError(
-                "True division can not be applied to %s and the given type." % self.pretty_name)
+                "True division can not be applied to %s and the given type." % self.pretty_name
+            )
 
     def rfloordiv(self, left, right) -> Union["Series", "Index"]:
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return right // left
         else:
             raise TypeError(
-                "Floor division can not be applied to %s and the given type." % self.pretty_name)
+                "Floor division can not be applied to %s and the given type." % self.pretty_name
+            )
 
     def rpow(self, left, right) -> Union["Series", "Index"]:
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return right ** left
         else:
             raise TypeError(
-                "Exponentiation can not be applied to %s and the given type." % self.pretty_name)
+                "Exponentiation can not be applied to %s and the given type." % self.pretty_name
+            )
 
     def rmod(self, left, right) -> Union["Series", "Index"]:
         if isinstance(right, numbers.Number) and not isinstance(right, bool):
             left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
             return right % left
         else:
             raise TypeError(
-                "Modulo can not be applied to %s and the given type." % self.pretty_name)
+                "Modulo can not be applied to %s and the given type." % self.pretty_name
+            )
```

python/pyspark/pandas/data_type_ops/categorical_ops.py (+1, -1)

```diff
@@ -25,4 +25,4 @@ class CategoricalOps(DataTypeOps):
 
     @property
     def pretty_name(self) -> str:
-        return 'categoricals'
+        return "categoricals"
```

python/pyspark/pandas/data_type_ops/complex_ops.py (+9, -6)

```diff
@@ -34,22 +34,25 @@ class ArrayOps(DataTypeOps):
 
     @property
     def pretty_name(self) -> str:
-        return 'arrays'
+        return "arrays"
 
     def add(self, left, right) -> Union["Series", "Index"]:
         if not isinstance(right, IndexOpsMixin) or (
             isinstance(right, IndexOpsMixin) and not isinstance(right.spark.data_type, ArrayType)
         ):
             raise TypeError(
-                "Concatenation can not be applied to %s and the given type." % self.pretty_name)
+                "Concatenation can not be applied to %s and the given type." % self.pretty_name
+            )
 
         left_type = left.spark.data_type.elementType
         right_type = right.spark.data_type.elementType
 
         if left_type != right_type and not (
-            isinstance(left_type, NumericType) and isinstance(right_type, NumericType)):
+            isinstance(left_type, NumericType) and isinstance(right_type, NumericType)
+        ):
             raise TypeError(
-                "Concatenation can only be applied to %s of the same type" % self.pretty_name)
+                "Concatenation can only be applied to %s of the same type" % self.pretty_name
+            )
 
         return column_op(F.concat)(left, right)
 
@@ -61,7 +64,7 @@ class MapOps(DataTypeOps):
 
     @property
     def pretty_name(self) -> str:
-        return 'maps'
+        return "maps"
 
 
 class StructOps(DataTypeOps):
@@ -71,4 +74,4 @@ class StructOps(DataTypeOps):
 
     @property
     def pretty_name(self) -> str:
-        return 'structs'
+        return "structs"
```
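
The `pretty_name` hunks across these `data_type_ops` files are black's string-quote normalization: literals are rewritten with double quotes unless that would introduce extra escaping. A standalone illustration with hypothetical values:

```python
name = 'structs'  # black rewrites this to: name = "structs"
quote = 'a "quoted" word'  # left alone: double quotes would force escapes
print(name, quote)
```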
