
Commit 04024fd

zhengruifeng authored and HyukjinKwon committed
[SPARK-44887][DOCS] Fix wildcard import from pyspark.sql.functions import * in Quick Start Examples
### What changes were proposed in this pull request?

Fix the wildcard import `from pyspark.sql.functions import *` in https://spark.apache.org/docs/latest/quick-start.html

### Why are the changes needed?

To follow the [PEP 8 - Style Guide for Python Code](https://peps.python.org/pep-0008/):

> Wildcard imports (`from <module> import *`) should be avoided, as they make it unclear which names are present in the namespace, confusing both readers and many automated tools. There is one defensible use case for a wildcard import, which is to republish an internal interface as part of a public API (for example, overwriting a pure Python implementation of an interface with the definitions from an optional accelerator module and exactly which definitions will be overwritten isn't known in advance). When republishing names this way, the guidelines below regarding public and internal interfaces still apply.

It also avoids potential namespace conflicts, since several SQL functions already share names with built-in modules/functions (e.g. `min`/`max`/`sum`/`hash`).

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42579 from zhengruifeng/docs_avoid_wildcard_imports.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
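As a brief, hypothetical illustration of the conflict this avoids (not part of the patch; the SparkSession `spark` and the DataFrame contents are assumed for the sketch), a wildcard import shadows Python's built-in `max`, while the `F` alias keeps both names reachable:

```python
# Minimal sketch of the namespace conflict, assuming an active SparkSession
# bound to `spark`; the DataFrame contents are made up for illustration.
from pyspark.sql import functions as F

df = spark.createDataFrame([("a b c",), ("d e",)], ["value"])

# Python's built-in max is untouched by the namespaced import.
print(max([1, 2, 3]))  # -> 3

# The SQL max/size/split are reached explicitly through the F alias.
df.agg(F.max(F.size(F.split(df.value, r"\s+")))).show()

# After `from pyspark.sql.functions import *`, the name `max` would instead
# resolve to the SQL function, which expects a column rather than a sequence.
```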
1 parent f351e7b commit 04024fd

File tree

1 file changed: +3 −3 lines changed


docs/quick-start.md (+3 −3)
@@ -130,8 +130,8 @@ Dataset actions and transformations can be used for more complex computations.
 <div data-lang="python" markdown="1">
 
 {% highlight python %}
->>> from pyspark.sql.functions import *
->>> textFile.select(size(split(textFile.value, "\s+")).name("numWords")).agg(max(col("numWords"))).collect()
+>>> from pyspark.sql import functions as F
+>>> textFile.select(F.size(F.split(textFile.value, "\s+")).name("numWords")).agg(F.max(F.col("numWords"))).collect()
 [Row(max(numWords)=15)]
 {% endhighlight %}

@@ -140,7 +140,7 @@ This first maps a line to an integer value and aliases it as "numWords", creating
 One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
 
 {% highlight python %}
->>> wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
+>>> wordCounts = textFile.select(F.explode(F.split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
 {% endhighlight %}
 
 Here, we use the `explode` function in `select`, to transform a Dataset of lines to a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of 2 columns: "word" and "count". To collect the word counts in our shell, we can call `collect`:
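For context (not part of this diff), the guide then collects the result in the shell; a minimal sketch of that step, assuming the `wordCounts` DataFrame defined above:

```python
# Collect the per-word counts back to the driver as a list of Row objects.
# The exact words and counts depend on the input file, so no sample output
# is shown here.
rows = wordCounts.collect()
for row in rows[:5]:  # inspect the first few rows
    print(row["word"], row["count"])
```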
