
Commit 04024fd

zhengruifeng authored and HyukjinKwon committed
[SPARK-44887][DOCS] Fix wildcard import from pyspark.sql.functions import * in Quick Start Examples
### What changes were proposed in this pull request?

Fix the wildcard import `from pyspark.sql.functions import *` in https://spark.apache.org/docs/latest/quick-start.html

### Why are the changes needed?

To follow the [PEP 8 - Style Guide for Python Code](https://peps.python.org/pep-0008/):

> Wildcard imports (`from <module> import *`) should be avoided, as they make it unclear which names are present in the namespace, confusing both readers and many automated tools. There is one defensible use case for a wildcard import, which is to republish an internal interface as part of a public API (for example, overwriting a pure Python implementation of an interface with the definitions from an optional accelerator module and exactly which definitions will be overwritten isn't known in advance). When republishing names this way, the guidelines below regarding public and internal interfaces still apply.

It also avoids potential namespace conflicts, since several SQL functions already share names with built-in modules/functions (e.g. `min`/`max`/`sum`/`hash`).

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42579 from zhengruifeng/docs_avoid_wildcard_imports.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
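As a brief, hypothetical illustration of the conflict this avoids (not part of the patch; the SparkSession `spark` and the DataFrame contents are assumed for the sketch), a wildcard import shadows Python's built-in `max`, while the `F` alias keeps both names reachable:

```python
# Minimal sketch of the namespace conflict, assuming an active SparkSession
# bound to `spark`; the DataFrame contents are made up for illustration.
from pyspark.sql import functions as F

df = spark.createDataFrame([("a b c",), ("d e",)], ["value"])

# Python's built-in max is untouched by the namespaced import.
print(max([1, 2, 3]))  # -> 3

# The SQL max/size/split are reached explicitly through the F alias.
df.agg(F.max(F.size(F.split(df.value, r"\s+")))).show()

# After `from pyspark.sql.functions import *`, the name `max` would instead
# resolve to the SQL function, which expects a column rather than a sequence.
```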
1 parent f351e7b commit 04024fd

File tree

1 file changed: +3 −3 lines changed


docs/quick-start.md (+3 −3)
@@ -130,8 +130,8 @@ Dataset actions and transformations can be used for more complex computations.
 <div data-lang="python" markdown="1">
 
 {% highlight python %}
->>> from pyspark.sql.functions import *
->>> textFile.select(size(split(textFile.value, "\s+")).name("numWords")).agg(max(col("numWords"))).collect()
+>>> from pyspark.sql import functions as F
+>>> textFile.select(F.size(F.split(textFile.value, "\s+")).name("numWords")).agg(F.max(F.col("numWords"))).collect()
 [Row(max(numWords)=15)]
 {% endhighlight %}

@@ -140,7 +140,7 @@ This first maps a line to an integer value and aliases it as "numWords", creating
 One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
 
 {% highlight python %}
->>> wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
+>>> wordCounts = textFile.select(F.explode(F.split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
 {% endhighlight %}
 
 Here, we use the `explode` function in `select`, to transform a Dataset of lines to a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of 2 columns: "word" and "count". To collect the word counts in our shell, we can call `collect`:
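For context (not part of this diff), the guide then collects the result in the shell; a minimal sketch of that step, assuming the `wordCounts` DataFrame defined above:

```python
# Collect the per-word counts back to the driver as a list of Row objects.
# The exact words and counts depend on the input file, so no sample output
# is shown here.
rows = wordCounts.collect()
for row in rows[:5]:  # inspect the first few rows
    print(row["word"], row["count"])
```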
