Commit 67b6f0e

wayneguow authored and srowen committed
[SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource
### What changes were proposed in this pull request?

Pass the comment option through to univocity if users set it explicitly in the CSV data source.

### Why are the changes needed?

In #29516, univocity-parsers was upgraded from 2.8.3 to 2.9.0 to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values in the first column when they start with the comment character. This was a breaking change for downstream users who handle a whole row as input.

Before the upgrade:

    #abc,1

After the upgrade:

    "#abc",1

This PR changes the `isCommentSet` check so that users can keep the previous behavior.

### Does this PR introduce _any_ user-facing change?

Yes, a small one. Users who explicitly set the comment option to '\u0000' should now remove it to keep the comment option unset.

### How was this patch tested?

Added a new test.

Closes #39878 from wayneguow/comment.

Authored-by: wayneguow <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
1 parent b11fba0 commit 67b6f0e

2 files changed: 51 additions, 1 deletion

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala

Lines changed: 4 additions & 1 deletion
@@ -222,7 +222,10 @@ class CSVOptions(
    */
   val maxErrorContentLength = 1000
 
-  val isCommentSet = this.comment != '\u0000'
+  val isCommentSet = parameters.get(COMMENT) match {
+    case Some(value) if value.length == 1 => true
+    case _ => false
+  }
 
   val samplingRatio =
     parameters.get(SAMPLING_RATIO).map(_.toDouble).getOrElse(1.0)
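
The effect of the new check can be sketched outside Spark. The following is a minimal, standalone illustration and not the Spark source: `CommentOptionSketch`, `isCommentSetOld`, and `isCommentSetNew` are hypothetical names, a plain `Map` stands in for Spark's option map, and the way the old path resolves the comment character is a simplifying assumption.

```scala
// Standalone sketch, no Spark dependency. All names here are illustrative.
object CommentOptionSketch {
  val COMMENT = "comment"

  // Old behavior (simplified assumption): resolve the option to a character,
  // defaulting to '\u0000', then compare against that default. An explicit
  // "\u0000" is indistinguishable from an unset option.
  def isCommentSetOld(parameters: Map[String, String]): Boolean = {
    val comment = parameters
      .get(COMMENT)
      .filter(_.length == 1)
      .map(_.charAt(0))
      .getOrElse('\u0000')
    comment != '\u0000'
  }

  // New behavior, mirroring the diff: the option counts as set whenever the
  // user supplied any single-character value, including "\u0000".
  def isCommentSetNew(parameters: Map[String, String]): Boolean =
    parameters.get(COMMENT) match {
      case Some(value) if value.length == 1 => true
      case _ => false
    }
}
```

Under these assumptions, the two checks agree when the option is absent (both `false`) or set to `"#"` (both `true`). They differ only for an explicit `"\u0000"`: the old check returns `false`, while the new one returns `true`, so the value is passed through to univocity.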

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

Lines changed: 47 additions & 0 deletions
@@ -3101,6 +3101,53 @@ abstract class CSVSuite
     }
   }
 
+  test("SPARK-42335: Pass the comment option through to univocity " +
+    "if users set it explicitly in CSV dataSource") {
+    withTempPath { path =>
+      Seq("#abc", "\u0000def", "xyz").toDF()
+        .write.option("comment", "\u0000").csv(path.getCanonicalPath)
+      checkAnswer(
+        spark.read.text(path.getCanonicalPath),
+        Seq(Row("#abc"), Row("\"def\""), Row("xyz"))
+      )
+    }
+    withTempPath { path =>
+      Seq("#abc", "\u0000def", "xyz").toDF()
+        .write.option("comment", "#").csv(path.getCanonicalPath)
+      checkAnswer(
+        spark.read.text(path.getCanonicalPath),
+        Seq(Row("\"#abc\""), Row("def"), Row("xyz"))
+      )
+    }
+    withTempPath { path =>
+      Seq("#abc", "\u0000def", "xyz").toDF()
+        .write.csv(path.getCanonicalPath)
+      checkAnswer(
+        spark.read.text(path.getCanonicalPath),
+        Seq(Row("\"#abc\""), Row("def"), Row("xyz"))
+      )
+    }
+    withTempPath { path =>
+      Seq("#abc", "\u0000def", "xyz").toDF().write.text(path.getCanonicalPath)
+      checkAnswer(
+        spark.read.option("comment", "\u0000").csv(path.getCanonicalPath),
+        Seq(Row("#abc"), Row("xyz")))
+    }
+    withTempPath { path =>
+      Seq("#abc", "\u0000def", "xyz").toDF().write.text(path.getCanonicalPath)
+      checkAnswer(
+        spark.read.option("comment", "#").csv(path.getCanonicalPath),
+        Seq(Row("\u0000def"), Row("xyz")))
+    }
+    withTempPath { path =>
+      Seq("#abc", "\u0000def", "xyz").toDF().write.text(path.getCanonicalPath)
+      checkAnswer(
+        spark.read.csv(path.getCanonicalPath),
+        Seq(Row("#abc"), Row("\u0000def"), Row("xyz"))
+      )
+    }
+  }
+
   test("SPARK-40667: validate CSV Options") {
     assert(CSVOptions.getAllOptions.size == 38)
     // Please add validation on any new CSV options here
