Commit 4634c89

yutannihilation authored and nealrichardson committed
ARROW-7045: [R] Preserve factor in Parquet roundtrip
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`.

The R binding is already prepared for this, but it calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. I don't see the motivation for implementing it that way in #5451. Considering [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
1 parent fb3006a commit 4634c89

5 files changed: 20 additions and 35 deletions

r/R/arrowExports.R

Lines changed: 0 additions & 4 deletions
Generated file; diff not rendered by default.

r/R/parquet.R

Lines changed: 6 additions & 10 deletions
@@ -205,16 +205,12 @@ ParquetArrowWriterPropertiesBuilder <- R6Class("ParquetArrowWriterPropertiesBuil
 ParquetArrowWriterProperties <- R6Class("ParquetArrowWriterProperties", inherit = Object)
 
 ParquetArrowWriterProperties$create <- function(use_deprecated_int96_timestamps = FALSE, coerce_timestamps = NULL, allow_truncated_timestamps = FALSE) {
-  if (!use_deprecated_int96_timestamps && is.null(coerce_timestamps) && !allow_truncated_timestamps) {
-    shared_ptr(ParquetArrowWriterProperties, parquet___default_arrow_writer_properties())
-  } else {
-    builder <- shared_ptr(ParquetArrowWriterPropertiesBuilder, parquet___ArrowWriterProperties___Builder__create())
-    builder$store_schema()
-    builder$set_int96_support(use_deprecated_int96_timestamps)
-    builder$set_coerce_timestamps(coerce_timestamps)
-    builder$set_allow_truncated_timestamps(allow_truncated_timestamps)
-    shared_ptr(ParquetArrowWriterProperties, parquet___ArrowWriterProperties___Builder__build(builder))
-  }
+  builder <- shared_ptr(ParquetArrowWriterPropertiesBuilder, parquet___ArrowWriterProperties___Builder__create())
+  builder$store_schema()
+  builder$set_int96_support(use_deprecated_int96_timestamps)
+  builder$set_coerce_timestamps(coerce_timestamps)
+  builder$set_allow_truncated_timestamps(allow_truncated_timestamps)
+  shared_ptr(ParquetArrowWriterProperties, parquet___ArrowWriterProperties___Builder__build(builder))
 }
 
 valid_parquet_version <- c(

r/src/arrowExports.cpp

Lines changed: 0 additions & 15 deletions
Generated file; diff not rendered by default.

r/src/parquet.cpp

Lines changed: 0 additions & 6 deletions
@@ -82,12 +82,6 @@ std::shared_ptr<arrow::Table> parquet___arrow___FileReader__ReadTable2(
   return table;
 }
 
-// [[arrow::export]]
-std::shared_ptr<parquet::ArrowWriterProperties>
-parquet___default_arrow_writer_properties() {
-  return parquet::default_arrow_writer_properties();
-}
-
 // [[arrow::export]]
 std::shared_ptr<parquet::ArrowWriterProperties::Builder>
 parquet___ArrowWriterProperties___Builder__create() {

r/tests/testthat/test-parquet.R

Lines changed: 14 additions & 0 deletions
@@ -106,3 +106,17 @@ test_that("write_parquet() defaults to snappy compression", {
   write_parquet(mtcars, tmp2, compression = "snappy")
   expect_equal(file.size(tmp1), file.size(tmp2))
 })
+
+test_that("Factors are preserved when writing/reading from Parquet", {
+  fct <- factor(c("a", "b"), levels = c("c", "a", "b"))
+  ord <- factor(c("a", "b"), levels = c("c", "a", "b"), ordered = TRUE)
+  chr <- c("a", "b")
+  df <- tibble::tibble(fct = fct, ord = ord, chr = chr)
+
+  pq_tmp_file <- tempfile()
+  on.exit(unlink(pq_tmp_file))
+
+  write_parquet(df, pq_tmp_file)
+  df_read <- read_parquet(pq_tmp_file)
+  expect_identical(df, df_read)
+})
