Scan a parquet file
Description
Lazily scan a parquet file, returning a LazyFrame. Because the scan is lazy, the query optimizer can push predicates and projections down to the reader.
Usage
pl_scan_parquet(
source,
...,
n_rows = NULL,
row_index_name = NULL,
row_index_offset = 0L,
parallel = c("auto", "columns", "row_groups", "none"),
hive_partitioning = TRUE,
rechunk = FALSE,
low_memory = FALSE,
storage_options = NULL,
use_statistics = TRUE,
cache = TRUE
)
Arguments
source
Path to a file. You can use globbing with * to scan/read multiple files in the same directory (see Examples).

...
Ignored.

n_rows
Maximum number of rows to read.

row_index_name
If not NULL, insert a row index column with the given name into the DataFrame (see the sketch after this list).

row_index_offset
Offset to start the row index column at (only used if row_index_name is set).

parallel
The direction of parallelism. "auto" will try to determine the optimal direction. One of "auto", "columns", "row_groups", or "none".

hive_partitioning
Infer statistics and schema from a Hive-partitioned URL and use them to prune reads (see Examples).

rechunk
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.

low_memory
Reduce memory usage at the expense of performance.

storage_options
Experimental. List of options necessary to scan parquet files from different cloud storage providers (GCP, AWS, Azure). See the 'Details' section.

use_statistics
Use statistics in the parquet file to determine if pages can be skipped from reading.

cache
Cache the result after reading.
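A minimal sketch of how n_rows, row_index_name, and row_index_offset combine. The temporary file and the use of $write_parquet() to set it up are illustrative assumptions, not part of this page:

library(polars)

# Setup only: write a small parquet file that we can then scan
tmp = tempfile(fileext = ".parquet")
pl$DataFrame(mtcars)$write_parquet(tmp)

# Scan lazily: add a row index named "idx" starting at 1,
# and stop after 5 rows. Nothing is read from disk until $collect().
pl$scan_parquet(
  tmp,
  n_rows = 5,
  row_index_name = "idx",
  row_index_offset = 1L
)$collect()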
Details
Connecting to cloud providers
Polars supports scanning parquet files from different cloud providers. The cloud providers currently supported are AWS, GCP, and Azure. The supported keys to pass to the storage_options argument correspond to the configuration keys of the underlying Rust object_store crate:

- aws: https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html
- gcp: https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html
- azure: https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html
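As a hedged sketch (the bucket URI and the exact key names are assumptions based on object_store's AWS configuration keys, not values documented on this page), scanning from S3 might look like:

library(polars)

# Hypothetical S3 location; credentials are read from the environment
lf = pl$scan_parquet(
  "s3://my-bucket/dataset/*.parquet",
  storage_options = list(
    aws_access_key_id = Sys.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
    aws_region = "us-east-1"
  )
)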
Implementation details

- Currently it is impossible to scan public parquet files from GCP without a valid service account. Be sure to always include a service account in the storage_options argument.
Value
LazyFrame
Examples
library(polars)
temp_dir = tempfile()
# Write a hive-style partitioned parquet dataset
arrow::write_dataset(
mtcars,
temp_dir,
partitioning = c("cyl", "gear"),
format = "parquet",
hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.parquet" "cyl=4/gear=4/part-0.parquet"
#> [3] "cyl=4/gear=5/part-0.parquet" "cyl=6/gear=3/part-0.parquet"
#> [5] "cyl=6/gear=4/part-0.parquet" "cyl=6/gear=5/part-0.parquet"
#> [7] "cyl=8/gear=3/part-0.parquet" "cyl=8/gear=5/part-0.parquet"

# Scan all files of the dataset via a recursive glob pattern, then collect.
# The hive columns cyl and gear are appended to the output.
pl$scan_parquet(file.path(temp_dir, "**/*.parquet"))$collect()
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg ┆ disp ┆ hp ┆ drat ┆ … ┆ am ┆ carb ┆ cyl ┆ gear │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ i64 ┆ i64 │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0 ┆ 3.7 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4 ┆ 3 │
#> │ 22.8 ┆ 108.0 ┆ 93.0 ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ 24.4 ┆ 146.7 ┆ 62.0 ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 22.8 ┆ 140.8 ┆ 95.0 ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 32.4 ┆ 78.7 ┆ 66.0 ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0 ┆ 8 ┆ 3 │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0 ┆ 8 ┆ 5 │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0 ┆ 8 ┆ 5 │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘
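Because hive_partitioning = TRUE exposes cyl and gear as regular columns, a filter on them lets the optimizer prune whole partition files before reading. A small sketch continuing the example above:

# Only the cyl=4/... files need to be read, thanks to predicate pushdown
pl$scan_parquet(
  file.path(temp_dir, "**/*.parquet")
)$filter(pl$col("cyl") == 4)$collect()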