Map an expression with an R function

Description

Map an expression with an R function

Usage

<Expr>$map_batches(
  f,
  output_type = NULL,
  agg_list = FALSE,
  in_background = FALSE
)

Arguments

f a function to map with
output_type NULL or a type available in names(pl$dtypes). If NULL (default), the output datatype will match the input datatype. This informs the schema of the actual return type of the R function. Setting it incorrectly could have downstream implications for the query.
agg_list Logical. Aggregate list: map from vector to group in a group_by context.
in_background Logical. Whether to execute the map in a background R process. Combined with setting e.g. options(polars.rpool_cap = 4), it can speed up some slow R functions, as they can run in parallel R sessions. Communication between processes is considerably slower than between threads, so this will likely only give a speed-up in a "low IO - high CPU" use case. If there are multiple $map_batches(in_background = TRUE) calls in the query, they will be run in parallel.
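
As a minimal sketch of how output_type relates to the function's return type: the function below returns a logical vector for a numeric input column, so the output type is declared as Boolean. The column choice and the threshold of 5 are illustrative assumptions, and pl$dtypes$Boolean is assumed to be among names(pl$dtypes).

```r
library(polars)

# The input column is numeric, but the R function returns a logical vector,
# so the declared output_type (Boolean) differs from the input type.
is_long <- pl$DataFrame(iris)$select(
  pl$col("Sepal.Length")$map_batches(
    \(s) s$to_vector() > 5, # illustrative threshold, not from the docs
    output_type = pl$dtypes$Boolean
  )
)
is_long
```

Leaving output_type = NULL here would tell the schema to expect the input type, which is the kind of mismatch the argument description warns about.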

Details

It is sometimes necessary to apply a specific R function to one or several columns. However, note that using R code in $map_batches() is slower than native polars. The user function must take one polars Series as input and return a Series or any Robj convertible into a Series (e.g. vectors). $map_batches() fully supports browser().
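
The two accepted return forms can be sketched side by side. This is an illustration on a made-up one-column DataFrame (the column name x and its values are assumptions): the first function returns a Series, the second returns a plain R vector that is converted back into a Series.

```r
library(polars)

df <- pl$DataFrame(x = c(1, 2, 3))

# Return a polars Series: arithmetic on the input Series yields a Series.
as_series <- df$select(pl$col("x")$map_batches(\(s) s * 2))

# Return a plain R vector: it is converted back into a Series.
as_vector <- df$select(pl$col("x")$map_batches(\(s) s$to_vector() * 2))
```

Both calls keep the default output_type = NULL, which fits here because the result has the same type as the input column.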

If in_background = FALSE, the function can access any global variable of the R session. However, note that several calls to $map_batches() will sequentially share the same main R session, so the global environment might change between the start of the query and the moment a $map_batches() call is evaluated. Any native polars computations can still be executed meanwhile. If in_background = TRUE, the map will run in one or more other R sessions and will not have access to global variables. Use options(polars.rpool_cap = 4) and polars_options()$rpool_cap to set and view the number of parallel R sessions.
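
The difference in variable access can be sketched as follows. The variable scale_factor and the column a are hypothetical names chosen for illustration. In the background variant, the value is defined inside the function, since the background session does not see the main session's globals.

```r
library(polars)

scale_factor <- 10 # global variable in the main R session (hypothetical)

# in_background = FALSE (default): the function runs in the main session
# and can read scale_factor from the global environment.
out_main <- pl$LazyFrame(a = c(1, 2, 3))$select(
  pl$col("a")$map_batches(\(s) s * scale_factor)
)$collect()

# in_background = TRUE: the function runs in another R session, so anything
# it needs is defined inside the function body instead.
out_bg <- pl$LazyFrame(a = c(1, 2, 3))$select(
  pl$col("a")$map_batches(\(s) {
    local_factor <- 10
    s * local_factor
  }, in_background = TRUE)
)$collect()
```

Keeping the function self-contained also avoids depending on global state that may change while the query runs in the main session.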

Value

Expr

Examples

library(polars)

pl$DataFrame(iris)$
  select(
  pl$col("Sepal.Length")$map_batches(\(x) {
    paste("cheese", as.character(x$to_vector()))
  }, pl$dtypes$String)
)
#> shape: (150, 1)
#> ┌──────────────┐
#> │ Sepal.Length │
#> │ ---          │
#> │ str          │
#> ╞══════════════╡
#> │ cheese 5.1   │
#> │ cheese 4.9   │
#> │ cheese 4.7   │
#> │ cheese 4.6   │
#> │ cheese 5     │
#> │ …            │
#> │ cheese 6.7   │
#> │ cheese 6.3   │
#> │ cheese 6.5   │
#> │ cheese 6.2   │
#> │ cheese 5.9   │
#> └──────────────┘
# R parallel process example, use Sys.sleep() to imitate some CPU expensive
# computation.

# map a,b,c,d sequentially
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  })
)$collect() |> system.time()
#>    user  system elapsed 
#>   0.023   0.004   0.427
# map in parallel 1: Overhead to start up extra R processes / sessions
options(polars.rpool_cap = 0) # drop any previous processes, just to show start-up overhead
options(polars.rpool_cap = 4) # set back to 4, the default
polars_options()$rpool_cap
#> [1] 4
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  }, in_background = TRUE)
)$collect() |> system.time()
#>    user  system elapsed 
#>   0.008   0.004   1.026
# map in parallel 2: Reuse R processes in "polars global_rpool".
polars_options()$rpool_cap
#> [1] 4
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  }, in_background = TRUE)
)$collect() |> system.time()
#>    user  system elapsed 
#>   0.009   0.001   0.133