Skip to content

Join LazyFrames

Source code

Description

This function can do both mutating joins (adding columns based on matching observations, for example with how = “left”) and filtering joins (keeping observations based on matching observations, for example with how = “inner”).

Usage

<LazyFrame>$join(
  other,
  on = NULL,
  how = c("inner", "left", "outer", "semi", "anti", "cross", "outer_coalesce"),
  ...,
  left_on = NULL,
  right_on = NULL,
  suffix = "_right",
  validate = "m:m",
  join_nulls = FALSE,
  allow_parallel = TRUE,
  force_parallel = FALSE
)

Arguments

other LazyFrame to join with.
on Either a vector of column names or a list of expressions and/or strings. Use left_on and right_on if the column names to match on are different between the two DataFrames.
how One of the following methods: "inner", "left", "outer", "semi", "anti", "cross", "outer_coalesce".
Ignored.
left_on, right_on Same as on but only for the left or the right DataFrame. They must have the same length.
suffix Suffix to add to duplicated column names.
validate Checks if join is of specified type:
  • “m:m” (default): many-to-many, doesn’t perform any checks;
  • “1:1”: one-to-one, check if join keys are unique in both left and right datasets;
  • “1:m”: one-to-many, check if join keys are unique in left dataset
  • “m:1”: many-to-one, check if join keys are unique in right dataset
Note that this is currently not supported by the streaming engine, and is only supported when joining by single columns.
join_nulls Join on null values. By default null values will never produce matches.
allow_parallel Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
force_parallel Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

Value

LazyFrame

Examples

library(polars)

# inner join by default
df1 = pl$LazyFrame(list(key = 1:3, payload = c("f", "i", NA)))
df2 = pl$LazyFrame(list(key = c(3L, 4L, 5L, NA_integer_)))
df1$join(other = df2, on = "key")
#> polars LazyFrame
#>  $describe_optimized_plan() : Show the optimized query plan.
#> 
#> Naive plan:
#> INNER JOIN:
#> LEFT PLAN ON: [col("key")]
#>   DF ["key", "payload"]; PROJECT */2 COLUMNS; SELECTION: "None"
#> RIGHT PLAN ON: [col("key")]
#>   DF ["key"]; PROJECT */1 COLUMNS; SELECTION: "None"
#> END INNER JOIN
# cross join
df1 = pl$LazyFrame(x = letters[1:3])
df2 = pl$LazyFrame(y = 1:4)
df1$join(other = df2, how = "cross")
#> polars LazyFrame
#>  $describe_optimized_plan() : Show the optimized query plan.
#> 
#> Naive plan:
#> CROSS JOIN:
#> LEFT PLAN ON: [col("x")]
#>   DF ["x"]; PROJECT */1 COLUMNS; SELECTION: "None"
#> RIGHT PLAN ON: [col("y")]
#>   DF ["y"]; PROJECT */1 COLUMNS; SELECTION: "None"
#> END CROSS JOIN
# use "validate" to ensure join keys are not duplicated
df1 = pl$LazyFrame(x = letters[1:5], y = 1:5)
df2 = pl$LazyFrame(x = c("a", letters[1:4]), y2 = 6:10)

# this throws an error because there are two keys in df2 that match the key
# in df1
tryCatch(
  df1$join(df2, on = "x", validate = "1:1")$collect(),
  error = function(e) print(e)
)
#> <RPolarsErr_error: Execution halted with the following contexts
#>    0: In R: in $collect():
#>    0: During function call [.main()]
#>    1: Encountered the following error in Rust-Polars:
#>          the join keys did not fulfil 1:1 validation
#> >