Inner workings of the DataFrame-class
Description
The DataFrame class is simply two environments holding, respectively, the
public and private methods (function calls into the polars Rust side). An
instantiated DataFrame object is an externalptr to a low-level Rust polars
DataFrame object. The S3 method .DollarNames.RPolarsDataFrame exposes all
public $foobar() methods that can be called on the object. Most methods
return another DataFrame class instance or similar, which allows for
method chaining. This class system could be called "environment classes"
(for lack of a better name) and is the same class system extendr provides,
except that here there is both a public and a private set of methods. For
implementation reasons, the private methods are external and must be
called as .pr$DataFrame$methodname(). Also, all private methods must take
self as an explicit argument, so they are pure functions. Having the
private methods as pure functions solved or simplified self-referential
complications.
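The "environment class" idea described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the polars implementation: `ToyFrame`, `new_toy_frame`, `toy_public` and `toy_private` are hypothetical names, and a plain list stands in for the external pointer.

```r
# Stand-in for the external pointer: a list of columns tagged with a class.
new_toy_frame = function(columns) {
  structure(list(columns = columns), class = "ToyFrame")
}

# Private methods: functions that always take `self` explicitly.
toy_private = new.env()
toy_private$n_rows = function(self) length(unclass(self)$columns[[1L]])

# Public methods: written with a free variable `self`, bound by `$` below.
toy_public = new.env()
toy_public$height = function() toy_private$n_rows(self)
toy_public$head = function(n = 2L) {
  new_toy_frame(lapply(unclass(self)$columns, utils::head, n))
}

# `$` looks up a public method and rebinds its enclosing environment to
# this call frame, so that `self` is visible inside the method.
`$.ToyFrame` = function(self, name) {
  method = get0(name, envir = toy_public, inherits = FALSE)
  if (is.null(method)) stop("no public method named ", name)
  environment(method) = environment()
  method
}

# Completion helper: lists the public method names.
.DollarNames.ToyFrame = function(x, pattern = "") {
  grep(pattern, ls(toy_public), value = TRUE)
}

tf = new_toy_frame(list(a = 1:5, b = letters[1:5]))
chained_height = tf$head(2)$height()  # public methods chain
private_height = toy_private$n_rows(tf)  # private call passes `self`
```

Because `head` returns another `ToyFrame`, calls can be chained, while the private function is called with the object passed in explicitly, mirroring the `.pr$DataFrame$methodname()` convention.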
Details
Check out the source code in R/dataframe_frame.R to see how public methods
are derived from private methods. Check out extendr-wrappers.R to see the
extendr auto-generated methods. These are moved to .pr and converted into
pure external functions in after-wrappers.R. In zzz.R (named zzz so that
it is the last file sourced), the extendr methods are removed and replaced
by any function prefixed DataFrame_.
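The prefix-based replacement step can be illustrated with a small base R sketch. It is an assumption about the general technique, not a copy of what zzz.R does: `collect_prefixed`, the `ToyFrame_` prefix and the `definitions` environment are all hypothetical, with `definitions` standing in for the package namespace.

```r
# Stand-in for a package namespace holding prefixed method definitions.
definitions = new.env()
definitions$ToyFrame_height = function(self) length(self[[1L]])
definitions$ToyFrame_width = function(self) length(self)
definitions$unrelated_helper = function() NULL

# Collect every object whose name starts with `prefix`, strip the prefix,
# and store the results in a fresh method environment.
collect_prefixed = function(prefix, from) {
  pattern = paste0("^", prefix)
  fn_names = ls(from, pattern = pattern)
  methods = mget(fn_names, envir = from)
  names(methods) = sub(pattern, "", fn_names)
  list2env(methods, envir = new.env())
}

toy_methods = collect_prefixed("ToyFrame_", definitions)
method_names = ls(toy_methods)
n_rows = toy_methods$height(list(a = 1:4, b = 4:1))
```

Functions without the prefix, such as `unrelated_helper`, are left out of the resulting method environment.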
Active bindings
columns
$columns returns a character vector with the column names.
dtypes
$dtypes returns an unnamed list with the data type of each column.
flags
$flags returns a nested list with column names at the top level and
column flags in each sublist.
Flags are used internally to avoid doing unnecessary computations, such
as sorting a variable that we know is already sorted. The number of
flags varies depending on the column type: columns of type array and list
have the flags SORTED_ASC, SORTED_DESC, and FAST_EXPLODE, while other
column types only have the former two.
- SORTED_ASC is set to TRUE when we sort a column in increasing order, so
  that we can use this information later on to avoid re-sorting it.
- SORTED_DESC is similar but applies to sorting in decreasing order.
height
$height returns the number of rows in the DataFrame.
schema
$schema returns a named list with the data type of each column.
shape
$shape returns a numeric vector of length two with the number of rows and
the number of columns.
width
$width returns the number of columns in the DataFrame.
Conversion to R data types considerations
When converting Polars objects, such as DataFrames, to R objects, for
example via the as.data.frame() generic function, each type in the Polars
object is converted to an R type. In some cases, an error may occur
because the conversion is not appropriate. In particular, an error is
likely when converting a Datetime type without a time zone. A Datetime
type without a time zone in Polars is converted to the POSIXct type in R,
which takes into account the time zone in which the R session is running
(which can be checked with the Sys.timezone() function). In this case, if
ambiguous or non-existent times are included, a conversion error will
occur. In such cases, change the session time zone using
Sys.setenv(TZ = "UTC") and then perform the conversion, or use the
$dt$replace_time_zone() method on the Datetime type column to explicitly
specify the time zone before conversion.
# Due to daylight savings, clocks were turned forward 1 hour on Sunday,
# March 8, 2020, 2:00:00 am, so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")
withr::with_envvar(
  new = c(TZ = "America/New_York"),
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>
withr::with_envvar(
  new = c(TZ = "America/New_York"),
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"
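Why 02:00:00 on that date does not exist in New York can be checked with base R alone. The sketch below does not use polars, and it assumes that the system time zone database includes America/New_York. It compares how much real time separates the same two wall-clock times in each zone.

```r
# Wall-clock times just before and just after the spring-forward gap.
before_gap = as.POSIXct("2020-03-08 01:59:59", tz = "America/New_York")
after_gap = as.POSIXct("2020-03-08 03:00:00", tz = "America/New_York")

# In New York, only one second of real time separates the two:
# the whole 02:00:00-02:59:59 hour is skipped.
gap_ny = as.numeric(difftime(after_gap, before_gap, units = "secs"))

# In UTC, the same wall-clock times are 3601 seconds apart, so
# 02:00:00 is an ordinary instant there.
before_utc = as.POSIXct("2020-03-08 01:59:59", tz = "UTC")
after_utc = as.POSIXct("2020-03-08 03:00:00", tz = "UTC")
gap_utc = as.numeric(difftime(after_utc, before_utc, units = "secs"))

in_gap_utc = as.POSIXct("2020-03-08 02:00:00", tz = "UTC")
formatted_utc = format(in_gap_utc, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Here `gap_ny` is 1 and `gap_utc` is 3601, which is why attaching an explicit time zone such as UTC before conversion avoids the error.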
Examples
library(polars)
# see all public exported method names (normally accessed via a class
# instance with $)
ls(.pr$env$RPolarsDataFrame)
#> [1] "clear" "clone" "columns" "describe"
#> [5] "drop" "drop_in_place" "drop_nulls" "dtype_strings"
#> [9] "dtypes" "equals" "estimated_size" "explode"
#> [13] "fill_nan" "fill_null" "filter" "first"
#> [17] "flags" "get_column" "get_columns" "glimpse"
#> [21] "group_by" "group_by_dynamic" "head" "height"
#> [25] "item" "join" "join_asof" "last"
#> [29] "lazy" "limit" "max" "mean"
#> [33] "median" "melt" "min" "n_chunks"
#> [37] "null_count" "partition_by" "pivot" "print"
#> [41] "quantile" "rechunk" "rename" "reverse"
#> [45] "rolling" "sample" "schema" "select"
#> [49] "select_seq" "shape" "shift" "shift_and_fill"
#> [53] "slice" "sort" "sql" "std"
#> [57] "sum" "tail" "to_data_frame" "to_list"
#> [61] "to_series" "to_struct" "transpose" "unique"
#> [65] "unnest" "var" "width" "with_columns"
#> [69] "with_columns_seq" "with_row_index" "write_csv" "write_ipc"
#> [73] "write_json" "write_ndjson" "write_parquet"
# see all private methods (not intended for regular use)
ls(.pr$DataFrame)
#> [1] "clear" "clone_in_rust"
#> [3] "columns" "default"
#> [5] "drop_all_in_place" "drop_in_place"
#> [7] "dtype_strings" "dtypes"
#> [9] "equals" "estimated_size"
#> [11] "export_stream" "from_arrow_record_batches"
#> [13] "get_column" "get_columns"
#> [15] "lazy" "melt"
#> [17] "n_chunks" "new_with_capacity"
#> [19] "null_count" "partition_by"
#> [21] "pivot_expr" "print"
#> [23] "rechunk" "sample_frac"
#> [25] "sample_n" "schema"
#> [27] "select" "select_at_idx"
#> [29] "select_seq" "set_column_from_robj"
#> [31] "set_column_from_series" "set_column_names_mut"
#> [33] "shape" "to_list"
#> [35] "to_list_tag_structs" "to_list_unwind"
#> [37] "to_struct" "transpose"
#> [39] "unnest" "with_columns"
#> [41] "with_columns_seq" "with_row_index"
#> [43] "write_csv" "write_ipc"
#> [45] "write_json" "write_ndjson"
#> [47] "write_parquet"
# make an object
df = as_polars_df(iris)
# call an active binding
df$shape
#> [1] 150 5
# use a private method, which has mutability
result = .pr$DataFrame$set_column_from_robj(df, 150:1, "some_ints")
# The column now exists in both DataFrame objects, as they are just
# pointers to the same object
# There are no public methods with mutability.
df2 = df
df$columns
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#> [6] "some_ints"
df2$columns
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#> [6] "some_ints"
# show flags
df$sort("Sepal.Length")$flags
#> $Sepal.Length
#> $Sepal.Length$SORTED_ASC
#> [1] TRUE
#>
#> $Sepal.Length$SORTED_DESC
#> [1] FALSE
#>
#>
#> $Sepal.Width
#> $Sepal.Width$SORTED_ASC
#> [1] FALSE
#>
#> $Sepal.Width$SORTED_DESC
#> [1] FALSE
#>
#>
#> $Petal.Length
#> $Petal.Length$SORTED_ASC
#> [1] FALSE
#>
#> $Petal.Length$SORTED_DESC
#> [1] FALSE
#>
#>
#> $Petal.Width
#> $Petal.Width$SORTED_ASC
#> [1] FALSE
#>
#> $Petal.Width$SORTED_DESC
#> [1] FALSE
#>
#>
#> $Species
#> $Species$SORTED_ASC
#> [1] FALSE
#>
#> $Species$SORTED_DESC
#> [1] FALSE
#>
#>
#> $some_ints
#> $some_ints$SORTED_ASC
#> [1] FALSE
#>
#> $some_ints$SORTED_DESC
#> [1] FALSE
# The set_column_from_robj method is fallible and returns a result, which
# can be "ok" or an error.
# No public method or function will ever return a result.
# The `result` is very similar to the output of functions decorated
# with purrr::safely.
# To use results on the R side, they must be unwrapped first so that
# potential errors can be thrown. `unwrap(result)` is a way to communicate
# errors happening on the Rust side to the R side. The default `Extendr`
# behavior is to use `panic!`(s), which would cause unnecessarily confusing
# and very verbose error messages about the inner workings of Rust.
# In this case, `unwrap(result)` raises no error and returns just NULL,
# because this mutable method does not return any ok-value.
# Try unwrapping an error from polars due to unmatching column lengths
err_result = .pr$DataFrame$set_column_from_robj(df, 1:10000, "wrong_length")
tryCatch(unwrap(err_result, call = NULL), error = \(e) cat(as.character(e)))
#> Error in unwrap(err_result, call = NULL): could not find function "unwrap"
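The result-then-unwrap pattern described in the comments above can be sketched in base R. This is only an illustration of the idea, not the polars internals: `ok_sketch`, `err_sketch`, `set_column_sketch` and `unwrap_sketch` are hypothetical helpers, and a plain list of columns stands in for the DataFrame.

```r
# Hypothetical result constructors: exactly one of `ok` / `err` is set.
ok_sketch = function(value = NULL) list(ok = value, err = NULL)
err_sketch = function(message) list(ok = NULL, err = message)

# A fallible helper that returns a result instead of throwing.
set_column_sketch = function(columns, values, name) {
  n_rows = length(columns[[1L]])
  if (length(values) != n_rows) {
    return(err_sketch(sprintf(
      "length mismatch: expected %d values, got %d", n_rows, length(values)
    )))
  }
  columns[[name]] = values
  ok_sketch(columns)
}

# Unwrapping turns an error result into a regular R error,
# and otherwise returns the ok-value.
unwrap_sketch = function(result) {
  if (!is.null(result$err)) stop(result$err, call. = FALSE)
  result$ok
}

cols = list(a = 1:3)
good = unwrap_sketch(set_column_sketch(cols, 3:1, "b"))

bad = set_column_sketch(cols, 1:10, "wrong_length")
msg = tryCatch(unwrap_sketch(bad), error = function(e) conditionMessage(e))
```

Keeping the error inside a returned value until it is unwrapped lets the caller decide where, and with which message, the error is raised on the R side.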