Inner workings of the DataFrame-class

Description

The DataFrame class consists of two environments that hold, respectively, the public and the private methods (function calls into the Rust side of polars). An instantiated DataFrame object is an externalptr to a low-level Rust polars DataFrame object.

The S3 method .DollarNames.RPolarsDataFrame exposes all public $foobar() methods that can be called on the object. Most methods return another DataFrame class instance or a similar object, which allows for method chaining. This class system could be called "environment classes" (for lack of a better name) and is the same class system extendr provides, except that here there is both a public and a private set of methods. For implementation reasons, the private methods are external and must be called as .pr$DataFrame$methodname(). In addition, every private method must take self as an explicit argument, which makes them pure functions. Having the private methods as pure functions solved or simplified self-referential complications.
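The idea of an "environment class" with public methods and pure private functions can be sketched in base R. This is only an illustration under stated assumptions: the names Toy, new_toy, toy_public and toy_private are hypothetical and do not exist in polars, and the real implementation may differ in its details.

```r
# Hypothetical toy class; none of these names are part of polars.

# Private methods are pure functions: `self` is always an explicit argument.
toy_private <- new.env()
toy_private$get_n <- function(self) attr(self, "n", exact = TRUE)

# Public methods refer to a free variable `self`, which is bound at call time.
toy_public <- new.env()
toy_public$double <- function() new_toy(toy_private$get_n(self) * 2)
toy_public$value <- function() toy_private$get_n(self)

# Constructor: the payload lives in an attribute of a classed object.
new_toy <- function(n) structure(list(), n = n, class = "Toy")

# `$` looks the method up in the public environment and rebinds its
# enclosing environment so that `self` is visible inside the method body.
`$.Toy` <- function(self, name) {
  f <- toy_public[[name]]
  if (is.null(f)) stop("no such method: ", name)
  environment(f) <- environment()
  f
}

# Completion support: list the public method names.
.DollarNames.Toy <- function(x, pattern = "") {
  grep(pattern, ls(toy_public), value = TRUE)
}

# Methods returning a new instance allow chaining.
chained <- new_toy(3)$double()$double()$value()
```

Because the private functions never look up self implicitly, they can be called from anywhere without going through the `$` dispatch.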

Details

Check out the source code in R/dataframe_frame.R to see how public methods are derived from private methods. Check out extendr-wrappers.R to see the methods auto-generated by extendr. These are moved to .pr and converted into pure external functions in after-wrappers.R. In zzz.R (named zzz so that it is the last file sourced) the extendr methods are removed and replaced by any function prefixed with DataFrame_.

Active bindings

columns

$columns returns a character vector with the column names.

dtypes

$dtypes returns an unnamed list with the data type of each column.

flags

$flags returns a nested list with column names at the top level and column flags in each sublist.

Flags are used internally to avoid doing unnecessary computations, such as sorting a variable that we know is already sorted. The number of flags varies depending on the column type: columns of type array and list have the flags SORTED_ASC, SORTED_DESC, and FAST_EXPLODE, while other column types only have the former two.

SORTED_ASC is set to TRUE when we sort a column in increasing order, so that we can use this information later on to avoid re-sorting it.
SORTED_DESC is similar but applies to sort in decreasing order.

height

$height returns the number of rows in the DataFrame.

schema

$schema returns a named list with the data type of each column.

shape

$shape returns a numeric vector of length two with the number of rows and the number of columns.

width

$width returns the number of columns in the DataFrame.
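A short sketch of how the active bindings listed above are read. They are accessed without parentheses, like fields. The mtcars dataset is used here only as a convenient example; it ships with R and has 32 rows and 11 columns.

```r
library(polars)

# mtcars ships with base R and has 32 rows and 11 columns.
df = as_polars_df(mtcars)

df$height           # number of rows: 32
df$width            # number of columns: 11
df$shape            # rows first, then columns
df$columns          # character vector of column names
names(df$schema)    # the schema is a named list, so its names match $columns
```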

Conversion to R data types considerations

When converting Polars objects such as DataFrames to R objects, for example via the as.data.frame() generic function, each type in the Polars object is converted to an R type. In some cases the conversion fails because no appropriate R value exists. A Datetime type without a time zone is particularly error-prone. Such a column is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (this can be checked with the Sys.timezone() function). If the data contains ambiguous or non-existent times in that time zone, a conversion error occurs. In such cases, either change the session time zone using Sys.setenv(TZ = "UTC") before converting, or use the $dt$replace_time_zone() method on the Datetime column to specify the time zone explicitly before conversion.

# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_envvar(
  new = c(TZ = "America/New_York"),
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_envvar(
  new = c(TZ = "America/New_York"),
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"
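The other remedy mentioned above, changing the session time zone, can be sketched as follows. This is a minimal sketch that assumes the withr package is installed. It sets TZ only temporarily with withr::with_envvar() instead of calling Sys.setenv() globally, and the variable names are chosen for illustration.

```r
library(polars)

# The same time-zone-naive datetime as in the example above.
naive_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

# Convert while the session time zone is UTC, where every wall-clock
# time exists exactly once, so the conversion is expected to succeed.
converted = withr::with_envvar(
  new = c(TZ = "UTC"),
  as.vector(naive_time)
)
```

Limiting the TZ change to a single expression avoids side effects on the rest of the session.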

Examples

library(polars)

# see all public exported method names (normally accessed via a class
# instance with $)
ls(.pr$env$RPolarsDataFrame)

#>  [1] "clear"            "clone"            "columns"          "describe"        
#>  [5] "drop"             "drop_in_place"    "drop_nulls"       "dtype_strings"   
#>  [9] "dtypes"           "equals"           "estimated_size"   "explode"         
#> [13] "fill_nan"         "fill_null"        "filter"           "first"           
#> [17] "flags"            "get_column"       "get_columns"      "glimpse"         
#> [21] "group_by"         "group_by_dynamic" "head"             "height"          
#> [25] "item"             "join"             "join_asof"        "last"            
#> [29] "lazy"             "limit"            "max"              "mean"            
#> [33] "median"           "melt"             "min"              "n_chunks"        
#> [37] "null_count"       "partition_by"     "pivot"            "print"           
#> [41] "quantile"         "rechunk"          "rename"           "reverse"         
#> [45] "rolling"          "sample"           "schema"           "select"          
#> [49] "select_seq"       "shape"            "shift"            "shift_and_fill"  
#> [53] "slice"            "sort"             "sql"              "std"             
#> [57] "sum"              "tail"             "to_data_frame"    "to_list"         
#> [61] "to_series"        "to_struct"        "transpose"        "unique"          
#> [65] "unnest"           "var"              "width"            "with_columns"    
#> [69] "with_columns_seq" "with_row_index"   "write_csv"        "write_ipc"       
#> [73] "write_json"       "write_ndjson"     "write_parquet"

# see all private methods (not intended for regular use)
ls(.pr$DataFrame)

#>  [1] "clear"                     "clone_in_rust"            
#>  [3] "columns"                   "default"                  
#>  [5] "drop_all_in_place"         "drop_in_place"            
#>  [7] "dtype_strings"             "dtypes"                   
#>  [9] "equals"                    "estimated_size"           
#> [11] "export_stream"             "from_arrow_record_batches"
#> [13] "get_column"                "get_columns"              
#> [15] "lazy"                      "melt"                     
#> [17] "n_chunks"                  "new_with_capacity"        
#> [19] "null_count"                "partition_by"             
#> [21] "pivot_expr"                "print"                    
#> [23] "rechunk"                   "sample_frac"              
#> [25] "sample_n"                  "schema"                   
#> [27] "select"                    "select_at_idx"            
#> [29] "select_seq"                "set_column_from_robj"     
#> [31] "set_column_from_series"    "set_column_names_mut"     
#> [33] "shape"                     "to_list"                  
#> [35] "to_list_tag_structs"       "to_list_unwind"           
#> [37] "to_struct"                 "transpose"                
#> [39] "unnest"                    "with_columns"             
#> [41] "with_columns_seq"          "with_row_index"           
#> [43] "write_csv"                 "write_ipc"                
#> [45] "write_json"                "write_ndjson"             
#> [47] "write_parquet"

# make an object
df = as_polars_df(iris)

# call an active binding
df$shape

#> [1] 150   5

# use a private method, which has mutability
result = .pr$DataFrame$set_column_from_robj(df, 150:1, "some_ints")

# After the assignment below, the column exists in both DataFrame objects,
# as they are just pointers to the same object.
# There are no public methods with mutability.
df2 = df

df$columns

#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
#> [6] "some_ints"

df2$columns

#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
#> [6] "some_ints"

# Show flags
df$sort("Sepal.Length")$flags

#> $Sepal.Length
#> $Sepal.Length$SORTED_ASC
#> [1] TRUE
#> 
#> $Sepal.Length$SORTED_DESC
#> [1] FALSE
#> 
#> 
#> $Sepal.Width
#> $Sepal.Width$SORTED_ASC
#> [1] FALSE
#> 
#> $Sepal.Width$SORTED_DESC
#> [1] FALSE
#> 
#> 
#> $Petal.Length
#> $Petal.Length$SORTED_ASC
#> [1] FALSE
#> 
#> $Petal.Length$SORTED_DESC
#> [1] FALSE
#> 
#> 
#> $Petal.Width
#> $Petal.Width$SORTED_ASC
#> [1] FALSE
#> 
#> $Petal.Width$SORTED_DESC
#> [1] FALSE
#> 
#> 
#> $Species
#> $Species$SORTED_ASC
#> [1] FALSE
#> 
#> $Species$SORTED_DESC
#> [1] FALSE
#> 
#> 
#> $some_ints
#> $some_ints$SORTED_ASC
#> [1] FALSE
#> 
#> $some_ints$SORTED_DESC
#> [1] FALSE

# The set_column_from_robj method is fallible and returns a result, which can
# be either "ok" or an error.
# No public method or function will ever return a result.
# A `result` is very similar to the output of functions decorated with
# purrr::safely.
# To use results on the R side, they must be unwrapped first so that
# potential errors can be thrown. `unwrap(result)` is a way to communicate
# errors happening on the Rust side to the R side. The default behavior of
# extendr is to use `panic!`, which would produce unnecessarily confusing
# and very verbose error messages about the inner workings of Rust.
# In this case `unwrap(result)` raises no error and just returns NULL, because
# this mutable method does not return any ok value.

# Try unwrapping an error from polars caused by mismatched column lengths
err_result = .pr$DataFrame$set_column_from_robj(df, 1:10000, "wrong_length")
tryCatch(unwrap(err_result, call = NULL), error = \(e) cat(as.character(e)))

#> Error in unwrap(err_result, call = NULL): could not find function "unwrap"
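The result-and-unwrap pattern described in the comments above can be illustrated in base R. This is only a sketch: toy_result and toy_unwrap are hypothetical helpers, not the polars implementation, and they mimic the purrr::safely-like shape only loosely.

```r
# Hypothetical helpers; not part of polars.

# Evaluate an expression and capture either its value or its error message.
toy_result <- function(expr) {
  tryCatch(
    list(ok = expr, err = NULL),
    error = function(e) list(ok = NULL, err = conditionMessage(e))
  )
}

# Turn a captured error back into an R error, or return the ok value.
toy_unwrap <- function(result) {
  if (!is.null(result$err)) stop(result$err, call. = FALSE)
  result$ok
}

good <- toy_result(sqrt(16))
bad <- toy_result(stop("column lengths do not match"))

value <- toy_unwrap(good)
msg <- tryCatch(toy_unwrap(bad), error = function(e) conditionMessage(e))
```

Keeping the error as data until it is unwrapped lets the caller decide where, and with what context, the error is raised.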