Fit a single linear probabilistic dropout model

The function works similarly to the classical lm, but with special handling of NAs. Whereas lm simply ignores missing response values, pd_lm applies a probabilistic dropout model, which assumes that missing values occur because of the dropout curve. The dropout curve describes, for each position, the chance that a value is missed. A negative dropout_curve_scale means that the lower the intensity, the more likely the value is to be missed.
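To make the role of the dropout curve concrete, here is a minimal sketch of a sigmoidal curve with a position and a scale. It assumes a cumulative-normal shape; this page only states that the curve is sigmoidal, so the exact functional form is an assumption, and `dropout_prob` is a hypothetical helper that is not part of the package.

```r
# Hypothetical helper (not a package function): probability that a value
# with the given intensity is missed, ASSUMING a cumulative-normal sigmoid.
dropout_prob <- function(intensity, position, scale) {
  pnorm((intensity - position) / scale)
}

intensities <- c(15, 19, 23)

# With a negative scale, lower intensities are more likely to be missed,
# and the chance is exactly 50% at the curve position.
p_miss <- dropout_prob(intensities, position = 19, scale = -1)
print(p_miss)
```

Under this assumption, shrinking the absolute value of `scale` makes the transition around `position` steeper.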

pd_lm(formula, data = NULL, subset = NULL, dropout_curve_position,
  dropout_curve_scale, location_prior_mean = NULL,
  location_prior_scale = NULL, variance_prior_scale = NULL,
  variance_prior_df = NULL, location_prior_df = 3,
  method = c("analytic_hessian", "analytic_grad", "numeric"),
  verbose = FALSE)

Arguments

formula	a formula that specifies a linear model
data	an optional data.frame whose columns can be used to specify the `formula`
subset	an optional selection vector used to subset `data`
dropout_curve_position	the value where the chance to observe a value is 50%. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional.
dropout_curve_scale	the width of the dropout curve. Smaller values mean that the sigmoidal curve is steeper. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional.
location_prior_mean, location_prior_scale	the optional mean and variance of the prior around which the predictions are supposed to scatter. If no value is provided, no location regularization is applied.
variance_prior_scale, variance_prior_df	the optional scale and degrees of freedom of the variance prior. If no value is provided, no variance regularization is applied.
location_prior_df	the degrees of freedom for the t-distribution of the location prior. If it is large (> 30), the prior is approximately normal. Default: 3
method	one of 'analytic_hessian', 'analytic_grad', or 'numeric'. If 'analytic_hessian', the `nlminb` optimization routine is used with the hand-derived first and second derivatives. Otherwise, `optim` is used, either with or without the first derivative.
verbose	boolean that signals if the method prints informative messages. Default: `FALSE`.
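The `method` argument is declared as a character vector of choices. A common R idiom for such arguments is `match.arg`, where the first element acts as the default. The sketch below illustrates that idiom with a hypothetical helper `choose_method`; whether `pd_lm` resolves `method` exactly this way is an assumption.

```r
# Hypothetical helper illustrating the usual match.arg idiom; the choices
# are copied from the usage signature on this page.
choose_method <- function(method = c("analytic_hessian", "analytic_grad", "numeric")) {
  match.arg(method)
}

default_method <- choose_method()           # falls back to the first choice
numeric_method <- choose_method("numeric")  # explicit selection
print(c(default_method, numeric_method))
```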

Value

a list with the following entries

coefficients: a named vector with the fitted coefficient values
coef_variance_matrix: a p*p matrix with the variance associated with each coefficient estimate
n_approx: the estimated "size" of the data set (n_hat - variance_prior_df)
df: the estimated degrees of freedom (n_hat - p)
s2: the estimated unbiased variance
n_obs: the number of response values that were not `NA`
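As a rough illustration of how these entries might be used together, the sketch below derives a standard error and a t-based interval from `coefficients`, `coef_variance_matrix`, and `df`. The list is built by hand from the printed output of the first example on this page (no missing values), so it runs without the package. Treating the interval as t-distributed with `df` degrees of freedom is an assumption, not something this page states.

```r
# Values copied from the first example's printed output on this page.
fit <- list(
  coefficients = c(Intercept = 20.02302),
  coef_variance_matrix = matrix(0.2259314, nrow = 1,
                                dimnames = list("Intercept", "Intercept")),
  df = 4
)

# Standard errors are the square roots of the diagonal of the variance matrix.
std_err <- sqrt(diag(fit$coef_variance_matrix))

# ASSUMPTION: a t-interval with `df` degrees of freedom is appropriate here.
t_crit <- qt(0.975, df = fit$df)
conf_int <- cbind(lower = fit$coefficients - t_crit * std_err,
                  upper = fit$coefficients + t_crit * std_err)
print(conf_int)
```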

Examples

  # Without missing values
  y <- rnorm(5, mean=20)
  lm(y ~ 1)
#> 
#> Call:
#> lm(formula = y ~ 1)
#> 
#> Coefficients:
#> (Intercept)  
#>       20.02  
#> 
  pd_lm(y ~ 1,
        dropout_curve_position = NA,
        dropout_curve_scale = NA)
#> $coefficients
#> Intercept 
#>  20.02302 
#> 
#> $coef_variance_matrix
#>           Intercept
#> Intercept 0.2259314
#> 
#> $n_approx
#> [1] 5
#> 
#> $df
#> [1] 4
#> 
#> $s2
#> [1] 1.129657
#> 
#> $n_obs
#> [1] 5
#> 

  # With some missing values
  y <- c(23, 21.4, NA)
  lm(y ~ 1)
#> 
#> Call:
#> lm(formula = y ~ 1)
#> 
#> Coefficients:
#> (Intercept)  
#>        22.2  
#> 
  pd_lm(y ~ 1,
        dropout_curve_position = 19,
        dropout_curve_scale = -1)
#> $coefficients
#> Intercept 
#>  20.90007 
#> 
#> $coef_variance_matrix
#>           Intercept
#> Intercept  3.400309
#> 
#> $n_approx
#> [1] 1.451011
#> 
#> $df
#> [1] 0.4510113
#> 
#> $s2
#> [1] 13.95306
#> 
#> $n_obs
#> [1] 2
#> 


  # With only missing values
  y <- c(NA, NA, NA)
  # lm(y ~ 1)  # Fails
  pd_lm(y ~ 1,
        dropout_curve_position = 19,
        dropout_curve_scale = -1,
        location_prior_mean = 21,
        location_prior_scale = 3,
        variance_prior_scale = 0.1,
        variance_prior_df = 2)
#> $coefficients
#> Intercept 
#>  18.77828 
#> 
#> $coef_variance_matrix
#>           Intercept
#> Intercept 0.3879792
#> 
#> $n_approx
#> [1] 0.07104773
#> 
#> $df
#> [1] 1.071048
#> 
#> $s2
#> [1] 0.1897697
#> 
#> $n_obs
#> [1] 0
#>
