The function works similarly to the classical lm, but with special handling of NA's. Whereas lm usually just ignores response values that are missing, pd_lm applies a probabilistic dropout model that assumes missing values occur because of the dropout curve. The dropout curve describes, for each position, the chance that a value is missed. A negative dropout_curve_scale means that the lower the intensity was, the more likely it is that the value is missing.
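
To get an intuition for the two dropout curve parameters, the following sketch plots one plausible probit-shaped curve. It is illustrative only: the exact sigmoid used internally may differ, and the helper dropout_chance is made up for this example.

# Illustrative sketch, not the internal implementation: a probit-shaped
# dropout curve. With a negative scale the chance of missing a value
# grows as the intensity decreases and is 50% at dropout_curve_position.
dropout_chance <- function(x, position, scale) {
  pnorm((x - position) / scale)
}
curve(dropout_chance(x, position = 19, scale = -1), from = 14, to = 24,
      xlab = "intensity", ylab = "chance that the value is missing")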

pd_lm(formula, data = NULL, subset = NULL, dropout_curve_position,
  dropout_curve_scale, location_prior_mean = NULL,
  location_prior_scale = NULL, variance_prior_scale = NULL,
  variance_prior_df = NULL, location_prior_df = 3,
  method = c("analytic_hessian", "analytic_grad", "numeric"),
  verbose = FALSE)

Arguments

formula

a formula that specifies a linear model

data

an optional data.frame whose columns can be used to specify the formula

subset

an optional selection vector for data to subset it

dropout_curve_position

the value where the chance to observe a value is 50%. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional.

dropout_curve_scale

the width of the dropout curve. Smaller values mean that the sigmoidal curve is steeper. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional.

location_prior_mean, location_prior_scale

the optional mean and variance of the prior around which the predictions are supposed to scatter. If no value is provided, no location regularization is applied. See the sketch after this argument list for the effect of the location prior.

variance_prior_scale, variance_prior_df

the optional scale and degrees of freedom of the variance prior. If no value is provided, no variance regularization is applied.

location_prior_df

The degrees of freedom for the t-distribution of the location prior. If it is large (> 30) the prior is approximately Normal. Default: 3

method

one of 'analytic_hessian', 'analytic_grad', or 'numeric'. If 'analytic_hessian', the nlminb optimization routine is used with the hand-derived first and second derivatives. Otherwise, optim is used, either with or without the first derivative.

verbose

boolean that signals if the method prints informative messages. Default: FALSE.
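
The following sketch illustrates the effect of the location prior on a small data set. The numbers are made up and the exact estimates depend on the optimization, but with a prior the intercept is pulled towards location_prior_mean.

# Sketch: fitting the same data with and without a location prior.
y <- c(23, 21.4, NA)
fit_plain <- pd_lm(y ~ 1,
                   dropout_curve_position = 19, dropout_curve_scale = -1)
fit_reg <- pd_lm(y ~ 1,
                 dropout_curve_position = 19, dropout_curve_scale = -1,
                 location_prior_mean = 18, location_prior_scale = 1)
# The regularized intercept is expected to lie between the plain
# estimate and the prior mean of 18.
c(plain = unname(fit_plain$coefficients),
  regularized = unname(fit_reg$coefficients))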

Value

a list with the following entries. The sketch after the list shows how they can be combined by hand into a t-statistic.

coefficients

a named vector with the fitted coefficient values

coef_variance_matrix

a p*p matrix with the variance associated with each coefficient estimate

n_approx

the estimated "size" of the data set (n_hat - variance_prior_df)

df

the estimated degrees of freedom (n_hat - p)

s2

the estimated unbiased variance

n_obs

the number of response values that were not `NA`
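
The returned entries can be combined manually for a Wald-style test of a single coefficient. The sketch below is illustrative only and not a function provided by pd_lm; it reuses the small data set from the Examples section.

# Sketch: a hand-rolled t-statistic from the entries returned by pd_lm.
y <- c(23, 21.4, NA)
fit <- pd_lm(y ~ 1, dropout_curve_position = 19, dropout_curve_scale = -1)
est <- unname(fit$coefficients["Intercept"])
se  <- sqrt(fit$coef_variance_matrix["Intercept", "Intercept"])
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df = fit$df)
c(estimate = est, se = se, t = t_stat, p = p_value)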

Examples

# Without missing values
y <- rnorm(5, mean=20)
lm(y ~ 1)
#> 
#> Call:
#> lm(formula = y ~ 1)
#> 
#> Coefficients:
#> (Intercept)  
#>       20.02  
#> 
pd_lm(y ~ 1, dropout_curve_position = NA, dropout_curve_scale = NA)
#> $coefficients
#> Intercept 
#>  20.02302 
#> 
#> $coef_variance_matrix
#>           Intercept
#> Intercept 0.2259314
#> 
#> $n_approx
#> [1] 5
#> 
#> $df
#> [1] 4
#> 
#> $s2
#> [1] 1.129657
#> 
#> $n_obs
#> [1] 5
#> 
# With some missing values
y <- c(23, 21.4, NA)
lm(y ~ 1)
#> 
#> Call:
#> lm(formula = y ~ 1)
#> 
#> Coefficients:
#> (Intercept)  
#>        22.2  
#> 
pd_lm(y ~ 1, dropout_curve_position = 19, dropout_curve_scale = -1)
#> $coefficients
#> Intercept 
#>  20.90007 
#> 
#> $coef_variance_matrix
#>           Intercept
#> Intercept 3.400309
#> 
#> $n_approx
#> [1] 1.451011
#> 
#> $df
#> [1] 0.4510113
#> 
#> $s2
#> [1] 13.95306
#> 
#> $n_obs
#> [1] 2
#> 
# With only missing values
y <- c(NA, NA, NA)
# lm(y ~ 1) # Fails
pd_lm(y ~ 1, dropout_curve_position = 19, dropout_curve_scale = -1,
      location_prior_mean = 21, location_prior_scale = 3,
      variance_prior_scale = 0.1, variance_prior_df = 2)
#> $coefficients
#> Intercept 
#>  18.77828 
#> 
#> $coef_variance_matrix
#>           Intercept
#> Intercept 0.3879792
#> 
#> $n_approx
#> [1] 0.07104773
#> 
#> $df
#> [1] 1.071048
#> 
#> $s2
#> [1] 0.1897697
#> 
#> $n_obs
#> [1] 0
#> 
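
The formula/data/subset interface and per-row dropout curve parameters can be combined as in the following sketch. The data frame and its column names are made up for illustration, and no output is shown.

# Sketch: two-group design with per-row dropout curve parameters.
sim_data <- data.frame(
  y = c(20.1, NA, 19.5, 22.3, 21.8, NA),
  group = rep(c("ctrl", "trt"), each = 3)
)
fit <- pd_lm(y ~ group, data = sim_data,
             dropout_curve_position = rep(18, nrow(sim_data)),
             dropout_curve_scale = rep(-1, nrow(sim_data)))
fit$coefficients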