The function works similarly to the classical `lm` but with special handling of `NA`s. Whereas `lm` usually just ignores response values that are missing, `pd_lm` applies a probabilistic dropout model, which assumes that missing values occur because of the dropout curve. The dropout curve describes for each position the chance that a value is missed. A negative `dropout_curve_scale` means that the lower the intensity was, the more likely it is to miss the value.
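To make the role of the two dropout parameters concrete, here is a minimal sketch of a sigmoidal dropout curve. It assumes an inverse-probit (cumulative normal) shape, which this page does not state explicitly, and the helper name `dropout_prob` is hypothetical and not part of the package.

```r
# Hypothetical helper: probability that a value at intensity `x` is missed.
# Assumption: the sigmoid is a cumulative normal (inverse probit) curve.
dropout_prob <- function(x, position, scale) {
  pnorm((x - position) / scale)
}

intensities <- c(17, 19, 21)

# Negative scale: lower intensities are more likely to be missed.
dropout_prob(intensities, position = 19, scale = -1)

# A smaller absolute scale gives a steeper transition around `position`.
dropout_prob(intensities, position = 19, scale = -0.1)
```

At `x = position` the sketch returns 0.5, which matches the definition of `dropout_curve_position` as the point where a value is observed half of the time.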
pd_lm(formula, data = NULL, subset = NULL, dropout_curve_position, dropout_curve_scale, location_prior_mean = NULL, location_prior_scale = NULL, variance_prior_scale = NULL, variance_prior_df = NULL, location_prior_df = 3, method = c("analytic_hessian", "analytic_grad", "numeric"), verbose = FALSE)
formula | a formula that specifies a linear model |
---|---|
data | an optional data.frame whose columns can be used to specify the |
subset | an optional selection vector for data to subset it |
dropout_curve_position | the value where the chance to observe a value is 50%. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional. |
dropout_curve_scale | the width of the dropout curve. Smaller values mean that the sigmoidal curve is steeper. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional. |
location_prior_mean, location_prior_scale | the optional mean and variance of the prior around which the predictions are supposed to scatter. If no value is provided, no location regularization is applied. |
variance_prior_scale, variance_prior_df | the optional scale and degrees of freedom of the variance prior. If no value is provided, no variance regularization is applied. |
location_prior_df | The degrees of freedom for the t-distribution of the location prior. If it is large (> 30) the prior is approximately Normal. Default: 3 |
method | one of 'analytic_hessian', 'analytic_grad', or 'numeric'. If 'analytic_hessian' the |
verbose | boolean that signals if the method prints informative messages. Default: `FALSE` |
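The remark on `location_prior_df` can be checked with base R. The sketch below compares t-densities with the standard normal density on a grid. The grid and the variable names are illustrative choices and are not part of `pd_lm`.

```r
# Compare t-densities with the standard normal density on a grid.
x <- seq(-4, 4, by = 0.01)

# Largest absolute gap between each t-density and the normal density.
gap_df3  <- max(abs(dt(x, df = 3)  - dnorm(x)))
gap_df30 <- max(abs(dt(x, df = 30) - dnorm(x)))

# The gap shrinks as the degrees of freedom grow.
c(df3 = gap_df3, df30 = gap_df30)
```

With the default of 3 degrees of freedom the prior has heavier tails than a Normal prior, so it penalizes values far from `location_prior_mean` less strongly.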
a list with the following entries

coefficients | a named vector with the fitted coefficient values |
---|---|
coef_variance_matrix | a p*p matrix with the variance associated with each coefficient estimate |
n_approx | the estimated "size" of the data set (n_hat - variance_prior_df) |
df | the estimated degrees of freedom (n_hat - p) |
s2 | the estimated unbiased variance |
n_obs | the number of response values that were not `NA` |
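The two formulas in parentheses can be checked against the results printed in the examples on this page. The sketch below only redoes that arithmetic for the intercept-only models (p = 1). It assumes that a missing variance prior counts as zero prior degrees of freedom, and all variable names are illustrative.

```r
# Values copied from the printed examples on this page (intercept-only, p = 1).
p <- 1

# Example without priors: assume the prior contributes 0 degrees of freedom,
# so n_hat equals n_approx.
n_approx_1 <- 5
df_1 <- (n_approx_1 + 0) - p

# Example with variance_prior_df = 2 and only missing values:
# invert n_approx = n_hat - variance_prior_df to recover n_hat.
variance_prior_df <- 2
n_approx_3 <- 0.07104773
n_hat_3 <- n_approx_3 + variance_prior_df
df_3 <- n_hat_3 - p

c(df_1 = df_1, df_3 = df_3)
```

Both results agree with the printed `$df` values of 4 and 1.071048.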
#>
#> Call:
#> lm(formula = y ~ 1)
#>
#> Coefficients:
#> (Intercept)
#>       20.02
#>
pd_lm(y ~ 1, dropout_curve_position = NA, dropout_curve_scale = NA)
#> $coefficients
#> Intercept
#>  20.02302
#>
#> $coef_variance_matrix
#>           Intercept
#> Intercept 0.2259314
#>
#> $n_approx
#> [1] 5
#>
#> $df
#> [1] 4
#>
#> $s2
#> [1] 1.129657
#>
#> $n_obs
#> [1] 5
#>

#>
#> Call:
#> lm(formula = y ~ 1)
#>
#> Coefficients:
#> (Intercept)
#>        22.2
#>
pd_lm(y ~ 1, dropout_curve_position = 19, dropout_curve_scale = -1)
#> $coefficients
#> Intercept
#>  20.90007
#>
#> $coef_variance_matrix
#>           Intercept
#> Intercept  3.400309
#>
#> $n_approx
#> [1] 1.451011
#>
#> $df
#> [1] 0.4510113
#>
#> $s2
#> [1] 13.95306
#>
#> $n_obs
#> [1] 2
#>

# With only missing values
y <- c(NA, NA, NA)
# lm(y ~ 1) # Fails
pd_lm(y ~ 1, dropout_curve_position = 19, dropout_curve_scale = -1,
      location_prior_mean = 21, location_prior_scale = 3,
      variance_prior_scale = 0.1, variance_prior_df = 2)
#> $coefficients
#> Intercept
#>  18.77828
#>
#> $coef_variance_matrix
#>           Intercept
#> Intercept 0.3879792
#>
#> $n_approx
#> [1] 0.07104773
#>
#> $df
#> [1] 1.071048
#>
#> $s2
#> [1] 0.1897697
#>
#> $n_obs
#> [1] 0
#>
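The returned list can be post-processed like any other R list. As a rough sketch, the code below derives a standard error and an interval from the entries of the first printed result. The list `fit` is built by hand from those printed numbers and is not produced by calling `pd_lm`. Using a t reference distribution with `df` degrees of freedom is an assumption made here for illustration.

```r
# Hand-built list mirroring the first printed result; not a call to pd_lm.
fit <- list(
  coefficients = c(Intercept = 20.02302),
  coef_variance_matrix = matrix(0.2259314, nrow = 1, ncol = 1,
                                dimnames = list("Intercept", "Intercept")),
  df = 4
)

# Standard error: square root of the diagonal of the variance matrix.
se <- sqrt(diag(fit$coef_variance_matrix))

# Two-sided 95% interval around the coefficient.
# Assumption: a t-distribution with `df` degrees of freedom is appropriate.
ci <- fit$coefficients + c(-1, 1) * qt(0.975, df = fit$df) * se

se
ci
```

The same pattern extends to models with several coefficients, where `diag()` returns one variance per coefficient.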