The function works similarly to the classical lm, but with special handling of NAs. Whereas lm simply ignores missing response values, pd_lm applies a probabilistic dropout model, which assumes that missing values occur because of the dropout curve. The dropout curve describes, for each position, the chance that a value is missed. A negative dropout_curve_scale means that the lower the intensity, the more likely the value is to be missed.

pd_lm(formula, data = NULL, subset = NULL, dropout_curve_position,
dropout_curve_scale, location_prior_mean = NULL,
location_prior_scale = NULL, variance_prior_scale = NULL,
variance_prior_df = NULL, location_prior_df = 3,
verbose = FALSE)

## Arguments

formula

a formula that specifies a linear model

data

an optional data.frame whose columns can be used to specify the formula

subset

an optional selection vector for data to subset it

dropout_curve_position

the value where the chance to observe a value is 50%. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional.

dropout_curve_scale

the width of the dropout curve. Smaller values mean that the sigmoidal curve is steeper. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional.

location_prior_mean, location_prior_scale

the optional mean and variance of the prior around which the predictions are supposed to scatter. If no value is provided, no location regularization is applied.

variance_prior_scale, variance_prior_df

the optional scale and degrees of freedom of the variance prior. If no value is provided, no variance regularization is applied.

location_prior_df

the degrees of freedom for the t-distribution of the location prior. If it is large (> 30), the prior is approximately Normal. Default: 3

method

one of 'analytic_hessian', 'analytic_gradient', or 'numeric'. If 'analytic_hessian', the nlminb optimization routine is used with the hand-derived first and second derivatives. Otherwise, optim is used, either with or without the first derivative.

verbose

boolean that signals if the method prints informative messages. Default: FALSE.
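To make the roles of dropout_curve_position and dropout_curve_scale concrete, the following is a minimal sketch of a dropout curve. It assumes an inverse-probit (cumulative normal) shape; this page only states that the curve is sigmoidal, so the exact functional form is an assumption, and `dropout_prob` is a hypothetical helper, not a function exported by the package.

```r
# Hypothetical helper (not part of the package API): probability that a value
# at intensity x is missing, ASSUMING a cumulative-normal dropout curve.
dropout_prob <- function(x, position, scale) {
  pnorm((x - position) / scale)
}

# With a negative scale, low intensities are more likely to be missed.
x <- c(15, 19, 23)
p_missing <- dropout_prob(x, position = 19, scale = -1)

# A smaller absolute scale gives a steeper curve around the position.
p_steep <- dropout_prob(18, position = 19, scale = -0.5)
p_flat  <- dropout_prob(18, position = 19, scale = -2)
```

Under this assumption, a value exactly at the position has a 50% chance of being missed, and `p_steep` is larger than `p_flat` because the steeper curve approaches 1 more quickly below the position.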

## Value

a list with the following entries

coefficients

a named vector with the fitted coefficient values

coef_variance_matrix

a p*p matrix with the variance associated with each coefficient estimate

n_approx

the estimated "size" of the data set (n_hat - variance_prior_df)

df

the estimated degrees of freedom (n_hat - p)

s2

the estimated unbiased variance

n_obs

the number of response values that were not NA
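The relations between these entries can be checked by hand. The sketch below copies the numbers printed by the all-missing example on this page (one coefficient, so p = 1, and variance_prior_df = 2) and verifies that they agree with the definitions of df and n_approx given above. Reading the square root of the diagonal of coef_variance_matrix as a standard error is an interpretation, not something this page states.

```r
# Values copied from the printed output of the all-missing example (p = 1).
p <- 1
variance_prior_df <- 2
df <- 1.071048
n_approx <- 0.07104773

# Recover n_hat from df = n_hat - p, then recompute n_approx from
# n_approx = n_hat - variance_prior_df.
n_hat <- df + p
n_approx_check <- n_hat - variance_prior_df

# Assumed interpretation: the square root of the diagonal of the
# coefficient variance matrix is the standard error of each coefficient.
coef_variance_matrix <- matrix(0.3879792, nrow = 1, ncol = 1,
                               dimnames = list("Intercept", "Intercept"))
std_error <- sqrt(diag(coef_variance_matrix))
```

The recomputed `n_approx_check` matches the printed n_approx up to the rounding of the printed df.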

## Examples

# Without missing values
y <- rnorm(5, mean = 20)
lm(y ~ 1)
#>
#> Call:
#> lm(formula = y ~ 1)
#>
#> Coefficients:
#> (Intercept)
#>       20.02
#>
pd_lm(y ~ 1,
      dropout_curve_position = NA,
      dropout_curve_scale = NA)
#> $coefficients
#> Intercept
#>  20.02302
#>
#> $coef_variance_matrix
#>           Intercept
#> Intercept 0.2259314
#>
#> $n_approx
#> [1] 5
#>
#> $df
#> [1] 4
#>
#> $s2
#> [1] 1.129657
#>
#> $n_obs
#> [1] 5
#>

# With some missing values
y <- c(23, 21.4, NA)
lm(y ~ 1)
#>
#> Call:
#> lm(formula = y ~ 1)
#>
#> Coefficients:
#> (Intercept)
#>        22.2
#>
pd_lm(y ~ 1,
      dropout_curve_position = 19,
      dropout_curve_scale = -1)
#> $coefficients
#> Intercept
#>  20.90007
#>
#> $coef_variance_matrix
#>           Intercept
#> Intercept  3.400309
#>
#> $n_approx
#> [1] 1.451011
#>
#> $df
#> [1] 0.4510113
#>
#> $s2
#> [1] 13.95306
#>
#> $n_obs
#> [1] 2
#>

# With only missing values
y <- c(NA, NA, NA)
# lm(y ~ 1)  # Fails
pd_lm(y ~ 1,
      dropout_curve_position = 19,
      dropout_curve_scale = -1,
      location_prior_mean = 21,
      location_prior_scale = 3,
      variance_prior_scale = 0.1,
      variance_prior_df = 2)
#> $coefficients
#> Intercept
#>  18.77828
#>
#> $coef_variance_matrix
#>           Intercept
#> Intercept 0.3879792
#>
#> $n_approx
#> [1] 0.07104773
#>
#> $df
#> [1] 1.071048
#>
#> $s2
#> [1] 0.1897697
#>
#> $n_obs
#> [1] 0
#>