The function fits a linear probabilistic dropout model and infers the hyper-parameters for the location prior, the variance prior, and the dropout curves. In addition it infers for each protein the coefficients that best explain the observed data and the associated uncertainty.
proDA(data, design = ~1, col_data = NULL, reference_level = NULL, data_is_log_transformed = TRUE, moderate_location = TRUE, moderate_variance = TRUE, location_prior_df = 3, n_subsample = nrow(data), max_iter = 20, epsilon = 0.001, verbose = FALSE, ...)
data | a matrix like object ( |
---|---|
design | a specification of the experimental design that
is used to fit the linear model. It can be a |
col_data | a data.frame with one row for each sample in
|
reference_level | a string that specifies which level in a
factor coefficient is used for the intercept. Default:
|
data_is_log_transformed | the raw intensities from mass
spectrometry experiments have a linear mean-variance relation.
This is undesirable and can be removed by working on the log
scale. The easiest way to find out if the data is already log-
transformed is to see if the intensities are in the range of
`0` to `100` in which case they are transformed or if they rather
are between `1e5` to `1e12`, in which case they are not.
Default: |
moderate_location, moderate_variance | boolean values
to indicate if the location and the variances are
moderated. Default: |
location_prior_df | the number of degrees of freedom used
for the location prior. A large number (> 30) means that the
prior is approximately Normal. Default: |
n_subsample | the number of proteins that are used to estimate the
hyper-parameter. Reducing this number can speed up the fitting, but
also mean that the final estimate is less precise. By default all
proteins are used. Default: |
max_iter | the maximum of iterations |
epsilon | if the remaining error is smaller than |
verbose | boolean that signals if the method prints messages
during the fitting. Default: |
... | additional parameters for the construction of the 'proDAFit' object |
An object of class 'proDAFit'. The object contains information
on the hyper-parameters and feature parameters, the convergence,
the experimental design etc. Internally, it is a sub-class of
SummarizedExperiment
which means the object is subsettable.
The `$`-operator is overloaded for this object to make it easy to
discover applicable functions.
By default, the method is moderating the locations and the variance of each protein estimate. The variance moderation is fairly standard in high-throughput experiments and can boost the power to detect differentially abundant proteins. The location moderation is important to handle the edge case where in one condition a protein is not observed in any sample. In addition it can help to get more precise estimates of the difference between conditions. Unlike 'DESeq2', which moderates the coefficient estimates (ie. the "betas") to be centered around zero, 'proDA' penalizes predicted intensities that strain far from the other observed intensities.
# Quick start # Import the proDA package if you haven't already done so # library(proDA) set.seed(1) syn_data <- generate_synthetic_data(n_proteins = 10) fit <- proDA(syn_data$Y, design = syn_data$groups) fit#> Parameters of the probabilistic dropout model #> #> The dataset contains 6 samples and 10 proteins #> 25% of the values are missing #> #> Experimental design was specified using a design matrix (design(object)). #> The model has successfully converged. #> #> The inferred parameters are: #> location_prior_mean: 19.9 #> location_prior_scale: 1.37 #> location_prior_df: 3 #> variance_prior_scale: 0.0443 #> variance_prior_df: 1.53 #> dropout_curve_position: 19.4, 18.7, 19.4, 19.3, ... #> dropout_curve_scale: -0.449, -0.53, -0.0001, -0.0001, ...result_names(fit)#> [1] "Condition_1" "Condition_2"#> name pval adj_pval diff t_statistic se df #> 1 protein_1 0.37987967 0.7597593 0.36909141 0.9861700 0.37426753 4 #> 2 protein_2 0.84241881 0.9360209 0.03563891 0.2120721 0.16805089 4 #> 3 protein_3 0.31441413 0.7597593 0.19847517 1.1494762 0.17266575 4 #> 4 protein_4 0.79872071 0.9360209 -0.08418268 -0.2725078 0.30891847 4 #> 5 protein_5 0.37920981 0.7597593 0.14706464 0.9877101 0.14889454 4 #> 6 protein_6 0.78483892 0.9360209 -0.08266169 -0.2919524 0.28313414 4 #> 7 protein_7 0.58407104 0.9360209 0.17598840 0.5946679 0.29594400 4 #> 8 protein_8 0.05820367 0.2910184 -0.24336063 -2.6297135 0.09254264 4 #> 9 protein_9 0.96944777 0.9694478 0.01624257 0.0407504 0.39858678 4 #> 10 protein_10 0.04087050 0.2910184 -1.23024393 -2.9767406 0.41328556 4 #> avg_abundance n_approx n_obs #> 1 19.54934 4.6803838 5 #> 2 19.17398 4.3634980 4 #> 3 20.43003 6.0000000 6 #> 4 18.25093 2.0068218 2 #> 5 20.89831 6.0000000 6 #> 6 18.92039 0.7575498 0 #> 7 19.62834 3.7284625 4 #> 8 20.56572 6.0000003 6 #> 9 20.09637 6.0000000 6 #> 10 21.99620 6.0000000 6# SummarizedExperiment se <- generate_synthetic_data(n_proteins = 10, return_summarized_experiment = TRUE) se#> class: SummarizedExperiment #> dim: 10 6 #> metadata(0): #> assays(2): abundances full_observations #> rownames(10): protein_1 protein_2 ... protein_9 protein_10 #> rowData names(4): changed true_s2 true_Condition_1 true_Condition_2 #> colnames(6): Condition_1-1 Condition_1-2 ... Condition_2-2 #> Condition_2-3 #> colData names(3): group true_dropout_curve_position #> true_dropout_curve_scaleproDA(se, design = ~ group)#> Parameters of the probabilistic dropout model #> #> The dataset contains 6 samples and 10 proteins #> 23.3% of the values are missing #> #> Experimental design: y~group #> The model has successfully converged. #> #> The inferred parameters are: #> location_prior_mean: 20.3 #> location_prior_scale: 7.21 #> location_prior_df: 3 #> variance_prior_scale: 0.0793 #> variance_prior_df: 1.35 #> dropout_curve_position: 18.9, 18.3, 19.3, 18.6, ... #> dropout_curve_scale: -0.0001, -1.5, -0.0001, -0.0001, ...# Design using model.matrix() data_mat <- matrix(rnorm(5 * 10), nrow=10) colnames(data_mat) <- paste0("sample", 1:5) annotation_df <- data.frame(names = paste0("sample", 1:5), condition = c("A", "A", "A", "B", "B"), age = rnorm(5, mean=40, sd=10)) design_mat <- model.matrix(~ condition + age, data=annotation_df) design_mat#> (Intercept) conditionB age #> 1 1 0 21.30211 #> 2 1 0 44.82030 #> 3 1 0 44.56136 #> 4 1 1 36.46600 #> 5 1 1 41.70489 #> attr(,"assign") #> [1] 0 1 2 #> attr(,"contrasts") #> attr(,"contrasts")$condition #> [1] "contr.treatment" #>proDA(data_mat, design_mat, col_data = annotation_df)#> Parameters of the probabilistic dropout model #> #> The dataset contains 5 samples and 10 proteins #> 0% of the values are missing #> #> Experimental design was specified using a design matrix (design(object)). #> The model has successfully converged. #> #> The inferred parameters are: #> location_prior_mean: 0.326 #> location_prior_scale: 0.383 #> location_prior_df: 3 #> variance_prior_scale: 0.681 #> variance_prior_df: > 100 #> dropout_curve_position: NA, NA, NA, NA, ... #> dropout_curve_scale: NA, NA, NA, NA, ...