Identify differentially abundant proteins

The `test_diff()` function tests coefficients of a 'proDAFit' object. It provides a Wald test for individual coefficients and a likelihood ratio F-test to compare the original model with a reduced model. The `result_names()` method gives a quick overview of which coefficients are available for testing.

test_diff(fit, contrast, reduced_model = ~1,
  alternative = c("two.sided", "greater", "less"),
  pval_adjust_method = "BH", sort_by = NULL, decreasing = FALSE,
  n_max = Inf, verbose = FALSE)

# S4 method for proDAFit
result_names(fit)

Arguments

fit	an object of class 'proDAFit'. Usually, this is produced by calling `proDA()`
contrast	an expression or a string specifying which contrast is tested. It can be a single coefficient (to see the available options use `result_names(fit)`) or any linear combination of them. The contrast is always compared against zero. Thus, to find out if two coefficients differ use `coef1 - coef2`.
reduced_model	If you don't want to test an individual coefficient, you can specify a reduced model and compare it with the original model using an F-test. This is useful to find out how a set of parameters affects the goodness of fit. If neither a `contrast` nor a `reduced_model` is specified, a comparison with an intercept-only model (i.e. just the average across conditions) is done by default. Default: `~ 1`.
alternative	a string that decides how the hypothesis test is done. This parameter is only relevant for the Wald-test specified using the `contrast` argument. Default: `"two.sided"`
pval_adjust_method	a string that indicates the method used to adjust the p-values for multiple testing. It must match one of the options in `p.adjust`. Default: `"BH"`
sort_by	a string that specifies the column that is used to sort the resulting data.frame. Default: `NULL` which means the result is sorted by the order of the input matrix.
decreasing	a boolean to indicate if the order is reversed. Default: `FALSE`
n_max	the maximum number of rows returned by the method. Default: `Inf`
verbose	boolean that signals if the method prints informative messages. Default: `FALSE`.
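
As a rough illustration of what "a linear combination compared against zero" means for the `contrast` argument, the following base R sketch evaluates a contrast such as `Condition_1 - Condition_2` for a single protein. The coefficient values and their covariance matrix are made-up numbers, and the computation is the textbook Wald construction, which is an assumption here and not necessarily the exact code path inside `test_diff()`.

```r
# Hypothetical coefficient estimates for one protein (not real proDA output)
beta <- c(Condition_1 = 20.1, Condition_2 = 19.4)

# Hypothetical covariance matrix of those estimates
vcov_beta <- matrix(c(0.04, 0.00,
                      0.00, 0.05),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(names(beta), names(beta)))

# The contrast `Condition_1 - Condition_2` written as a weight vector
w <- c(Condition_1 = 1, Condition_2 = -1)

# Value of the linear combination, which is compared against zero
contrast_diff <- sum(w * beta)

# Standard error of the linear combination: sqrt(w' V w)
contrast_se <- sqrt(drop(t(w) %*% vcov_beta %*% w))

# Wald statistic: estimate divided by its standard error
t_statistic <- contrast_diff / contrast_se
```

Any other linear combination works the same way by changing the weights in `w`, for example `c(0.5, 0.5)` for the average of the two coefficients.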

Value

The `result_names()` function returns a character vector.

The `test_diff()` function returns a `data.frame` with one row per protein, containing the key parameters of the statistical test. Depending on the kind of test (Wald or F-test), the content of the `data.frame` differs.

The Wald test, which can be considered equivalent to a t-test, returns a `data.frame` with the following columns:

name: the name of the protein, extracted from the rowname of the input matrix
pval: the p-value of the statistical test
adj_pval: the multiple testing adjusted p-value
diff: the difference that particular coefficient makes. In differential expression analysis this value is also called log fold change, which is equivalent to the difference on the log scale.
t_statistic: the diff divided by the standard error se
se: the standard error associated with the diff
df: the degrees of freedom, which describe the amount of available information for estimating the se. They are the sum of the number of samples the protein was observed in, the amount of information contained in the missing values, and the degrees of freedom of the variance prior.
avg_abundance: the estimate of the average abundance of the protein across all samples.
n_approx: the approximated information available for estimating the protein features, expressed as multiple of the information contained in one observed value.
n_obs: the number of samples a protein was observed in
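
The relationship between these columns can be checked by hand. The sketch below copies `pval`, `diff`, `se` and `df` from the `Condition_1 - Condition_2` output in the Examples section and recomputes the derived columns with base R. The two-sided p-value formula is the standard t-test formula and is assumed, not confirmed, to be what `test_diff()` uses internally. The variable names are chosen for illustration.

```r
# Values copied from the `Condition_1 - Condition_2` example output on this page
res <- data.frame(
  name = paste0("protein_", 1:10),
  pval = c(0.63066796, 0.70057164, 0.02960657, 0.15879472, 0.82920433,
           0.81128343, 0.19201270, 0.53561684, 0.72667595, 0.01098839),
  diff = c(0.13868715, -0.09230335, 1.17063438, 0.31384244, 0.05036218,
           0.05249717, 0.23846547, -0.14429122, -0.12732876, -0.81158997),
  se   = c(0.2668286, 0.2233234, 0.3534815, 0.1814752, 0.2187423,
           0.2058556, 0.1521075, 0.2131769, 0.3395161, 0.1811515),
  df   = 4
)

# t_statistic: the diff divided by its standard error
t_statistic <- res$diff / res$se

# Two-sided p-value from a t-distribution with `df` degrees of freedom
# (assumed to correspond to alternative = "two.sided")
pval_two_sided <- 2 * pt(-abs(t_statistic), df = res$df)

# adj_pval: Benjamini-Hochberg adjustment, the default pval_adjust_method
adj_pval <- p.adjust(res$pval, method = "BH")

# Side-by-side comparison with the printed p-values
data.frame(name = res$name, t_statistic, pval_two_sided, adj_pval)
```

Because the adjustment is computed across all proteins, the `adj_pval` of one protein depends on the p-values of the others.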

The F-test returns a `data.frame` with the following columns:

name: the name of the protein, extracted from the rowname of the input matrix
pval: the p-value of the statistical test
adj_pval: the multiple testing adjusted p-value
f_statistic: the ratio of the difference between the normalized deviances of the original model and the reduced model to the standard deviation.
df1: the difference of the number of coefficients in the original model and the number of coefficients in the reduced model
df2: the degrees of freedom, which describe the amount of available information for estimating the se. They are the sum of the number of samples the protein was observed in, the amount of information contained in the missing values, and the degrees of freedom of the variance prior.
avg_abundance: the estimate of the average abundance of the protein across all samples.
n_approx: the information available for estimating the protein features, expressed as multiple of the information contained in one observed value.
n_obs: the number of samples a protein was observed in
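
To see how `f_statistic`, `df1` and `df2` combine into `pval`, the following sketch takes the first four rows of the F-test output in the Examples section and evaluates the upper tail of the F-distribution with base R. That `test_diff()` computes the p-value in this way is an assumption based on the column descriptions above, so the result is best read as a plausibility check.

```r
# First four rows copied from the F-test example output on this page
f_statistic  <- c(5.159081e-02, 2.250220e+00, 3.693300e+00, 1.256128e-02)
df1          <- 1
df2          <- c(2.926664, 8.403543, 8.645761, 9.283242)
printed_pval <- c(0.83525553, 0.17019359, 0.08813637, 0.91314840)

# Upper-tail probability of the F-distribution; df2 may be non-integer
pval <- pf(f_statistic, df1 = df1, df2 = df2, lower.tail = FALSE)

# Compare the recomputed values with the printed ones
data.frame(f_statistic, df1, df2, pval, printed_pval)
```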

Details

To test whether a coefficient is different from zero with a Wald test, use the `contrast` argument. To test whether two models differ with an F-test, use the `reduced_model` argument. Depending on the test that is conducted, the function returns slightly different data.frames.

The function is designed to follow the principles of the base R test functions (e.g. `t.test` and `wilcox.test`) and the functions designed for collecting the results of high-throughput testing (e.g. `limma::topTable` and `DESeq2::results`).
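
In the same spirit as those result-collecting functions, `sort_by`, `decreasing` and `n_max` shape the returned table. The base R sketch below approximates that observable behavior on a small hand-made data.frame (four p-values copied from the Examples section). It is an illustration written for this page, not code taken from the implementation, and the variable names are hypothetical.

```r
# Small result table with p-values copied from the example output on this page
res <- data.frame(
  name = c("protein_1", "protein_2", "protein_3", "protein_4"),
  pval = c(0.63066796, 0.70057164, 0.02960657, 0.15879472)
)

sort_by    <- "pval"   # column used for sorting
decreasing <- FALSE    # smallest values first
n_max      <- 3        # keep at most this many rows

# Sort by the chosen column, then truncate to n_max rows
ord <- order(res[[sort_by]], decreasing = decreasing)
top <- res[ord, , drop = FALSE]
top <- top[seq_len(min(n_max, nrow(top))), , drop = FALSE]
top
```

The original row names are kept after sorting, which matches the non-consecutive row labels in the `n_max = 3, sort_by = "pval"` example output.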

Examples

  # "t-test"
  syn_data <- generate_synthetic_data(n_proteins = 10)
  fit <- proDA(syn_data$Y, design = syn_data$groups)
  result_names(fit)
#> [1] "Condition_1" "Condition_2"
  test_diff(fit, Condition_1 - Condition_2)
#>          name       pval  adj_pval        diff t_statistic        se df
#> 1   protein_1 0.63066796 0.8292043  0.13868715   0.5197612 0.2668286  4
#> 2   protein_2 0.70057164 0.8292043 -0.09230335  -0.4133169 0.2233234  4
#> 3   protein_3 0.02960657 0.1480328  1.17063438   3.3117272 0.3534815  4
#> 4   protein_4 0.15879472 0.4800317  0.31384244   1.7293959 0.1814752  4
#> 5   protein_5 0.82920433 0.8292043  0.05036218   0.2302352 0.2187423  4
#> 6   protein_6 0.81128343 0.8292043  0.05249717   0.2550194 0.2058556  4
#> 7   protein_7 0.19201270 0.4800317  0.23846547   1.5677433 0.1521075  4
#> 8   protein_8 0.53561684 0.8292043 -0.14429122  -0.6768613 0.2131769  4
#> 9   protein_9 0.72667595 0.8292043 -0.12732876  -0.3750301 0.3395161  4
#> 10 protein_10 0.01098839 0.1098839 -0.81158997  -4.4801736 0.1811515  4
#>    avg_abundance n_approx n_obs
#> 1       18.22209 3.003094     3
#> 2       20.00984 3.989574     4
#> 3       17.52142 1.363857     1
#> 4       21.28096 6.000000     6
#> 5       21.21086 4.977908     5
#> 6       19.59506 5.345718     5
#> 7       23.08283 6.000000     6
#> 8       19.06041 4.267228     4
#> 9       20.00061 4.983345     5
#> 10      23.52646 6.000000     6


  suppressPackageStartupMessages(library(SummarizedExperiment))
  se <- generate_synthetic_data(n_proteins = 10,
                                n_conditions = 3,
                                return_summarized_experiment = TRUE)
  colData(se)$age <- rnorm(9, mean=45, sd=5)
  colData(se)
#> DataFrame with 9 rows and 4 columns
#>                     group true_dropout_curve_position true_dropout_curve_scale
#>                  <factor>                   <numeric>                <numeric>
#> Condition_1-1 Condition_1                        18.5                     -1.2
#> Condition_1-2 Condition_1                        18.5                     -1.2
#> Condition_1-3 Condition_1                        18.5                     -1.2
#> Condition_2-1 Condition_2                        18.5                     -1.2
#> Condition_2-2 Condition_2                        18.5                     -1.2
#> Condition_2-3 Condition_2                        18.5                     -1.2
#> Condition_3-1 Condition_3                        18.5                     -1.2
#> Condition_3-2 Condition_3                        18.5                     -1.2
#> Condition_3-3 Condition_3                        18.5                     -1.2
#>                            age
#>                      <numeric>
#> Condition_1-1 45.4767700483905
#> Condition_1-2 42.6859029002183
#> Condition_1-3 37.6555892195272
#> Condition_2-1 45.7634325276076
#> Condition_2-2 53.8688130565859
#> Condition_2-3 41.7596453324248
#> Condition_3-1  44.000912621659
#> Condition_3-2 48.4462186648859
#> Condition_3-3  45.180727549183
  fit <- proDA(se, design = ~ group + age)
  result_names(fit)
#> [1] "Intercept"        "groupCondition_2" "groupCondition_3" "age"             
  test_diff(fit, "groupCondition_2",
            n_max = 3, sort_by = "pval")
#>          name       pval  adj_pval     diff t_statistic        se df
#> 1   protein_1 0.05766252 0.3596771 1.186875    2.453854 0.4836778  5
#> 10 protein_10 0.10541452 0.3596771 1.063993    1.973770 0.5390665  5
#> 3   protein_3 0.10790314 0.3596771 1.296400    1.955533 0.6629396  5
#>    avg_abundance n_approx n_obs
#> 1       18.20726 2.643421     1
#> 10      18.11640 4.664379     4
#> 3       20.20987 8.362519     8

  # F-test
  test_diff(fit, reduced_model = ~ group)
#>          name       pval  adj_pval  f_statistic df1      df2 avg_abundance
#> 1   protein_1 0.83525553 0.9985119 5.159081e-02   1 2.926664      18.20726
#> 2   protein_2 0.17019359 0.8509679 2.250220e+00   1 8.403543      20.07761
#> 3   protein_3 0.08813637 0.8509679 3.693300e+00   1 8.645761      20.20987
#> 4   protein_4 0.91314840 0.9985119 1.256128e-02   1 9.283242      23.16640
#> 5   protein_5 0.63759269 0.9985119 2.371096e-01   1 9.283242      21.20843
#> 6   protein_6 0.99851187 0.9985119 3.670676e-06   1 9.283242      22.53936
#> 7   protein_7 0.41659423 0.9985119 7.228662e-01   1 9.283242      20.41743
#> 8   protein_8 0.93818225 0.9985119 7.179518e-03   1 2.790716      18.39177
#> 9   protein_9 0.65665475 0.9985119 2.190582e-01   1 5.861225      19.13118
#> 10 protein_10 0.31029173 0.9985119 1.276726e+00   1 4.947621      18.11640
#>    n_approx n_obs
#> 1  2.643421     1
#> 2  8.120301     8
#> 3  8.362519     8
#> 4  9.000000     9
#> 5  9.000000     9
#> 6  9.000000     9
#> 7  9.000000     9
#> 8  2.507474     1
#> 9  5.577983     5
#> 10 4.664379     4
