variables are collinear #41

lwaldron · 2024-06-15T11:15:59Z

This is a new warning being generated by the tests. I don't think there should actually be collinear variables here, so I'm not sure what's happening.

suppressPackageStartupMessages(library(lefser))
data("zeller14")
zellersub <- zeller14[1:150, zeller14$study_condition != "adenoma"]
## subsetting a DataFrame with NULL
zellersub <- relativeAb(zellersub)
results <- lefser(zellersub, groupCol = "study_condition", blockCol = NULL)
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> Warning in lda.default(x, grouping, ...): variables are collinear

Created on 2024-06-15 with reprex v2.1.0

lwaldron · 2024-06-15T11:46:28Z

Likely cause: the relab_sub_t_df in the call raw_lda_scores <- ldaFunction(relab_sub_t_df, lgroupf) within lefser() looks like this, when these should be relative abundances:

lwaldron · 2024-06-15T12:18:09Z

Actually, these relative abundances are correct; I forgot that they are scaled to add to 1e6. I think this is real collinearity, caused by the presence of synthetic clades (e.g. "Bacteria") where a parent node has only one child, or two children but one is dominant. Three TODOs:

1. modify tests and examples to do the analysis at species level only
2. make the warning more helpful, to explain this likely cause of collinearity and recommend analyzing a single taxonomic level only
3. modify documentation to recommend analyzing a single taxonomic level only
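
The collinearity described above can be reproduced in isolation. The following is a minimal sketch with simulated data, not the zeller14 dataset: the column names and values are made up for illustration, and genus_a stands in for a parent clade with a single child, so it carries exactly the same abundance as species_a. It calls MASS::lda directly rather than lefser.

```r
library(MASS)

set.seed(1)
n <- 40
group <- factor(rep(c("control", "CRC"), each = n / 2))

# Hypothetical abundances for two species (simulated, illustration only)
species_a <- rnorm(n, mean = 10)
species_b <- rnorm(n, mean = 5)

# A parent clade with a single child has exactly the child's abundance
genus_a <- species_a
x <- cbind(genus_a, species_a, species_b)

# Return the first warning message raised by lda(), or NA if there is none
lda_warning <- function(x, grouping) {
  tryCatch(
    {
      lda(x, grouping = grouping)
      NA_character_
    },
    warning = function(w) conditionMessage(w)
  )
}

# Parent and child together: the predictor matrix is rank deficient
warn_with_parent <- lda_warning(x, group)

# Terminal nodes only: the duplicated column is gone
warn_leaves_only <- lda_warning(x[, c("species_a", "species_b")], group)
```

Under these assumptions, dropping the parent column should be enough to remove the warning, which is the motivation for restricting the analysis to a single taxonomic level.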

shbrief · 2024-06-16T17:26:58Z

It seems like you are solving this with the get_terminal_nodes function. Is this something you'd like to add to the lefser function?

shbrief · 2024-06-16T21:14:24Z

I also want to clarify points 1 & 3: if the terminal node of an input data is a mix of strain, species, and genus, for example, what will be the recommendation?
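
To make the question concrete, here is a rough sketch of selecting terminal nodes from lineage strings. It assumes MetaPhlAn-style row names with ranks separated by "|", and the helper terminal_nodes() is hypothetical: it is not lefser's get_terminal_nodes and may differ from it. The taxa are a simplified, made-up example.

```r
# Hypothetical helper (not lefser's get_terminal_nodes): a lineage is
# terminal when no other lineage extends it with a further "|rank" element.
terminal_nodes <- function(taxa) {
  is_parent <- vapply(
    taxa,
    function(node) any(startsWith(taxa, paste0(node, "|"))),
    logical(1)
  )
  taxa[!is_parent]
}

# Simplified, made-up lineages; intermediate ranks are omitted for brevity
taxa <- c(
  "k__Bacteria",
  "k__Bacteria|p__Firmicutes",
  "k__Bacteria|p__Firmicutes|g__Blautia",
  "k__Bacteria|p__Firmicutes|g__Blautia|s__Blautia_obeum",
  "k__Bacteria|p__Bacteroidetes",
  "k__Bacteria|p__Bacteroidetes|g__Bacteroides"
)

leaves <- terminal_nodes(taxa)

# Rank prefix of the last element of each terminal lineage
leaf_rank <- sub("__.*$", "", sub("^.*\\|", "", leaves))
```

In this example the terminal nodes are one species and one genus, so selecting terminal nodes alone does not guarantee a single taxonomic level.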

shbrief · 2024-06-21T18:50:48Z

9e1afa1

sdgamboa · 2024-09-10T21:46:21Z

One of the features of lefse-conda is determining which clades are differentially abundant.
For example this result. This clade was created by the lefse program.

I think the abundance of synthetic clades (e.g., Bacteria, etc.) is necessary for the cladogram if the internal nodes are of interest. Otherwise, the cladogram could just depict which taxa in the terminal nodes are differentially abundant.

If the taxonomy is included in the rowData instead of the rownames and we restrict (recommend) the input to only terminal nodes at the same taxonomic level we could use the mia package:

library(mia)
data("GlobalPatterns")
l <- splitByRanks(GlobalPatterns)
l$features <- GlobalPatterns
mergedSE <- mergeSEs(l, collapse.cols = TRUE)
## Instead of merging, the lefse analysis could be run at all taxonomic levels individually

lwaldron added the bug label Jun 15, 2024

lwaldron added the documentation label and removed the bug label Jun 15, 2024

lwaldron self-assigned this Jun 15, 2024

This was referenced Sep 14, 2024

WIP - Cladogram plot #61

Closed

Add lefesrPlotClad function #63

Merged
