Question - Is there a way to find variables for smooth components in mgcv::gam? #553

stefanocoretta · 2022-04-15T15:30:37Z

Hello! Thanks for this package, it's so great!

I have a question about GAMs with mgcv.

I wonder if there is a function to programmatically find variables based on smooth term strings (without having to regex the string).

For example:

library(mgcv)
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)

There are four smooth terms, and I would like to extract the variables in each term, so that, for example, from "s(x0)" I get "x0", and so on. (In principle a regex would work, but smooth specifications can get so complicated that it is a bit of a puzzle to make sure you actually get the variable.)

Is this possible with insight?

IndrajeetPatil · 2022-04-15T15:40:14Z

Hi, yes, this is possible!

You can use the following function:

library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data...
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
#> Gu & Wahba 4 term additive model
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

library(insight)
find_variables(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0" "x1" "x2" "x3"

Created on 2022-04-15 by the reprex package (v2.0.1)

IndrajeetPatil · 2022-04-15T15:41:02Z

Have a look at the docs to see additional customizations you can do with it:
https://easystats.github.io/insight/reference/find_variables.html

strengejacke · 2022-04-15T17:47:17Z

And also find_smooth().

strengejacke · 2022-04-15T17:49:15Z

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3,data=dat)

find_variables(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0" "x1" "x2" "x3"

find_smooth(b)
#> $smooth_terms
#> [1] "s(x1)" "s(x2)"

find_terms(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0"    "s(x1)" "s(x2)" "x3"

Created on 2022-04-15 by the reprex package (v2.0.1)

stefanocoretta · 2022-04-16T08:40:34Z

Hi! I am not sure that answers my question.

What I am trying to achieve is returning the variables inside the smooths after finding the smooths.

Pseudo-code example:

b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1, x3),data=dat)

smooths <- find_smooth(b)
smooths
#> $smooth_terms
#> [1] "s(x1)" "s(x2)" "s(x1, x3)"

find_vars_from_smooth(smooths)
#> $`s(x1)`
#> [1] "x1"
#>
#> $`s(x2)`
#> [1] "x2"
#>
#> $`s(x1, x3)`
#> [1] "x1" "x3"

strengejacke · 2022-04-16T08:55:49Z

ok, then just use clean_names() on the output of find_smooth():

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3,data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2"

Created on 2022-04-16 by the reprex package (v2.0.1)

stefanocoretta · 2022-04-16T09:53:25Z

Unfortunately, it doesn't work correctly:

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2),data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1"

The third smooth should return c("x1", "x2"). I have not tried it with a by argument, but I assume it would not work correctly either.

etiennebacher · 2022-06-07T19:11:49Z

@stefanocoretta It should work now:

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.

set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2), data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1"     "x2"     "x1, x2"

d <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2, k = -1), data=dat)

find_smooth(d, flatten = TRUE) |> clean_names()
#> [1] "x1"     "x2"     "x1, x2"

Created on 2022-06-07 by the reprex package (v2.0.1)

strengejacke · 2022-06-07T19:51:36Z

I'm not very familiar with smooth terms (I think @DominiqueMakowski started using them some time ago), but when is it important to include a variable? E.g., here, should the last line return #> [1] "x1" "x2" "x1" or #> [1] "x1" "x2" "x1, x2"?

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.

set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model

d <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,by = x2, k = -1), data=dat)

find_smooth(d, flatten = TRUE)
#> [1] "s(x1)"                  "s(x2)"                  "s(x1, by = x2, k = -1)"

find_smooth(d, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1"

Created on 2022-06-07 by the reprex package (v2.0.1)

DominiqueMakowski · 2022-06-08T07:52:51Z

Hmm, I am not sure what the expected output is in this case. The last line should probably return "x1, x2" or "x1:x2" or something like that.

IndrajeetPatil · 2022-06-08T08:06:31Z

Looks like none of us are sure about this.

Is there anyone on the team who is an expert in GAMs?
If not, we can also outsource this to Twitter, where we do know some GAM experts.

stefanocoretta · 2022-06-08T08:25:41Z

Hi! It should return all variables in all cases. And the variables should be different elements.

These are some of the possible scenarios:

s(time)
s(longitude, latitude)
s(longitude, latitude, altitude)
s(time, by = factor)
s(time, duration, by = factor)
s(time, factor, bs = "fs")
s(factor, bs = "re")
s(factor, time, bs = "re")

Each of those should return:

"time"
c("longitude", "latitude")
c("longitude", "latitude", "altitude")
c("time", "factor")
c("time", "duration", "factor")
c("time", "factor")
"factor"
c("factor", "time")

That is the necessary format for the variables to be used in predict.gam().
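
The mapping listed above can be approximated in base R without a regex, by parsing each smooth term string into a call and collecting the symbols it uses. This is only a sketch: `smooth_vars()` is a hypothetical helper, not a function from insight or mgcv, and it assumes that every bare symbol inside `s()` is a data variable. A term such as `s(x, k = n_knots)` would therefore also report `n_knots`.

```r
# Hypothetical helper (not part of insight or mgcv): list the symbols used
# inside each smooth term string. String arguments such as bs = "re" and
# numeric constants such as k = -1 are not symbols, so they are ignored.
smooth_vars <- function(terms) {
  out <- lapply(terms, function(term) all.vars(str2lang(term)))
  names(out) <- terms
  out
}

terms <- c(
  "s(time)",
  "s(longitude, latitude)",
  "s(time, duration, by = factor)",
  "s(time, factor, bs = \"fs\")",
  "s(factor, time, bs = \"re\")",
  "s(x1, by = x2, k = -1)"
)

res <- smooth_vars(terms)
res[["s(time, duration, by = factor)"]]
res[["s(x1, by = x2, k = -1)"]]
```

The result is a named list with one character vector per smooth, so variables shared between smooths are repeated under each term rather than collapsed.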

etiennebacher · 2022-06-08T14:00:55Z

@stefanocoretta Since clean_names() returns a character vector, it will only be possible to return e.g. "time, factor" and not c("time", "factor"). The only way to return c("time", "factor") would be to change clean_names() to output a list instead of a character vector, which would break existing code that uses clean_names().

etiennebacher · 2022-06-08T16:26:02Z

@stefanocoretta There's an example of output in #580

stefanocoretta · 2022-06-08T19:47:49Z

It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.

To use the output further, I would have to split it on the comma, which is OK, although a bit of a hack.

But if that means rewriting the code to accept lists, then your current solution will just do! 😄
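
The splitting step mentioned above could look roughly like this in base R. The input vectors are copied from the output shown earlier in the thread for the model with `s(x1, x2)`; the object names are only illustrative.

```r
# Output of find_smooth(b, flatten = TRUE) and of clean_names() on it,
# copied from the reprex earlier in the thread.
smooths <- c("s(x1)", "s(x2)", "s(x1, x2)")
cleaned <- c("x1", "x2", "x1, x2")

# Split each comma-separated element into separate variable names and
# label the pieces with the smooth term they came from.
vars <- strsplit(cleaned, ",\\s*")
names(vars) <- smooths
vars[["s(x1, x2)"]]
```

This assumes variable names never contain a comma themselves, which holds for syntactic R names but not necessarily for backtick-quoted ones.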

etiennebacher · 2022-06-08T20:11:09Z

> It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.

But then there could be duplicates if there are several calls to s() in the formula, right? For example, what output would you expect for this?

d <- gam(y~s(x1)+s(x2)+s(x1,by = x2, k = -1), data=dat)
find_smooth(d, flatten = TRUE) |> clean_names()

strengejacke · 2022-06-08T20:24:24Z

Maybe we could return a character vector in clean_names(), instead of a comma-separated char element. Then it's up to the user to do something like

sapply(insight::find_smooth(d, flatten = TRUE), insight::clean_names, simplify = FALSE)

which will give the information @stefanocoretta requested: a named list (named after the smooth terms) whose elements are the variables used.

stefanocoretta · 2022-06-10T15:56:02Z

>> It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.
>
> But then there could be duplicates if there are several calls to s() in the formula, right? For example, what output would you expect for this?
>
> d <- gam(y~s(x1)+s(x2)+s(x1,by = x2, k = -1), data=dat)
> find_smooth(d, flatten = TRUE) |> clean_names()

Correct, they should be duplicated, because to predict you need to know which smooths use which variables (especially when excluding terms while predicting). The mgcv implementation of GAMs is a bit different in structure from most other models.

So ideally I would expect: [1] "x1" "x2" "x1, x2". Note that often, when a factor is included as a by-variable, it is also included as a parametric effect. For example:

gam(y ~ fac + s(x) + s(x, by = fac))
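
For the prediction use case described above, the fitted mgcv object itself also records which variables each smooth uses. The sketch below reads the `term`, `by`, and `label` components of the entries in the model's `smooth` list, using the model from earlier in the thread. It bypasses insight entirely and is offered only as an illustration, not as what `clean_names()` does.

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
d <- gam(y ~ x0 + s(x1) + s(x2) + x3 + s(x1, by = x2, k = -1), data = dat)

# Each element of d$smooth describes one fitted smooth: `term` holds the
# covariate names and `by` holds the by-variable name, stored as the
# string "NA" when the smooth has no by-variable.
vars_by_smooth <- lapply(d$smooth, function(sm) {
  c(sm$term, if (sm$by != "NA") sm$by)
})

# Name the list with mgcv's own smooth labels.
names(vars_by_smooth) <- vapply(d$smooth, function(sm) sm$label, character(1))
vars_by_smooth
```

Note that with a factor by-variable mgcv fits one smooth per factor level, so the list would then have one entry per level rather than one per `s()` call in the formula.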

IndrajeetPatil closed this as completed Apr 15, 2022

IndrajeetPatil added the Question ⁉️ Further information is requested label Apr 15, 2022

IndrajeetPatil reopened this Apr 16, 2022

etiennebacher mentioned this issue Jun 1, 2022

clean_names: correctly extract several variables from mgcv::s() #573

Merged

etiennebacher closed this as completed in #573 Jun 7, 2022

strengejacke reopened this Jun 7, 2022

etiennebacher mentioned this issue Jun 8, 2022

fix clean_names() for mgcv::s() and gam::s() #580

Closed
