Question - Is there a way to find variables for smooth components in mgcv::gam? #553

Open
stefanocoretta opened this issue Apr 15, 2022 · 18 comments · Fixed by #573

@stefanocoretta

Hello! Thanks for this package, it's so great!

I have a question about GAMs with mgcv.

I wonder if there is a function to programmatically find variables based on smooth term strings (without having to regex the string).

For example:

library(mgcv)
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)

There are four smooth terms, and I would like to extract the variables in each term, so that, for example, from "s(x0)" I get "x0", and so on. In principle a regex would work, but smooth specifications can get so complicated that it is a puzzle to make sure you really get the variables.

Is this possible with insight?

@IndrajeetPatil
Member

Hi, yes, this is possible!

You can use the following function:

library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data...
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
#> Gu & Wahba 4 term additive model
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

library(insight)
find_variables(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0" "x1" "x2" "x3"

Created on 2022-04-15 by the reprex package (v2.0.1)

@IndrajeetPatil
Member

Have a look at the docs to see additional customizations you can do with it:
https://easystats.github.io/insight/reference/find_variables.html

@IndrajeetPatil IndrajeetPatil added the Question ⁉️ Further information is requested label Apr 15, 2022
@strengejacke
Member

strengejacke commented Apr 15, 2022

And also find_smooth().

@strengejacke
Member

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3,data=dat)

find_variables(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0" "x1" "x2" "x3"

find_smooth(b)
#> $smooth_terms
#> [1] "s(x1)" "s(x2)"

find_terms(b)
#> $response
#> [1] "y"
#> 
#> $conditional
#> [1] "x0"    "s(x1)" "s(x2)" "x3"

Created on 2022-04-15 by the reprex package (v2.0.1)

@stefanocoretta
Author

Hi! I am not sure that answers my question.

What I am trying to achieve is returning the variables inside the smooths after finding the smooths.

Pseudo-code example:

b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1, x3),data=dat)

smooths <- find_smooth(b)
smooths
#> $smooth_terms
#> [1] "s(x1)" "s(x2)" "s(x1, x3)"

find_vars_from_smooth(smooths)
#> $`s(x1)`
#> [1] "x1"
#>
#> $`s(x2)`
#> [1] "x2"
#>
#> $`s(x1, x3)`
#> [1] "x1" "x3"

@strengejacke
Member

ok, then just use clean_names() on the output of find_smooth():

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3,data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2"

Created on 2022-04-16 by the reprex package (v2.0.1)

@stefanocoretta
Author

Unfortunately, it doesn't work correctly:

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
set.seed(2) ## simulate some data... 
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2),data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1"

The third smooth should return c("x1", "x2"). I have not tried it with a by argument, but I assume it would not work correctly either.

@etiennebacher
Member

@stefanocoretta It should work now:

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.

set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model
b <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2), data=dat)

find_smooth(b, flatten = TRUE) |> clean_names()
#> [1] "x1"     "x2"     "x1, x2"

d <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,x2, k = -1), data=dat)

find_smooth(d, flatten = TRUE) |> clean_names()
#> [1] "x1"     "x2"     "x1, x2"

Created on 2022-06-07 by the reprex package (v2.0.1)

@strengejacke
Member

I'm not super-familiar with smooth terms (I think @DominiqueMakowski started using them some time ago), but when is it important to include a variable? E.g. here, should the last line return #> [1] "x1" "x2" "x1" or #> [1] "x1" "x2" "x1, x2"?

library(insight)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.

set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)
#> Gu & Wahba 4 term additive model

d <- gam(y~x0+s(x1)+s(x2)+x3+s(x1,by = x2, k = -1), data=dat)

find_smooth(d, flatten = TRUE)
#> [1] "s(x1)"                  "s(x2)"                  "s(x1, by = x2, k = -1)"

find_smooth(d, flatten = TRUE) |> clean_names()
#> [1] "x1" "x2" "x1"

Created on 2022-06-07 by the reprex package (v2.0.1)

@strengejacke strengejacke reopened this Jun 7, 2022
@DominiqueMakowski
Member

Hmm, I am not sure what the expected output is in this case. The last line should probably return "x1, x2" or "x1:x2" or something like that.

@IndrajeetPatil
Member

Looks like none of us are sure about this.

Is there anyone on the team who is an expert in GAMs?
If not, we can also outsource this to Twitter, where we do know some GAM experts.

@stefanocoretta
Author

Hi! It should return all variables in all cases. And the variables should be different elements.

These are some of the possible scenarios:

s(time)
s(longitude, latitude)
s(longitude, latitude, altitude)
s(time, by = factor)
s(time, duration, by = factor)
s(time, factor, bs = "fs")
s(factor, bs = "re")
s(factor, time, bs = "re")

Each of those should return:

"time"
c("longitude", "latitude")
c("longitude", "latitude", "altitude")
c("time", "factor")
c("time", "duration", "factor")
c("time", "factor")
"factor"
c("factor", "time")

That is the necessary format for the variables to be used in predict.gam().
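
As an illustration of the mapping listed above, here is a minimal base R sketch. It assumes the smooth terms are available as strings (for example from find_smooth(b, flatten = TRUE)) and that each string is valid R code. The helper name smooth_vars() is hypothetical and is not part of insight or mgcv. One caveat: any symbol passed to another argument (say k = K) would also be returned.

```r
# Hypothetical helper (not part of insight or mgcv): parse each smooth term
# string as R code and collect the variable names it references.
# all.vars() skips function names (s) and argument names (by, bs, k).
smooth_vars <- function(terms) {
  lapply(stats::setNames(terms, terms), function(term) {
    all.vars(parse(text = term, keep.source = FALSE)[[1]])
  })
}

terms <- c(
  "s(time)",
  "s(longitude, latitude)",
  "s(time, by = factor)",
  "s(time, factor, bs = \"fs\")",
  "s(factor, bs = \"re\")"
)

# Named list: one character vector of variables per smooth term
vars <- smooth_vars(terms)
print(vars)
```

Because the result is a named list, a term with several variables keeps them as separate elements instead of one comma-separated string.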

@etiennebacher
Member

@stefanocoretta Since clean_names() returns a character vector, it can only return e.g. "time, factor" and not c("time", "factor"). The only way to return c("time", "factor") would be to change clean_names() to output a list instead of a character vector, which would break existing code that uses clean_names().

@etiennebacher
Member

@stefanocoretta There's an example of output in #580

@stefanocoretta
Author

It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.

To use the output further, I would have to split it on commas, which is OK, although a bit of a hack.

But if that means rewriting the code to accept lists, then your current solution will just do! 😄
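
The comma-splitting workaround mentioned above could look roughly like this in base R. The input vectors are typed in by hand to mirror the outputs shown earlier in the thread rather than computed from a model, and the object names are illustrative assumptions.

```r
# Comma-separated output as shown earlier in the thread (typed in by hand)
cleaned <- c("x1", "x2", "x1, x2")
smooth_labels <- c("s(x1)", "s(x2)", "s(x1, x2)")

# Split each element on commas (plus optional whitespace) and name the
# resulting list by the smooth term it came from
split_vars <- stats::setNames(strsplit(cleaned, ",\\s*"), smooth_labels)
print(split_vars)
```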

@etiennebacher
Member

It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.

But then there could be duplicates if there are several calls to s() in the formula, right? For example, what is the output you would expect for this?

d <- gam(y~s(x1)+s(x2)+s(x1,by = x2, k = -1), data=dat)
find_smooth(d, flatten = TRUE) |> clean_names()

@strengejacke
Member

Maybe we could return a character vector in clean_names(), instead of a comma-separated char element. Then it's up to the user to do something like

sapply(insight::find_smooth(d, flatten = TRUE), insight::clean_names, simplify = FALSE)

which will give the information @stefanocoretta requested: a named list (named by smooth term) whose elements are the variables used.

@stefanocoretta
Author

It might do, although it's a bit inelegant, because technically the s() term can have more than one variable, and I would expect clean_names() to return those individually.

But then there could be duplicates if there are several calls to s() in the formula, right? For example, what is the output you would expect for this?

d <- gam(y~s(x1)+s(x2)+s(x1,by = x2, k = -1), data=dat)
find_smooth(d, flatten = TRUE) |> clean_names()

Correct, they should be duplicated, because to predict you need to know which smooths have which variables (especially when excluding terms while predicting). The mgcv implementation of GAMs is a bit different in structure from most other models.

So ideally I would expect: [1] "x1" "x2" "x1, x2". Note that often, when a factor is included as a by-variable, it is also included as a parametric effect. For example:

gam(y ~ fac + s(x) + s(x, by = fac))
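
For comparison, the fitted mgcv object itself records which variables each smooth uses. The sketch below assumes the documented fields of mgcv smooth objects: term (covariate names), by (the by-variable name, or the string "NA" if there is none), and label. It reuses the simulated data from earlier in the thread and is only an illustration, not how insight implements this.

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)
d <- gam(y ~ x0 + s(x1) + s(x2) + x3 + s(x1, by = x2), data = dat)

# One entry per smooth: its covariates plus the by-variable, if any.
# mgcv stores a missing by-variable as the string "NA".
smooth_vars <- lapply(d$smooth, function(sm) {
  by_var <- if (identical(sm$by, "NA")) character(0) else sm$by
  c(sm$term, by_var)
})

# Name each entry by mgcv's own smooth label
names(smooth_vars) <- vapply(d$smooth, function(sm) sm$label, character(1))
print(smooth_vars)
```

The labels are the ones predict.gam() matches against in its terms and exclude arguments, so a list like this ties each excludable smooth to the variables it needs.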
