Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use skipmissing instead of drop in fill? #50

Closed
nickrobinson251 opened this issue Aug 29, 2019 · 6 comments
Closed

Use skipmissing instead of drop in fill? #50

nickrobinson251 opened this issue Aug 29, 2019 · 6 comments
Milestone

Comments

@nickrobinson251
Copy link
Contributor

No description provided.

@rofinn
Copy link
Member

rofinn commented Aug 29, 2019

skipmissing wouldn't work if we change the missingness function (e.g., isnan, x -> x == 999999). Many other datasets use different sentinel values.

@nickrobinson251
Copy link
Contributor Author

ah, i didn't realise it was a goal of the package to impute non-missing values. If that's not documented, perhaps it'd be worth adding somewhere?

Feel free to close this :)

@rofinn
Copy link
Member

rofinn commented Aug 29, 2019

It's only really documented for the Context type and isn't currently used in any examples. I'll leave this open till that's added. If we move in the direction of having an Impute.Iterators module then the behaviour of Impute.Iterators.drop and skipmissing should become almost identical in the base case. Might be a good thing to test against though :)

@nickrobinson251
Copy link
Contributor Author

skipmissing wouldn't work if we change the missingness function

The more I think about this, the less I like.

Julia gives us missing, which is a monumentally useful thing, and I think it is reasonable to build for that case. If users need to replace(X, 999999 => missing) then they can do that before imputing

@rofinn
Copy link
Member

rofinn commented Sep 2, 2019

Agreed. That’s why I’d like to move the current behaviour to an iterators module and default to using a multipass approach with an Impute.Dataset type. I'll note that most of these design decisions were made when Missing and Nullable we’re both things, which is less relevant now that julia provides missing by default.

A couple notes on how I think this should exist in the Impute.Iters API.

  1. I feel like if it isn't hard to continue supporting arbitrary missingness functions I don't see a reason not to support it in the iterators interface as it allows you to perform these operations in a single pass through the data rather than requiring multiple passes.
  2. In the case of fill, it probably shouldn't be applying a function over all of the non-missing data in the interator interface and should instead be using something like an OnlineStat if a single pass is the desire behaviour. If you're willing to do multiple passes then just manually create an Impute.Dataset type with a custom mask.

This was referenced Sep 25, 2020
@rofinn rofinn added this to the 0.6 milestone Oct 10, 2020
@rofinn
Copy link
Member

rofinn commented Oct 16, 2020

That's exactly what's happing in the new Impute.substitute call introduced in #69

https://github.com/invenia/Impute.jl/blob/master/src/imputors/substitute.jl#L51

@rofinn rofinn closed this as completed Oct 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants