You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current implementation has some nice features for handling iterative data and provides early exit conditions. Unfortunately, these features are harder to maintain as we need to handle more use cases and different data structures. A couple of examples of this include:
AbstractContext: while this concept is nice for handling early exits and iterative threshold checks. It's also a bit cumbersome and complicates adding new imputation methods. Also, implementing a new AbstractContext type isn't entirely intuitive.
in-place: This kind of imputation doesn't really make sense for things like iterators over tables. Perhaps it'd be best to simply remove that option altogether... or limit it to arrays?
missing: It's a pretty ubiquitous in Julia now, so maybe we should just expect people to replace it before imputing?
dims: We should explicitly use this to apply an imputation method along a dimension. Maybe we can special-case symbols :rows/:cols for processing a columntable or a rowtable? I suppose we could expect folks to explicitly pass in a columntable or rowtable, but that seems a little unfriendly from a usability standpoint.
Proposed Changes
Drop AbstractContext and maybe replace it with some or all of the below: - Impute.replace!: which will handle the allowmissing call and could support replacing values in multiple columns at once. - Impute.assert?: if you want to throw an error if some missing data threshold is reached [trivial in most cases] - Impute.mask?: will just give you a binary mask over your input data [trivial in most cases]
Add an Impute.filter option which will filter observations base on some threshold. Along a dims would probably be more general. This is also probably more general than dropobs and dropvars?
Drop public API support for in-place imputation rather than having undefined behaviour
Add a Tables.jl like interface for describing whether imputation methods support vectors, matrices, tables or arbitrary iterators of some eltype. This would help us produce better error messages when someone uses an invalid imputation methods rather than giving a long traceback to a rather opaque method error.
Add a MCAR test check
Add a data generation module which would be helpful for tests, but could also be used for method comparisons.
Out of scope
Handling various missing values
Handling non-standard data types like intervals of zoned datetimes. We're gonna limit ourselves to standard datatype like ints, floats, bools, chars and strings. Anything else should have reasonable fallbacks, but if you want to have a timeseries specific algorithm that cares about zoned datetimes then that should be done in a different package.
Most of the standard use-cases we've been dealing with don't include streaming data, so if you want an iteration based approach that might be best left to a separate package with different/simpler imputation methods.
Success Conditions
Easier to write new imputation methods
No undefined behaviour
More composable API
Shouldn't be slower than existing tooling (e.g., replace, filter)
Failure Conditions
It isn't much easier to write new imputation methods (e.g., similar lines of code, about as readable) and we've also gained the trade-offs below
Trade-offs are severe in even the simplest of cases. Existing test cases should be used as benchmarks for checking this.
Most operations can be replace with an existing method
Error conditions are harder to debug
Trade-offs
Not handling different missing values in a context will involve an extra step (pass)
Not handling missing thresholds in a context will involve yet another extra step (pass)
Our move away from arbitrary iterators may limit usage and makes it easier to write multi-pass algorithms that may be necesary, but slower
Dropping attempts at in-place methods may impose an added allocation penalty for some users
Ideally we would standardise how to handle dims and AxisSets.Patterns across Impute.jl, FeatureTransforms.jl, and AxisSets.jl, in which the former two are both supported.
I think the main difference in handling is that FeatureTransforms.jl supports dims=:, which means apply a transform element-wise over an array.
We should also use a consistent convention for what e.g. dims=1 means (outdated but relevant issue: invenia/FeatureTransforms.jl#18)
Overview
The current implementation has some nice features for handling iterative data and provides early exit conditions. Unfortunately, these features are harder to maintain as we need to handle more use cases and different data structures. A couple of examples of this include:
AbstractContext
type isn't entirely intuitive.:rows
/:cols
for processing a columntable or a rowtable? I suppose we could expect folks to explicitly pass in acolumntable
orrowtable
, but that seems a little unfriendly from a usability standpoint.Proposed Changes
DropAbstractContext
and maybe replace it with some or all of the below:-Impute.replace!
: which will handle theallowmissing
call and could support replacing values in multiple columns at once.-Impute.assert
?: if you want to throw an error if some missing data threshold is reached [trivial in most cases]-Impute.mask
?: will just give you a binary mask over your input data [trivial in most cases]Add anImpute.filter
option which will filter observations base on some threshold. Along adims
would probably be more general. This is also probably more general thandropobs
anddropvars
?Out of scope
Success Conditions
replace
,filter
)Failure Conditions
Trade-offs
Related Issues & PRs
skipmissing
instead ofdrop
infill
? #50fill
withvalue=f::Function
isn't comfortable with entirelymissing
data #51The text was updated successfully, but these errors were encountered: