Loosening the branching constraint #3
Comments
Oh, but reusing the checkpoint is precisely one of the main goals of the branching structure. How do you save your checkpoint files? You should use Kleio to do that, so you can seamlessly load a branched trial and work on it as if it were a single trial from time 0 to T.

For example, say you saved your checkpoint like this:

```python
import io
from kleio.client.logger import kleio_logger
import torch

file_like_object = io.BytesIO()
torch.save(dict(model=model.state_dict(), optimizer=optimizer.state_dict()),
           file_like_object)
file_like_object.seek(0)

# Here epoch and iteration are just meta-data saved with the checkpoint to help
# querying later. You can add any key that will be saved as meta-data, as long as
# it is lightweight and serializable by bson. The large object is `file_like_object`.
kleio_logger.log_artifact(filename, file_like_object, epoch=epoch, iteration=iteration)
```

This means the file used for the checkpoint is indexed in the trial as an artifact, and you can later access it using Kleio. It also means that any trial branched from this one will inherit this artifact, provided it is branched after the artifact was logged (you can branch a trial at any time, not just at the end of execution).

Say you now execute it with a different environment, using:

```python
artifacts = kleio_logger.load_artifacts(filename, {})

# Iterate on it to get the last artifact without explicitly building a list
artifact = None
for artifact in artifacts:
    continue

if artifact is None:
    print("No artifacts found")
    return None

# epoch, iteration and other keys are fetched in metadata here
file_like_object, metadata = artifact

# Use `.download()` to actually read the local file (or download from mongodb.gridfs)
state_dict = torch.load(file_like_object.download())
model.load_state_dict(state_dict['model'])
optimizer.load_state_dict(state_dict['optimizer'])

print('Resuming from epoch: {}'.format(metadata['epoch']))
```

Note that if [...]. One nice thing about that is that you could, for instance, revert your code and run [...]
Oh I see! You already take care of this issue using artifacts! Ok, thanks, I'll try this out and get back to you.
Hi,
I have a question related to how branching works here, and whether it is possible to play around with this constraint. I'll give an example.
Suppose I have some experiment `experiment1`, with the hash `aaabbb`. When I run this experiment, it will periodically dump checkpoints and output files to, say, `$ROOT/aaabbb/results`. Suppose I have trained this experiment for 50 epochs already (and therefore the results folder contains a checkpoint file for epoch 50).

Now suppose in the next few days I decide to commit various changes to parts of the code. These code changes could be things which impact very little / nothing in experiment `aaabbb` (e.g. suppose I add new files irrelevant to that experiment, or change the way metrics are logged). If I decide to resume experiment `aaabbb` and train it for another 50 epochs, it won't allow me to, since the environment has changed. If I decide to resume `aaabbb` using something like `--allow-any-change`, then it will simply branch off that experiment and start over again (suppose the branched experiment is called `cccddd`). This means that the experiment will actually be run again from scratch, since `cccddd` is definitely not going to inherit the checkpoint file from `aaabbb`.
This leaves us with a few options:

1. Allow resuming `aaabbb` without branching (but maybe, for peace of mind, somehow let the user know that these are 'dirty' experiments, so that they can separate them from the truly reproducible experiments ;) ).
2. Manage the checkpoint files for `aaabbb` outside of using `kleio`. But this means you'll have to account for the fact that `kleio` sets the working directory to be `$ROOT/<id>` instead of `$ROOT`. (I don't really like this one.)
3. Copy the contents of `$ROOT/<id>` over to `$ROOT/<id_branched>`, though this might use a lot of disk space if `$ROOT/<id>` already has a ton of stuff in it (like big checkpoint files).
4. Point the branched experiment at the old checkpoint explicitly, with something like `--resume_from=$ROOT/aaabbb/epoch_50.pkl` (roughly sketched below). Maybe this is a reasonable one? I assume this is already possible.

Though (1) would definitely be the most convenient for me (I may not always want to spread my experiment over multiple branches).

Thanks!
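For what it's worth, here is a minimal sketch of what option (4) could look like on the script side, assuming a hypothetical `--resume_from` flag handled entirely by the training script. The flag name, the placeholder model/optimizer, and the checkpoint format are illustrative assumptions, not part of `kleio`.

```python
# Hypothetical sketch of option (4): the training script takes an explicit
# checkpoint path and restores from it, independently of kleio's branching.
import argparse

import torch
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("--resume_from", default=None,
                    help="explicit checkpoint path, e.g. $ROOT/aaabbb/epoch_50.pkl")
args = parser.parse_args()

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # placeholder optimizer

start_epoch = 0
if args.resume_from is not None:
    # Assumes the checkpoint is a dict holding model/optimizer state dicts
    # (and optionally the epoch it corresponds to).
    state = torch.load(args.resume_from)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state.get("epoch", 0) + 1

for epoch in range(start_epoch, 100):
    pass  # placeholder for one epoch of training
```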