[feature] Expose argo synchronization options #6553

jmcarp · 2021-09-13T02:13:55Z

Feature Area

/area sdk

What feature would you like to see?

Argo includes useful synchronization features for workflows and templates: https://argoproj.github.io/argo-workflows/synchronization/. It would be helpful to expose these in kfp without having to manipulate manifests as yaml.

I'd be happy to work on this if the request makes sense. It seems like workflow and template level synchronization options would live on the pipeline config and container op classes respectively, but let me know if there's a better place to put them.

What is the use case or pain point?

Allow users to set a semaphore on a workflow or a step.

Is there a workaround currently?

Edit generated yaml.

Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

jmcarp · 2021-09-14T02:20:10Z

@Bobgy @capri-xiyue: if this request makes sense, let me know if there's a good place to add these options in the dsl, and I can write up a patch.

zijianjoy · 2021-09-16T23:41:48Z

Hello @jmcarp , what is the CUJ for setting limitation of synchronization on 1)workflow level, 2) step level?

jmcarp · 2021-09-17T03:44:28Z

Hi @zijianjoy, the typical use case in either case is to prevent concurrent execution of tasks that could otherwise change some shared resource. For example, if there's a race condition between tasks, or between multiple runs of the same task, it's helpful to use argo synchronization to prevent those tasks from running concurrently.

This is a useful argo feature, and it would be helpful to have it in kfp as well.

capri-xiyue · 2021-09-20T22:23:02Z

Hi @zijianjoy, the typical use case in either case is to prevent concurrent execution of tasks that could otherwise change some shared resource. For example, if there's a race condition between tasks, or between multiple runs of the same task, it's helpful to use argo synchronization to prevent those tasks from running concurrently.

This is a useful argo feature, and it would be helpful to have it in kfp as well.

Hi @jmcarp , in your use case, can you add synchronization logic for the shared resources? I think when it comes to shared resources, it also makes sense that the synchronization logic is added to the shared resources instead of the workflows.

jmcarp · 2021-09-21T14:22:12Z

I can implement my own synchronization, but that's the kind of feature I expect the workflow engine to provide--it's easy to get wrong, so ideally it's implemented once at the workflow layer and not independently by each user of the workflow tool. That's why tools like argo workflows and apache airflow include synchronization primitives, for example.

jmcarp · 2021-09-30T23:40:43Z

I see a 👍 from @Bobgy. Would you accept a patch for this?

Bobgy · 2021-10-19T09:03:23Z

I think that sounds like a fair argument, but regarding project status, we are transitioning to v2, so it's better reconsider this feature request after other critical features are ready.

stale · 2022-03-02T15:04:59Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

kdubovikov · 2023-08-23T12:35:55Z

Is there any update on that feature? We are currently using kfp 1.8 and are evaluating possibilities of migration towards v2.x. However, we rely heavily on Argo synchronization features, which we were able to utilize by patching Argo YAML that Kubeflow compiler was outputting. With IR YAML this will become impossible, since we won't have direct control over Argo YAML

kdubovikov · 2023-09-01T09:43:28Z

@Bobgy hi. Since 2.0 is officially released for quite some time, should this feature that allows semaphore usage get more attention?

github-actions · 2024-06-16T07:42:05Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gregsheremeta · 2024-06-20T14:42:46Z

gonna pick this up / not stale

github-actions · 2024-08-20T07:41:35Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gregsheremeta · 2024-08-20T10:00:28Z

working on this

gregsheremeta · 2024-08-20T10:00:41Z

/assign @gregsheremeta

gregsheremeta · 2024-09-09T21:59:13Z

It seems like workflow and template level synchronization options would live on the pipeline config and container op classes respectively, but let me know if there's a better place to put them.

I've been working on this, and I don't think template/component level is going to be trivial to get working.

Turns out all Components share a single template in the generated Workflow, and it looks roughly like this:

    - container:
        command:
          - should-be-overridden-during-runtime
          SNIP
        image: gcr.io/ml-pipeline/should-be-overridden-during-runtime
        name: ''
        resources: {}
        volumeMounts:
          - mountPath: /kfp-launcher
            SNIP
      initContainers:
        - command:
            - launcher-v2
            SNIP
      podSpecPatch: '{{inputs.parameters.pod-spec-patch}}'
      inputs:
        parameters:
          - name: pod-spec-patch
      name: system-container-impl
      metadata:
            SNIP
      outputs: {}
      volumes:
        - emptyDir: {}
          name: kfp-launcher
            SNIP

The Argo Workflows synchronization block for templates goes at that top level, alongside container, initContainer, etc. Since there is only one shared system-container-impl template for all user Components/Tasks, the semaphore or mutex name for a Component can't be set here.

We do have a dynamic way to set the things that change per each Component (command, image, volumes, etc.) -- podSpecPatch. Unfortunately, podSpecPatch is just what it sounds like ... it uses the PodSpec type to do the strategic merge on top of the (static) template at runtime. PodSpec has no synchronization fields (since that is a purely Argo Workflows concept), so there's no way to use PodSpec to dynamically set synchronization fields.

Possible paths forward that I can think of:

Give up on Component/Task level synchronization and settle for Pipeline (Workflow) level.
Move away from a single shared template for all Components. Seems like a large, complex change.
Figure out some way other than podSpecPatch to dynamically inject things into an Argo Workflows template. Maybe something like this already exists that isn't podSpecPatch, or perhaps it would have to be an Argo Workflows enhancement.

gregsheremeta · 2024-09-12T20:50:21Z

Not sure if any interested parties are still following this issue, but if so, I have some design questions.

Per my previous comment, I'm focusing on Pipeline (Workflow) level synchronization (locking) for now. I have a prototype working, but I have some questions about how we should move forward.

The pipeline code looks like this (note PipelineConfig is not yet merged, but can be found in #11112):

from kfp import dsl
from kfp.dsl import PipelineConfig


@dsl.component
def empty_component():
    pass

config = PipelineConfig()
config.set_semaphore_name("semaphore-mutex-pipeline")

@dsl.pipeline(name='semaphore-mutex-pipeline', pipeline_config=config)
def semaphore_mutex_pipeline():
    task = empty_component()
    task.set_caching_options(False)

The semaphore configmap looks like this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: semaphore-mutex-pipeline
  namespace: kubeflow
data:
  semaphore: '1'

When I almost-simultaneously start 3 runs of this pipeline, I expect that the first one will start, and the other two will wait until the lock frees up; then the second one will start and complete; and then the last one will go. And this is what happens. This flow renders quite nicely in the Argo Workflows UI (reverse chronological order):

You can see that there are nice informational messages, and the run state icons are pretty clear.

When everything is fully complete, you can see how the start and end times were staggered.

Each run takes 40 seconds, and notice how the start and finish times are staggered by 40 seconds.

However, in KFP, we don't know anything about what's holding things up in Argo Workflows. So KFP thinks all 3 started at the same time. It thinks run 1 takes ~40 seconds. It thinks run 2 takes ~80 seconds. It thinks run 3 takes ~120 seconds. That renders like this:

It's not horrible, but it's not great either, and it makes me wonder if the very simplistic approach where we treat locks as "just another random thing to set on the workflow" is the right approach. If we want to build lock awareness into the KFP UI, that might require adding the lock concept as a first class citizen in the KFP (apiserver) API.

Another thing that's making me ask this question is what happens when the semaphore isn't on the cluster. The Workflow fails very quickly when this happens. In the Argo UI, we get a very nice message about the reason. In the KFP UI, there's just a generic pipeline failed message -- nothing about the lock configmap missing.

Do folks think it's acceptable to treat locks as "just another random thing to set on the workflow" and thereby have KFP be basically ignorant of what's going on, or would it be better to make KFP more aware of locking and add first-class support for it to the apiserver and the UI?

github-actions · 2024-11-12T07:41:37Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gregsheremeta · 2024-11-12T15:14:56Z

/remove-lifecycle stale
/lifecycle frozen

jmcarp added the kind/feature label Sep 13, 2021

google-oss-robot added the area/sdk label Sep 13, 2021

zijianjoy added the needs more info label Sep 16, 2021

stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Mar 2, 2022

stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 23, 2023

github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jun 16, 2024

stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jun 20, 2024

github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 20, 2024

stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 20, 2024

google-oss-prow bot assigned gregsheremeta Aug 20, 2024

gregsheremeta mentioned this issue Sep 11, 2024

feat(sdk): add PipelineConfig to DSL to re-implement pipeline-level config #11112

Merged

1 task

github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Nov 12, 2024

google-oss-prow bot added lifecycle/frozen and removed lifecycle/stale The issue / pull request is stale, any activities remove this label. labels Nov 12, 2024

DharmitD linked a pull request Nov 13, 2024 that will close this issue

feat(backend): Add Semaphore and Mutex fields to Workflow CR #11370

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature] Expose argo synchronization options #6553

[feature] Expose argo synchronization options #6553

jmcarp commented Sep 13, 2021

jmcarp commented Sep 14, 2021

zijianjoy commented Sep 16, 2021

jmcarp commented Sep 17, 2021

capri-xiyue commented Sep 20, 2021 •

edited

Loading

jmcarp commented Sep 21, 2021

jmcarp commented Sep 30, 2021

Bobgy commented Oct 19, 2021

stale bot commented Mar 2, 2022

kdubovikov commented Aug 23, 2023

kdubovikov commented Sep 1, 2023

github-actions bot commented Jun 16, 2024

gregsheremeta commented Jun 20, 2024

github-actions bot commented Aug 20, 2024

gregsheremeta commented Aug 20, 2024

gregsheremeta commented Aug 20, 2024

gregsheremeta commented Sep 9, 2024

gregsheremeta commented Sep 12, 2024

github-actions bot commented Nov 12, 2024

gregsheremeta commented Nov 12, 2024

[feature] Expose argo synchronization options #6553

[feature] Expose argo synchronization options #6553

Comments

jmcarp commented Sep 13, 2021

Feature Area

What feature would you like to see?

What is the use case or pain point?

Is there a workaround currently?

jmcarp commented Sep 14, 2021

zijianjoy commented Sep 16, 2021

jmcarp commented Sep 17, 2021

capri-xiyue commented Sep 20, 2021 • edited Loading

jmcarp commented Sep 21, 2021

jmcarp commented Sep 30, 2021

Bobgy commented Oct 19, 2021

stale bot commented Mar 2, 2022

kdubovikov commented Aug 23, 2023

kdubovikov commented Sep 1, 2023

github-actions bot commented Jun 16, 2024

gregsheremeta commented Jun 20, 2024

github-actions bot commented Aug 20, 2024

gregsheremeta commented Aug 20, 2024

gregsheremeta commented Aug 20, 2024

gregsheremeta commented Sep 9, 2024

gregsheremeta commented Sep 12, 2024

github-actions bot commented Nov 12, 2024

gregsheremeta commented Nov 12, 2024

capri-xiyue commented Sep 20, 2021 •

edited

Loading