Daylight Simulations Taking a Long Time to Complete

Hi @anthonyschneider,

It looks like the gods of cloud computing were waiting for me to publicly express my confidence in our computing service to humble me! :sweat_smile:

What happened in this case is different from what was happening 3 years ago with the issue that Max reported. I'll share the solution first, before I get technical about what happened here. If you see a case like this, where some of the runs in the study get stuck, cancel the study and then click “Retry Failed Runs”.

This is what I just did with your runs, and the simulations ran quickly and with no issues.


Now here is a more detailed response about what happened.

Pollination cloud computing is built on top of Argo Workflows, which is a workflow engine built on top of Kubernetes. Both technologies are widely used and are part of the Cloud Native Computing Foundation (CNCF).

These systems are designed to manage a large number of jobs, and to stay healthy they have a built-in strategy for killing some of those jobs when needed to protect the overall system. There are also built-in strategies for rescheduling the killed nodes. You can read more about it here:

Now, there is a bug in Argo: in a very particular scenario that involves containersets, the entry container gets killed by the controller but the child containers in the pod keep running. Even though the retry is triggered, the workflow still thinks that the child nodes are running and doesn’t move forward. This bug is why your workflow got stuck! This is a screenshot of the workflow.
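
If it helps to picture the stuck state, here is a toy Python model of it. To be clear, this is not Argo's code and not the Pollination API: `Node`, `can_advance`, `is_stuck`, and `cancel` are hypothetical names made up for illustration, and the real controller logic is far more involved. The sketch only assumes a simple rule, that a workflow moves forward once the entry container and all of its children have reached a terminal phase.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Phase(Enum):
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


TERMINAL = {Phase.SUCCEEDED, Phase.FAILED}


@dataclass
class Node:
    """Hypothetical stand-in for a container in a pod (not an Argo type)."""
    name: str
    phase: Phase = Phase.RUNNING
    children: List["Node"] = field(default_factory=list)


def can_advance(entry: Node) -> bool:
    """Toy rule: move on only when the entry and every child are terminal."""
    return entry.phase in TERMINAL and all(
        child.phase in TERMINAL for child in entry.children
    )


def is_stuck(entry: Node) -> bool:
    """Entry container is gone, but children are still reported as running."""
    return entry.phase is Phase.FAILED and any(
        child.phase is Phase.RUNNING for child in entry.children
    )


def cancel(entry: Node) -> None:
    """Toy version of cancelling: force every running child to a terminal phase."""
    for child in entry.children:
        if child.phase is Phase.RUNNING:
            child.phase = Phase.FAILED


entry = Node("entry", children=[Node("child-a"), Node("child-b")])

# The controller kills the entry container; the children keep running.
entry.phase = Phase.FAILED
stuck_before = is_stuck(entry)

# Cancelling clears the lingering children, so the run can be retried.
cancel(entry)

print(stuck_before, is_stuck(entry), can_advance(entry))
```

In this toy model, cancelling is what moves the lingering children out of the running state, which is roughly why cancelling the study first and then retrying the failed runs gets things unstuck.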

We worked with the Argo development team for about a year to address this issue. To their credit, they pushed many fixes and addressed most of the cases, but once in a while we still see this happening. That’s why we have added the Retry Failed Runs button to the UI.

I know it is not ideal but it is the best solution that we could come up with for now. Hopefully, this edge case will also be fixed in the newer releases of Argo. I will report this issue to ensure they are aware of it.

I also know this is a much longer and more technical answer than you would have expected, but I hope it gives you a better understanding of why this might still be happening.

Let me know if you have any questions.
