We’re currently experiencing some latency issues with our internal backend messaging system. As a result, some of you (hi @Patryk) might have noticed that Jobs/Runs are left hanging for hours.
We apologise for this and thank you for bearing with us while we work through this issue. We are going to do what we can to recover any lost data, but unfortunately some runs might be lost.
Feel free to comment below with any questions/concerns/feedback. You can tag me or @tyler if something urgent comes up!
Hi @Patryk, that actually gave us a very good case to see how far we can go before something breaks. Thanks!
I think you can just let them be. We might need to cancel some of them on your behalf and you can re-run them later, but let’s see where we get with this. Thanks again!
Hi @Patryk, I am looking into this now. As @Mostapha said, we may need to cancel some on your behalf, but we will follow up here afterwards. Thanks again for giving us a great use case!
So I ended up having to cancel the jobs that had started >= 5 runs in order to reduce the load on our database. The issue is with how we are storing/updating the jobs in the database, which became a bottleneck for a couple of different services. We will make this optimization a high priority.
Some of the jobs will continue to show a status of ‘Running’ until we can do a more thorough clean-up of the DB. As a workaround, you could try starting these large jobs one at a time while we implement a fix. Thanks again for your patience and for the stress test!
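In case it helps, here is a minimal sketch of that "one at a time" workaround in Python. To be clear, this is not our client API: `submit_job` and `get_job_status` below are dummy stand-ins for whatever submit/status calls you already use; the only point is the pattern of waiting for one job to finish before starting the next.

```python
import time
import random

# Dummy stand-ins -- swap these for the real submit/status calls in your own client.
def submit_job(job_name):
    print(f"submitting {job_name}")
    return job_name  # stand-in "job id"

def get_job_status(job_id):
    # pretend each job eventually finishes
    return random.choice(["Running", "Succeeded"])

def run_jobs_sequentially(jobs, poll_seconds=5):
    """Start each job only after the previous one has finished."""
    for job in jobs:
        job_id = submit_job(job)
        status = get_job_status(job_id)
        while status == "Running":
            time.sleep(poll_seconds)          # wait before polling again
            status = get_job_status(job_id)
        print(f"{job_id} finished with status: {status}")

run_jobs_sequentially(["study-1", "study-2"])
```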
Just a quick update that we have been working on this and have made some good progress. Most of the changes have already been merged to the production server. We are working through a few remaining side effects of these changes, and we will post a more comprehensive update once the issue is fully resolved!