There are no error logs so I can’t figure out what went wrong. I have noticed that the two subroutines look identical (why are there two?) but that the second one completed successfully:
Thanks for bringing this to our attention, and apologies for the confusion. It looks like the run actually completed successfully (hence it looks the same as the others), but our monitoring service didn't register the success, which is why the web application still shows it as running.
I'm going to look into this further today, but right now it appears that the service controlling the workflows was restarted while this run was in progress, which may have caused the update service to miss the change in status.
The reason there are two nodes is that the workflow engine is configured to retry any failed node, which handles transient errors caused by latency, e.g. trying to download a file that has not finished uploading yet.
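For illustration only, the retry behaviour is roughly equivalent to the following sketch (the names, retry count, and delay here are hypothetical, not our actual engine code):

```python
import time

def run_with_retry(node, max_attempts=2, delay_seconds=5):
    """Run a workflow node, retrying on failure to absorb transient
    latency errors (e.g. a file that hasn't finished uploading yet).

    `node` is any callable representing the node's work; both it and
    the parameter defaults are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return node()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the real failure
            time.sleep(delay_seconds)  # give the slow side time to catch up
```

Because each attempt appears as its own node in the run view, a transient failure followed by a successful retry shows up as two identical-looking nodes, with only the second one marked successful.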
Just to follow up on this: our run monitoring service was indeed unresponsive for several hours while this job was in progress. We had previously implemented a check to prevent exactly this, but forgot to enable it on our compute cluster.
I merged the necessary change earlier for testing, and it will be in production soon. Apologies again for the confusion! And thanks for running a real-world simulation; this is exactly what we need to improve the platform!