Hi,
I am using the following function to check the progress of a job executed on the cloud.
import time

from pollination_streamlit.interactors import Job
from queenbee.job.job import JobStatusEnum


def check_study_status(study: Job):
    """Poll the study until it stops pending or running, printing run counts."""
    status = study.status.status
    while True:
        status_info = study.status
        print('\t# ------------------ #')
        print(f'\t# pending runs: {status_info.runs_pending}')
        print(f'\t# running runs: {status_info.runs_running}')
        print(f'\t# failed runs: {status_info.runs_failed}')
        print(f'\t# completed runs: {status_info.runs_completed}')
        if status in [
            JobStatusEnum.pre_processing, JobStatusEnum.running, JobStatusEnum.created,
            JobStatusEnum.unknown
        ]:
            time.sleep(15)
            study.refresh()
            # re-read the status after refresh(); status_info was fetched before it
            status = study.status.status
        else:
            # study is finished
            time.sleep(2)
            break
Previously it was working fine, but today it raises the following error:
Traceback (most recent call last):
File ~\AppData\Local\anaconda3\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
exec(code, globals, locals)
File c:\users\arr18sep\desktop\pollination\marl_cloud.py:860
check_study_status(study=study)
File ~\Desktop\pollination\herp\pollination_interact.py:136 in check_study_status
study.refresh()
File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\interactors.py:95 in refresh
self._fetch_runs()
File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\interactors.py:88 in _fetch_runs
self._runs = self.run_api.get_runs(self.owner, self.project, self.id)
File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\runs.py:22 in get_runs
return self._run_results_request(owner, project, job_id)
File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\runs.py:10 in _run_results_request
res = self.client.get(
File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\client.py:82 in get
res.raise_for_status()
File ~\AppData\Local\anaconda3\lib\site-packages\requests\models.py:1021 in raise_for_status
raise HTTPError(http_error_msg, response=self)
HTTPError: 500 Server Error: Internal Server Error for url: https://api.pollination.cloud/projects/centipede-llc/seventh_tst/results?job_id=55ece200-dbed-466f-a889-0051d4feaadb&page=1
I have seen this error before, but only rarely. Now it does not even let me download the results of the first job in a series of jobs to submit. The models are uploaded, but after the first run of the function it apparently loses communication with the server and does not receive any further response.
Thanks for your help!
I have tried several times, even after receiving your response. The issue persists. The first batch of models is uploaded, but then the check study status function runs only once. The output freezes for more than a minute, and the same error is returned again. I have tried different IDEs, refreshing the API token and changing the project folder. Nothing seems to work.
This is the output just before the HTTP error:
It freezes there for more than a minute, and finally, the Error 500 is returned. This is definitely a communication error, as I can access my account and see the models uploaded and the jobs completed.
@serpoag - I updated the code under the other topic and tried running it a couple of times. Everything works as expected. Can you give it a try and let me know if it also works for you? Thanks.
Running perfectly as usual. Thanks! I tried both the old code and the new one, and both work fine. I guess it was a temporary issue on the server side. Did you find any possible cause?
Excellent! Yes. It was a misconfiguration on our end that affected a few internal calls. They were timing out, so you would get a 500 response. This issue should not happen again. The other problem you faced before was caused by infrastructure unavailability, which is out of our control. We have workflows in place that recover from those instances quickly, but recovery can take a few seconds. The new check that I put in the code should keep your script running until the automated fix kicks in.
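For anyone landing here later: the general idea of keeping a script alive through a short server outage can be sketched as a small retry wrapper. This is only an illustration, not the check that was added to the code in the other topic. The name call_with_retry and its parameters are hypothetical, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 5,
    delay_seconds: float = 10.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func and return its result, retrying on the given exception types.

    The last error is re-raised once max_attempts calls have failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as error:
            if attempt == max_attempts:
                # out of attempts: let the caller see the original error
                raise
            print(f"attempt {attempt} failed ({error}); retrying in {delay_seconds}s")
            sleep(delay_seconds)


# Assumed usage inside the polling loop, in place of a bare study.refresh():
#     import requests
#     call_with_retry(study.refresh, retry_on=(requests.HTTPError,))
```

Restricting retry_on to the HTTP error type means that unrelated bugs still fail immediately instead of being retried. With a fixed delay the script waits at most (max_attempts - 1) * delay_seconds before giving up.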