Skip to content

Commit 434253d

Browse files
fix: Major refactoring of Polling, Retry and Timeout logic (#462)
* fix: Major refactoring and fix for Polling, Retry and Timeout logic This is in response to https://freeman.vc/notes/aws-vs-gcp-reliability-is-wildly-different, which triggered an investigation of the whole Polling/Retry/Timeout behavior in Python GAPIC clients and revealed many fundamental flaws in its implementaiton. To properly describe the refactoring this PR does we need to stick to a rigorous terminology, as vague definitions of retries, timeouts, polling and related concepts seems to be the main source of the present bugs and overal confusion among both groups: users of the library and creators of the library. Please check the documentation of the `google.api_core.retry.Retry` class the `google.api_core.future.polling.Polling.result()` method for the proper definitions and context. Note, the overall semantics around Polling, Retry and Timeout remains quite confusing even after refactoring (although it is now more or less rigorously defined), but it was clean as I could make it while still maintaining backward compatibility of the whole library. The quick summary of the changes in this PR: 1) Properly define and fix the application of Deadline and Timeout concepts. Please check the updated documentation for the `google.api_core.retry.Retry` class for the actual definitions. Originally the `deadline` has been used to represent timeouts conflating the two concepts. As result this PR replaces `deadline` arguments with `timeout` ones in as backward-compatible manner as possible (i.e. backward compatible in all practical applications). 2) Properly define RPC Timeout, Retry Timeout and Pollint Timeout and how a generic Timeout concept (aka Logical Timeout) is mapped to one of those depending on the context. Please check `google.api_core.retry.Retry` class documentation for details. 3) Properly define and fix the application of Retry and Polling concepts. Please check the updated documentation for `google.api_core.future.polling.PollingFuture.result()` for details. 4) Separate `retry` and `polling` configurations for Polling future, as these are two different concepts (although both operating on `Retry` class). Originally both retry and polling configurations were controlled by a single `retry` parameter, merging configuration regarding how "rpc error responses" and how "operation not completed" responses are supposed to be handled. 5) For the following config properties - `Retry (including `Retry Timeout`), `Polling` (including `Polling Timeout`) and `RPC Timeout` - fix and properly define how each of the above properties gets configured and which config gets precedence in case of a conflict (check `PollingFuture.result()` method documentation for details). Each of those properties can be specified as follows: directly provided by the user for each call, specified during gapic generation time from config values in `grpc_service_config.json` file (for Retry and RPC Timeout) and `gapic.yaml` file (for Polling), or be provided as a hard-coded basic default values in python-api-core library itself. 6) Fix the per-call polling config propagation logic (the polling/retry configs supplied to `PollingFuture.result()` used to be ignored for actual call). 7) Deprecate the usage of `deadline` terminology in the whole library and backward-compatibly replace it with timeout. This is essential as what has been called "deadline" in this library was actually "timeout" as it is defined in `google.api_core.retry.Retry` class documentation. 8) Deprecate `ExponentialTimeout`, `ConstantTimeout` and related logic as those are outdated concepts and are not consistent with the other GAPIC Languages. Replace it with `TimeToDeadlineTimeout` to be consistent with how the rest of the languages do it. 9) Deprecate `google.api_core.operations_v1.config` as it is an outdated concept and self-inconsistent (as all gapic clients provide configuraiton in code). The configs are directly provided in code instead. 10) Switch randomized delay calculation from `delay` being treated as expected value for randomized_delay to `delay` being treated as maximum value for `randomized_delay` (i.e. the new expected valud for `randomized_delay` is `delay / 2`). See the `exponential_sleep_generator()` method implementation for details. This is needed to make Python implementation of retries and polling exponential backoff consistent with the rest of GAPIC languages. Also fix the uncontrollable growth of `delay` value (since it is a subject of exponential growth, the `delay` value was quickly reaching "infinity" value, and the whole thing was not failing simply due to python being a very forgiving language which forgives multiplying "infinity" by a number (`inf * number = inf`) binstead of simply overflowing to a (most likely) negative number). 11) Fix url construction in `OperationsRestTransport`. Without this fix the polling logic for REST transport was completely broken (is not affecting Compute client, as that one has custom LRO). 12) Las but not least: change the default values for Polling logic to be the following: `initial=1.0` (same as before), `maximum=20.0` (was `60`), `multiplier=1.5` (was `2.0`), `timeout=900` (was `120`, but due to timeout resolution logic was actually None (i.e. infinity)). This, in conjunction with changed calculation of randomized delay (i.e. its expected value now being `delay / 2`) overall makes polling logic much less aggressive in terms of increasing delays between each polling iteration, making LRO return much earlier for users on average, but still keeping a healthy balance between strain put on both client and server by polling and responsiveness of LROs for user. *The design doc summarising all the changes and reasons for them is in progress. * fix ci failures (mainly sphinx errors) * remove unused code * fix typo * Pin pytest version to <7.2.0 * reformat code * address pr feedback * address PR feedback * address pr feedback * Update google/api_core/future/polling.py Co-authored-by: Victor Chudnovsky <[email protected]> * Apply documentation suggestions from code review Co-authored-by: Victor Chudnovsky <[email protected]> * Address PR feedback Co-authored-by: Victor Chudnovsky <[email protected]>
1 parent ddcc249 commit 434253d

34 files changed

+789
-416
lines changed

.github/workflows/lint.yml

+1-1
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ jobs:
1212
- name: Setup Python
1313
uses: actions/setup-python@v4
1414
with:
15-
python-version: "3.7"
15+
python-version: "3.10"
1616
- name: Install nox
1717
run: |
1818
python -m pip install --upgrade setuptools pip wheel

.github/workflows/mypy.yml

+1-1
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ jobs:
1212
- name: Setup Python
1313
uses: actions/setup-python@v4
1414
with:
15-
python-version: "3.7"
15+
python-version: "3.10"
1616
- name: Install nox
1717
run: |
1818
python -m pip install --upgrade setuptools pip wheel

google/api_core/extended_operation.py

+19-9
Original file line numberDiff line numberDiff line change
@@ -50,10 +50,13 @@ class ExtendedOperation(polling.PollingFuture):
5050
refresh (Callable[[], type(extended_operation)]): A callable that returns
5151
the latest state of the operation.
5252
cancel (Callable[[], None]): A callable that tries to cancel the operation.
53-
retry: Optional(google.api_core.retry.Retry): The retry configuration used
54-
when polling. This can be used to control how often :meth:`done`
55-
is polled. Regardless of the retry's ``deadline``, it will be
56-
overridden by the ``timeout`` argument to :meth:`result`.
53+
polling Optional(google.api_core.retry.Retry): The configuration used
54+
for polling. This can be used to control how often :meth:`done`
55+
is polled. If the ``timeout`` argument to :meth:`result` is
56+
specified it will override the ``polling.timeout`` property.
57+
retry Optional(google.api_core.retry.Retry): DEPRECATED use ``polling``
58+
instead. If specified it will override ``polling`` parameter to
59+
maintain backward compatibility.
5760
5861
Note: Most long-running API methods use google.api_core.operation.Operation
5962
This class is a wrapper for a subset of methods that use alternative
@@ -68,9 +71,14 @@ class ExtendedOperation(polling.PollingFuture):
6871
"""
6972

7073
def __init__(
71-
self, extended_operation, refresh, cancel, retry=polling.DEFAULT_RETRY
74+
self,
75+
extended_operation,
76+
refresh,
77+
cancel,
78+
polling=polling.DEFAULT_POLLING,
79+
**kwargs,
7280
):
73-
super().__init__(retry=retry)
81+
super().__init__(polling=polling, **kwargs)
7482
self._extended_operation = extended_operation
7583
self._refresh = refresh
7684
self._cancel = cancel
@@ -114,7 +122,7 @@ def error_message(self):
114122
def __getattr__(self, name):
115123
return getattr(self._extended_operation, name)
116124

117-
def done(self, retry=polling.DEFAULT_RETRY):
125+
def done(self, retry=None):
118126
self._refresh_and_update(retry)
119127
return self._extended_operation.done
120128

@@ -137,9 +145,11 @@ def cancelled(self):
137145
self._refresh_and_update()
138146
return self._extended_operation.done
139147

140-
def _refresh_and_update(self, retry=polling.DEFAULT_RETRY):
148+
def _refresh_and_update(self, retry=None):
141149
if not self._extended_operation.done:
142-
self._extended_operation = self._refresh(retry=retry)
150+
self._extended_operation = (
151+
self._refresh(retry=retry) if retry else self._refresh()
152+
)
143153
self._handle_refreshed_operation()
144154

145155
def _handle_refreshed_operation(self):

google/api_core/future/async_future.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -95,7 +95,7 @@ async def _blocking_poll(self, timeout=None):
9595
if self._future.done():
9696
return
9797

98-
retry_ = self._retry.with_deadline(timeout)
98+
retry_ = self._retry.with_timeout(timeout)
9999

100100
try:
101101
await retry_(self._done_or_raise)()

google/api_core/future/polling.py

+165-35
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
import concurrent.futures
1919

2020
from google.api_core import exceptions
21-
from google.api_core import retry
21+
from google.api_core import retry as retries
2222
from google.api_core.future import _helpers
2323
from google.api_core.future import base
2424

@@ -29,14 +29,37 @@ class _OperationNotComplete(Exception):
2929
pass
3030

3131

32-
RETRY_PREDICATE = retry.if_exception_type(
32+
# DEPRECATED as it conflates RPC retry and polling concepts into one.
33+
# Use POLLING_PREDICATE instead to configure polling.
34+
RETRY_PREDICATE = retries.if_exception_type(
3335
_OperationNotComplete,
3436
exceptions.TooManyRequests,
3537
exceptions.InternalServerError,
3638
exceptions.BadGateway,
3739
exceptions.ServiceUnavailable,
3840
)
39-
DEFAULT_RETRY = retry.Retry(predicate=RETRY_PREDICATE)
41+
42+
# DEPRECATED: use DEFAULT_POLLING to configure LRO polling logic. Construct
43+
# Retry object using its default values as a baseline for any custom retry logic
44+
# (not to be confused with polling logic).
45+
DEFAULT_RETRY = retries.Retry(predicate=RETRY_PREDICATE)
46+
47+
# POLLING_PREDICATE is supposed to poll only on _OperationNotComplete.
48+
# Any RPC-specific errors (like ServiceUnavailable) will be handled
49+
# by retry logic (not to be confused with polling logic) which is triggered for
50+
# every polling RPC independently of polling logic but within its context.
51+
POLLING_PREDICATE = retries.if_exception_type(
52+
_OperationNotComplete,
53+
)
54+
55+
# Default polling configuration
56+
DEFAULT_POLLING = retries.Retry(
57+
predicate=POLLING_PREDICATE,
58+
initial=1.0, # seconds
59+
maximum=20.0, # seconds
60+
multiplier=1.5,
61+
timeout=900, # seconds
62+
)
4063

4164

4265
class PollingFuture(base.Future):
@@ -45,21 +68,29 @@ class PollingFuture(base.Future):
4568
The :meth:`done` method should be implemented by subclasses. The polling
4669
behavior will repeatedly call ``done`` until it returns True.
4770
71+
The actuall polling logic is encapsulated in :meth:`result` method. See
72+
documentation for that method for details on how polling works.
73+
4874
.. note::
4975
5076
Privacy here is intended to prevent the final class from
5177
overexposing, not to prevent subclasses from accessing methods.
5278
5379
Args:
54-
retry (google.api_core.retry.Retry): The retry configuration used
55-
when polling. This can be used to control how often :meth:`done`
56-
is polled. Regardless of the retry's ``deadline``, it will be
57-
overridden by the ``timeout`` argument to :meth:`result`.
80+
polling (google.api_core.retry.Retry): The configuration used for polling.
81+
This parameter controls how often :meth:`done` is polled. If the
82+
``timeout`` argument is specified in :meth:`result` method it will
83+
override the ``polling.timeout`` property.
84+
retry (google.api_core.retry.Retry): DEPRECATED use ``polling`` instead.
85+
If set, it will override ``polling`` paremeter for backward
86+
compatibility.
5887
"""
5988

60-
def __init__(self, retry=DEFAULT_RETRY):
89+
_DEFAULT_VALUE = object()
90+
91+
def __init__(self, polling=DEFAULT_POLLING, **kwargs):
6192
super(PollingFuture, self).__init__()
62-
self._retry = retry
93+
self._polling = kwargs.get("retry", polling)
6394
self._result = None
6495
self._exception = None
6596
self._result_set = False
@@ -69,57 +100,150 @@ def __init__(self, retry=DEFAULT_RETRY):
69100
self._done_callbacks = []
70101

71102
@abc.abstractmethod
72-
def done(self, retry=DEFAULT_RETRY):
103+
def done(self, retry=None):
73104
"""Checks to see if the operation is complete.
74105
75106
Args:
76-
retry (google.api_core.retry.Retry): (Optional) How to retry the RPC.
107+
retry (google.api_core.retry.Retry): (Optional) How to retry the
108+
polling RPC (to not be confused with polling configuration. See
109+
the documentation for :meth:`result` for details).
77110
78111
Returns:
79112
bool: True if the operation is complete, False otherwise.
80113
"""
81114
# pylint: disable=redundant-returns-doc, missing-raises-doc
82115
raise NotImplementedError()
83116

84-
def _done_or_raise(self, retry=DEFAULT_RETRY):
117+
def _done_or_raise(self, retry=None):
85118
"""Check if the future is done and raise if it's not."""
86-
kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
87-
88-
if not self.done(**kwargs):
119+
if not self.done(retry=retry):
89120
raise _OperationNotComplete()
90121

91122
def running(self):
92123
"""True if the operation is currently running."""
93124
return not self.done()
94125

95-
def _blocking_poll(self, timeout=None, retry=DEFAULT_RETRY):
96-
"""Poll and wait for the Future to be resolved.
126+
def _blocking_poll(self, timeout=_DEFAULT_VALUE, retry=None, polling=None):
127+
"""Poll and wait for the Future to be resolved."""
97128

98-
Args:
99-
timeout (int):
100-
How long (in seconds) to wait for the operation to complete.
101-
If None, wait indefinitely.
102-
"""
103129
if self._result_set:
104130
return
105131

106-
retry_ = self._retry.with_deadline(timeout)
132+
polling = polling or self._polling
133+
if timeout is not PollingFuture._DEFAULT_VALUE:
134+
polling = polling.with_timeout(timeout)
107135

108136
try:
109-
kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
110-
retry_(self._done_or_raise)(**kwargs)
137+
polling(self._done_or_raise)(retry=retry)
111138
except exceptions.RetryError:
112139
raise concurrent.futures.TimeoutError(
113-
"Operation did not complete within the designated " "timeout."
140+
f"Operation did not complete within the designated timeout of "
141+
f"{polling.timeout} seconds."
114142
)
115143

116-
def result(self, timeout=None, retry=DEFAULT_RETRY):
117-
"""Get the result of the operation, blocking if necessary.
144+
def result(self, timeout=_DEFAULT_VALUE, retry=None, polling=None):
145+
"""Get the result of the operation.
146+
147+
This method will poll for operation status periodically, blocking if
148+
necessary. If you just want to make sure that this method does not block
149+
for more than X seconds and you do not care about the nitty-gritty of
150+
how this method operates, just call it with ``result(timeout=X)``. The
151+
other parameters are for advanced use only.
152+
153+
Every call to this method is controlled by the following three
154+
parameters, each of which has a specific, distinct role, even though all three
155+
may look very similar: ``timeout``, ``retry`` and ``polling``. In most
156+
cases users do not need to specify any custom values for any of these
157+
parameters and may simply rely on default ones instead.
158+
159+
If you choose to specify custom parameters, please make sure you've
160+
read the documentation below carefully.
161+
162+
First, please check :class:`google.api_core.retry.Retry`
163+
class documentation for the proper definition of timeout and deadline
164+
terms and for the definition the three different types of timeouts.
165+
This class operates in terms of Retry Timeout and Polling Timeout. It
166+
does not let customizing RPC timeout and the user is expected to rely on
167+
default behavior for it.
168+
169+
The roles of each argument of this method are as follows:
170+
171+
``timeout`` (int): (Optional) The Polling Timeout as defined in
172+
:class:`google.api_core.retry.Retry`. If the operation does not complete
173+
within this timeout an exception will be thrown. This parameter affects
174+
neither Retry Timeout nor RPC Timeout.
175+
176+
``retry`` (google.api_core.retry.Retry): (Optional) How to retry the
177+
polling RPC. The ``retry.timeout`` property of this parameter is the
178+
Retry Timeout as defined in :class:`google.api_core.retry.Retry`.
179+
This parameter defines ONLY how the polling RPC call is retried
180+
(i.e. what to do if the RPC we used for polling returned an error). It
181+
does NOT define how the polling is done (i.e. how frequently and for
182+
how long to call the polling RPC); use the ``polling`` parameter for that.
183+
If a polling RPC throws and error and retrying it fails, the whole
184+
future fails with the corresponding exception. If you want to tune which
185+
server response error codes are not fatal for operation polling, use this
186+
parameter to control that (``retry.predicate`` in particular).
187+
188+
``polling`` (google.api_core.retry.Retry): (Optional) How often and
189+
for how long to call the polling RPC periodically (i.e. what to do if
190+
a polling rpc returned successfully but its returned result indicates
191+
that the long running operation is not completed yet, so we need to
192+
check it again at some point in future). This parameter does NOT define
193+
how to retry each individual polling RPC in case of an error; use the
194+
``retry`` parameter for that. The ``polling.timeout`` of this parameter
195+
is Polling Timeout as defined in as defined in
196+
:class:`google.api_core.retry.Retry`.
197+
198+
For each of the arguments, there are also default values in place, which
199+
will be used if a user does not specify their own. The default values
200+
for the three parameters are not to be confused with the default values
201+
for the corresponding arguments in this method (those serve as "not set"
202+
markers for the resolution logic).
203+
204+
If ``timeout`` is provided (i.e.``timeout is not _DEFAULT VALUE``; note
205+
the ``None`` value means "infinite timeout"), it will be used to control
206+
the actual Polling Timeout. Otherwise, the ``polling.timeout`` value
207+
will be used instead (see below for how the ``polling`` config itself
208+
gets resolved). In other words, this parameter effectively overrides
209+
the ``polling.timeout`` value if specified. This is so to preserve
210+
backward compatibility.
211+
212+
If ``retry`` is provided (i.e. ``retry is not None``) it will be used to
213+
control retry behavior for the polling RPC and the ``retry.timeout``
214+
will determine the Retry Timeout. If not provided, the
215+
polling RPC will be called with whichever default retry config was
216+
specified for the polling RPC at the moment of the construction of the
217+
polling RPC's client. For example, if the polling RPC is
218+
``operations_client.get_operation()``, the ``retry`` parameter will be
219+
controlling its retry behavior (not polling behavior) and, if not
220+
specified, that specific method (``operations_client.get_operation()``)
221+
will be retried according to the default retry config provided during
222+
creation of ``operations_client`` client instead. This argument exists
223+
mainly for backward compatibility; users are very unlikely to ever need
224+
to set this parameter explicitly.
225+
226+
If ``polling`` is provided (i.e. ``polling is not None``), it will be used
227+
to controll the overall polling behavior and ``polling.timeout`` will
228+
controll Polling Timeout unless it is overridden by ``timeout`` parameter
229+
as described above. If not provided, the``polling`` parameter specified
230+
during construction of this future (the ``polling`` argument in the
231+
constructor) will be used instead. Note: since the ``timeout`` argument may
232+
override ``polling.timeout`` value, this parameter should be viewed as
233+
coupled with the ``timeout`` parameter as described above.
118234
119235
Args:
120-
timeout (int):
121-
How long (in seconds) to wait for the operation to complete.
122-
If None, wait indefinitely.
236+
timeout (int): (Optional) How long (in seconds) to wait for the
237+
operation to complete. If None, wait indefinitely.
238+
retry (google.api_core.retry.Retry): (Optional) How to retry the
239+
polling RPC. This defines ONLY how the polling RPC call is
240+
retried (i.e. what to do if the RPC we used for polling returned
241+
an error). It does NOT define how the polling is done (i.e. how
242+
frequently and for how long to call the polling RPC).
243+
polling (google.api_core.retry.Retry): (Optional) How often and
244+
for how long to call polling RPC periodically. This parameter
245+
does NOT define how to retry each individual polling RPC call
246+
(use the ``retry`` parameter for that).
123247
124248
Returns:
125249
google.protobuf.Message: The Operation's result.
@@ -128,8 +252,8 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
128252
google.api_core.GoogleAPICallError: If the operation errors or if
129253
the timeout is reached before the operation completes.
130254
"""
131-
kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
132-
self._blocking_poll(timeout=timeout, **kwargs)
255+
256+
self._blocking_poll(timeout=timeout, retry=retry, polling=polling)
133257

134258
if self._exception is not None:
135259
# pylint: disable=raising-bad-type
@@ -138,12 +262,18 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
138262

139263
return self._result
140264

141-
def exception(self, timeout=None):
265+
def exception(self, timeout=_DEFAULT_VALUE):
142266
"""Get the exception from the operation, blocking if necessary.
143267
268+
See the documentation for the :meth:`result` method for details on how
269+
this method operates, as both ``result`` and this method rely on the
270+
exact same polling logic. The only difference is that this method does
271+
not accept ``retry`` and ``polling`` arguments but relies on the default ones
272+
instead.
273+
144274
Args:
145275
timeout (int): How long to wait for the operation to complete.
146-
If None, wait indefinitely.
276+
If None, wait indefinitely.
147277
148278
Returns:
149279
Optional[google.api_core.GoogleAPICallError]: The operation's

0 commit comments

Comments
 (0)