K8s Watch: How to deal with 401 Status Errors #403

Closed
chekolyn opened this issue Nov 28, 2017 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

chekolyn commented Nov 28, 2017

Hello All,

I'm working with a cluster that has a lot of objects, and I want to use Watch to track ReplicationController changes across all namespaces.

One odd issue that arises most of the time is an error status message from the API endpoint:

{'raw_object': {u'status': u'Failure', u'kind': u'Status', u'message': u'401: The event in requested index is outdated and cleared (the requested history has been cleared [2699914797/2699913361]) [2699915796]', u'apiVersion': u'v1', u'metadata': {}}, u'object': {'api_version': 'v1',
 'kind': 'Status',
 'metadata': {'annotations': None,
              'cluster_name': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'initializers': None,
              'labels': None,
              'name': None,
              'namespace': None,
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': None,
 'status': {'available_replicas': None,
            'conditions': None,
            'fully_labeled_replicas': None,
            'observed_generation': None,
            'ready_replicas': None,
            'replicas': None}}, u'type': u'ERROR'}

Looking at the error message, it seems the source is etcd:
https://github.com/coreos/etcd/blob/36c655f29b9fef15c4094093c7407967c5c5bb96/Documentation/v2/api.md#waiting-for-a-change
The etcd docs suggest retrying with the index plus one, but that hasn't quite worked for me.

I looked at how kubectl handles the Watch; it seems they force the use of resource version 0:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/get.go#L261

This is how I'm calling the watch:

for event_full in watch.stream(api.list_replication_controller_for_all_namespaces, resource_version=0, _request_timeout=(30, 300)):
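
For reference, a self-contained version of that call looks roughly like this (just a sketch; the kubeconfig loading and the print statement are assumptions on my part, not part of my actual code):

from kubernetes import client, config, watch

config.load_kube_config()          # assumption: running outside the cluster with a kubeconfig
api = client.CoreV1Api()
w = watch.Watch()

# _request_timeout=(connect timeout, read timeout) in seconds
for event in w.stream(api.list_replication_controller_for_all_namespaces,
                      resource_version=0,
                      _request_timeout=(30, 300)):
    print(event['type'], event['object'].metadata.name)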

Has anyone seen this issue before? Any suggestions?

chekolyn (Author) commented

It seems I might have a solution to this problem; it's in line with the docs from the etcd link above.

Here's the sub-class I created to test this:

import re

from kubernetes import watch as k8s_watch


class MyWatch(k8s_watch.Watch):
    def __init__(self, return_type=None):
        k8s_watch.Watch.__init__(self, return_type=return_type)

    def streamPersist(self, func, *args, **kwargs):
        for event_full in self.stream(func, *args, **kwargs):
            kind = event_full['raw_object'].get('kind', '')
            status = event_full['raw_object'].get('status', '')
            message = event_full['raw_object'].get('message', '')
            error_msg = '401: The event in requested index is outdated and cleared'

            # Check if we got a bad response from the API
            if "Status" in kind and "Failure" in status and error_msg in message:
                print("Status FAILED")
                print(event_full)
                # The trailing [NNN] in the message is the current etcd index
                match = re.search(r"^401.*\[(\d*)\]$", message)
                if match and match.group(1):
                    # Index number plus one
                    # See https://github.com/coreos/etcd/blob/36c655f29b9fef15c4094093c7407967c5c5bb96/Documentation/v2/api.md#waiting-for-a-change
                    resource_version = int(match.group(1)) + 1
                    print("Resource Version: %s" % resource_version)
                    kwargs['resource_version'] = resource_version
                    print(kwargs)
                    print("calling new generator:")
                    # Reconnect at the new resource_version and keep yielding events
                    for new_event in self.streamPersist(func, *args, **kwargs):
                        yield new_event
            else:
                yield event_full

In a cluster with a lot of activity and a lot of objects, it can take some time to stream all the data (especially if you are pulling all-namespaces data); with a regular watch call the generator terminates once we receive the 401 message.

In the code above I extended the Watch class with a recursive generator, streamPersist, that makes a new stream call to the API as soon as it detects the 401 message. It's important to reconnect as soon as possible once we determine the index/resource_version and to pass that as a parameter to the new stream call.
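
Used in place of the plain stream call from my first comment, it would look roughly like this (just a sketch, reusing the same api object as above):

persistent_watch = MyWatch()
for event in persistent_watch.streamPersist(api.list_replication_controller_for_all_namespaces,
                                            resource_version=0,
                                            _request_timeout=(30, 300)):
    # events from any reconnected stream are yielded transparently here
    print(event['type'])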

@mbohlool Do you think the native stream function should be expanded to deal with these API issues?
It seems this 401 shows up in other forms as well; please see this thread

I can certainly help out; the thing is, I'm not sure how to replicate this issue outside of my environment. The code above certainly has a lot of room for improvement.

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 22, 2019
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 22, 2019
fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
