K8s Watch: How to deal with 401 Status Errors #403

Closed
chekolyn opened this issue Nov 28, 2017 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

chekolyn commented Nov 28, 2017

Hello All,

I'm working with a cluster that has a lot of objects, and I want to use Watch to track ReplicationController changes across all namespaces.

One odd issue that arises most of the time is an error status message from the API endpoint:

{'raw_object': {u'status': u'Failure', u'kind': u'Status', u'message': u'401: The event in requested index is outdated and cleared (the requested history has been cleared [2699914797/2699913361]) [2699915796]', u'apiVersion': u'v1', u'metadata': {}}, u'object': {'api_version': 'v1',
 'kind': 'Status',
 'metadata': {'annotations': None,
              'cluster_name': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'initializers': None,
              'labels': None,
              'name': None,
              'namespace': None,
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': None,
 'status': {'available_replicas': None,
            'conditions': None,
            'fully_labeled_replicas': None,
            'observed_generation': None,
            'ready_replicas': None,
            'replicas': None}}, u'type': u'ERROR'}

Looking at the error message, it seems the source is etcd:
https://github.com/coreos/etcd/blob/36c655f29b9fef15c4094093c7407967c5c5bb96/Documentation/v2/api.md#waiting-for-a-change
The etcd docs suggest retrying with the index plus one, but that hasn't quite worked for me.

I looked at how kubectl handles the Watch; it seems they force the use of resource version 0:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/get.go#L261

This is how I'm calling the watch:

for event_full in watch.stream(api.list_replication_controller_for_all_namespaces, resource_version=0, _request_timeout=(30, 300)):
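
For reference, a self-contained version of that call looks roughly like this (just a sketch; the kubeconfig loading and the print statement are assumptions on my part, not part of my actual code):

from kubernetes import client, config, watch

config.load_kube_config()          # assumption: running outside the cluster with a kubeconfig
api = client.CoreV1Api()
w = watch.Watch()

# _request_timeout=(connect timeout, read timeout) in seconds
for event in w.stream(api.list_replication_controller_for_all_namespaces,
                      resource_version=0,
                      _request_timeout=(30, 300)):
    print(event['type'], event['object'].metadata.name)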

Has anyone seen this issue before? Any suggestions?

chekolyn (Author) commented

It seems I might have a solution to this problem; it's in line with the docs from the etcd link above.

Here's the sub-class I created to test this:

import re

from kubernetes import watch as k8s_watch


class MyWatch(k8s_watch.Watch):
    def __init__(self, return_type=None):
        k8s_watch.Watch.__init__(self, return_type=return_type)

    def streamPersist(self, func, *args, **kwargs):
        for event_full in self.stream(func, *args, **kwargs):
            kind = event_full['raw_object'].get('kind', '')
            status = event_full['raw_object'].get('status', '')
            message = event_full['raw_object'].get('message', '')
            error_msg = '401: The event in requested index is outdated and cleared'

            # Check if we got a bad response from the API
            if "Status" in kind and "Failure" in status and error_msg in message:
                print("Status FAILED")
                print(event_full)
                # The trailing [NNN] in the message is the current etcd index
                match = re.search(r"^401.*\[(\d*)\]$", message)
                if match and match.group(1):
                    # Index number plus one
                    # See https://github.com/coreos/etcd/blob/36c655f29b9fef15c4094093c7407967c5c5bb96/Documentation/v2/api.md#waiting-for-a-change
                    resource_version = int(match.group(1)) + 1
                    print("Resource Version: %s" % resource_version)
                    kwargs['resource_version'] = resource_version
                    print(kwargs)
                    print("calling new generator:")
                    # Reconnect at the new resource_version and keep yielding events
                    for new_event in self.streamPersist(func, *args, **kwargs):
                        yield new_event
            else:
                yield event_full

In a cluster with a lot of activity and a lot of objects, it can take some time to stream all the data (especially if you are pulling all-namespaces data); with a regular watch call the generator terminates once we receive the 401 message.

In the code above I extended the Watch class with a recursive generator, streamPersist, that makes a new stream call to the API as soon as it detects the 401 message. It's important to reconnect as soon as possible once we determine the index/resource_version and to pass that as a parameter to the new stream call.
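
Used in place of the plain stream call from my first comment, it would look roughly like this (just a sketch, reusing the same api object as above):

persistent_watch = MyWatch()
for event in persistent_watch.streamPersist(api.list_replication_controller_for_all_namespaces,
                                            resource_version=0,
                                            _request_timeout=(30, 300)):
    # events from any reconnected stream are yielded transparently here
    print(event['type'])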

@mbohlool Do you think the native stream function should be expanded to deal with these API issues?
It seems this 401 shows up in other forms as well; please see this thread

I can certainly help out; the thing is, I'm not sure how to replicate this issue outside of my environment. The code above certainly has a lot of room for improvement.

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 22, 2019
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 22, 2019
fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
