Skip to content

Search: allow ignoring files from indexing #7308

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Sep 9, 2020
Merged

Search: allow ignoring files from indexing #7308

merged 4 commits into from
Sep 9, 2020

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Jul 21, 2020

We can't stop creating this models since we need them for the CDN, so I had to add a ignore field.

Closes #5247
Ref #7217

Comment on lines -154 to -157
# 'genindex.html',
# 'py-modindex.html',
# 'search/index.html',
# 'genindex/index.html',
# 'py-modindex/index.html',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should add these to the defaults too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, they are sphinx only

@stsewd stsewd requested a review from a team July 21, 2020 21:54
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature is super cool! 💯

'404.html',
'404/index.html',
]
search_ignore = self.pop_config('search.ignore', ignore_default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user adds a page, ignoreme.html, are we avoiding the ones from the list ignore_default? Shouldn't we combine the user list with the default list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think is a good idea to combine them, some users may want to override this in order to get rid of the defaults.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think users will want results from 404 or search pages.

With the current behavior, adding just one page to be ignored means adding 5 entries in the config file: your page and the defaults.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can still have valid content in 404/index.html and in search/index.html. 404.html and search.html are pages that kind of always don't have valid content. We shouldn't lock users from not ignoring defaults.

Comment on lines +546 to +547
# Ignore all files under the search/ directory
- search/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happen with some/path/search/index.html? is it ignored or not? Using relative paths here may bring this confusion to users. We could make this clear here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all patterns are matched from the beginning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can start a path with / in the config, but it's the same as one without it. Using an absolute path like saying it will ignore all from /search/ isn't quite right either, since the file probably is at /en/latest/search/..

So, not sure how to make this example more clear.

:Default: ``['search.html', 'search/index.html', '404.html', '404/index.html']``

Patterns are matched against the final html pages produced by the build
(you should try to match `index.html`, not `docs/index.rst`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the pattern to use if the project is built with pretty URLs? / for index? /path/to/page/ for another page? This may be explained in the docs as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the defaults, I would guess that if I want to ignore /404/, I will need to put /404/index.html. If that's correct, we definitely need to explain this because it will bring lot of confusions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, having the URL disassociated from the path you need to put here is not good UX, IMO. However, I'm not sure how we can improve that. Is is possible to just use URLs here instead of path files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We index content from paths, so makes sense to use paths, as we do with rankings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We index content from paths, so makes sense to use paths

I agree that makes sense from our side. I'm thinking if it's the same from the user's perspective. They see a URL like /en/latest/404/ and they need to know that 404/index.html is the correct value, which looks completely different than the URL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't thing is worth it to try to guess the correct path, that introduces a lot of problems rather than helping, we already ask the users to match the final path rather than the source file.

Comment on lines +1931 to +1937
('/foo/bar', 'foo/bar'),
('///foo//bar', 'foo/bar'),
('///foo//bar/', 'foo/bar'),
('/foo/bar/../', 'foo'),
('/foo*', 'foo*'),
('/foo/bar/*', 'foo/bar/*'),
('/foo/bar?/*', 'foo/bar?/*'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make a decision here and support only one case: with or without starting /, but not both. Having to values for the same config may only confuse users: /foo/bar will work and foo/bar too. Both will have the same effect, but I would be expecting different things as a user.

I think it makes more sense to use relative paths here, considering that /en/latest/ is not taken into account here. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see that confusing, it would be annoying to fail the build just because you included or not /

'404.html',
'404/index.html',
]
search_ignore = self.pop_config('search.ignore', ignore_default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think users will want results from 404 or search pages.

With the current behavior, adding just one page to be ignored means adding 5 entries in the config file: your page and the defaults.

Comment on lines +523 to +524
Don't index files matching a pattern.
This is, you won't see search results from these files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the code, I guess that the ignored page will affect only the version that has this config file, right? If that's correct, we should probably communicate this in the documentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All options from the configuration file are per-version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we mentioning that somewhere? That's my point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do in https://docs.readthedocs.io/en/stable/config-file/index.html, but I can see helpful mentioning it again at the top of this file.

@stale
Copy link

stale bot commented Sep 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Sep 6, 2020
@ericholscher ericholscher added the Accepted Accepted issue on our roadmap label Sep 7, 2020
@stale stale bot removed the Status: stale Issue will be considered inactive soon label Sep 7, 2020
@ericholscher
Copy link
Member

We should try and get this merged, it looks useful.

@humitos
Copy link
Member

humitos commented Sep 8, 2020

@ericholscher you may want to give your opinion at #7308 (comment) and #7308 (comment). Other than that, I think this is ready to be merged, IMO

@stsewd
Copy link
Member Author

stsewd commented Sep 8, 2020

We already use the same for search rankings, so I think we should use the same for ignoring files.

@humitos
Copy link
Member

humitos commented Sep 9, 2020

OK. That makes sense, there is no need to have two different ways to express similar things.

@stsewd stsewd merged commit 57273b8 into master Sep 9, 2020
@stsewd stsewd deleted the search-ignore branch September 9, 2020 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make search exclude files a path operation
3 participants