Search: allow ignoring files from indexing #7308

stsewd · 2020-07-21T19:18:08Z

We can't stop creating these models since we need them for the CDN, so I had to add an ignore field.

stsewd · 2020-07-21T19:19:35Z

readthedocs/search/documents.py

-        #     'genindex.html',
-        #     'py-modindex.html',
-        #     'search/index.html',
-        #     'genindex/index.html',
-        #     'py-modindex/index.html',


Not sure if we should add these to the defaults too.

I mean, they are Sphinx-only.

humitos

This feature is super cool! 💯

humitos · 2020-07-22T08:54:46Z

readthedocs/config/config.py

+                '404.html',
+                '404/index.html',
+            ]
+            search_ignore = self.pop_config('search.ignore', ignore_default)


If the user adds a page, say ignoreme.html, do we stop ignoring the entries in ignore_default? Shouldn't we combine the user list with the default list?

I don't think it's a good idea to combine them; some users may want to override this in order to get rid of the defaults.

I don't think users will want results from 404 or search pages.

With the current behavior, adding just one page to be ignored means adding 5 entries in the config file: your page and the defaults.

You can still have valid content in 404/index.html and in search/index.html. 404.html and search.html are pages that almost never have valid content. We shouldn't prevent users from opting out of the defaults.
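
To make the two behaviors debated in this thread concrete, here is a minimal sketch. The function names are hypothetical and exist only for illustration; the default list is the one documented in this PR. `resolve_override` reflects what `pop_config('search.ignore', ignore_default)` appears to do in the diff (the user's list replaces the defaults), while `resolve_merged` shows the alternative that was proposed.

```python
# Defaults as documented in this PR.
DEFAULT_IGNORE = ['search.html', 'search/index.html', '404.html', '404/index.html']


def resolve_override(user_ignore=None):
    """Override semantics: a user-supplied list replaces the defaults entirely."""
    if user_ignore is None:
        return list(DEFAULT_IGNORE)
    return list(user_ignore)


def resolve_merged(user_ignore=None):
    """Merge semantics (the proposed alternative): user patterns extend the defaults."""
    merged = list(DEFAULT_IGNORE)
    for pattern in user_ignore or []:
        if pattern not in merged:
            merged.append(pattern)
    return merged


print(resolve_override(['ignoreme.html']))  # only the user's pattern survives
print(resolve_merged(['ignoreme.html']))    # defaults plus the user's pattern
print(resolve_override([]))                 # an empty list opts out of every default
```

With override semantics, ignoring one extra page while keeping the defaults means listing all five entries. With merge semantics, there would be no way to get `404/index.html` indexed again.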

humitos · 2020-07-22T08:58:13Z

docs/config-file/v2.rst

+        # Ignore all files under the search/ directory
+        - search/*


What happens with some/path/search/index.html? Is it ignored or not? Using relative paths here may confuse users. We could make this clear here.

All patterns are matched from the beginning.

You can start a path with / in the config, but it's the same as one without it. Using an absolute path, as in saying it will ignore everything under /search/, isn't quite right either, since the file is probably at /en/latest/search/..

So, I'm not sure how to make this example clearer.
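
A small sketch of what "matched from the beginning" means for the `search/*` example. It assumes `fnmatch`-style matching from Python's standard library and paths relative to the root of the built version; the thread does not name the matcher, so treat both as assumptions. `is_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_ignored(path, patterns):
    """Return True if the built file path matches any ignore pattern.

    fnmatchcase matches the whole path, so a pattern never matches
    starting from the middle of a path.
    """
    return any(fnmatchcase(path, pattern) for pattern in patterns)


patterns = ['search/*']
for path in [
    'search/index.html',            # directly under search/
    'search/api/index.html',        # `*` also spans `/` in fnmatch
    'some/path/search/index.html',  # search/ is not at the beginning
]:
    print(path, is_ignored(path, patterns))
```

Under these assumptions, `some/path/search/index.html` is not ignored by `search/*`; a pattern such as `*/search/*` would be needed to reach nested `search/` directories.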

humitos · 2020-07-22T09:15:22Z

docs/config-file/v2.rst

+:Default: ``['search.html', 'search/index.html', '404.html', '404/index.html']``
+
+Patterns are matched against the final html pages produced by the build
+(you should try to match `index.html`, not `docs/index.rst`).


Which pattern should be used if the project is built with pretty URLs? / for the index? /path/to/page/ for another page? This could be explained in the docs as well.

From the defaults, I would guess that if I want to ignore /404/, I will need to put /404/index.html. If that's correct, we definitely need to explain this because it will cause a lot of confusion.

Also, having the URL dissociated from the path you need to put here is not good UX, IMO. However, I'm not sure how we can improve that. Is it possible to just use URLs here instead of file paths?

We index content from paths, so it makes sense to use paths, as we do with rankings.

We index content from paths, so makes sense to use paths

I agree that makes sense from our side. I'm wondering whether it's the same from the user's perspective. They see a URL like /en/latest/404/ and they need to know that 404/index.html is the correct value, which looks completely different from the URL.

I don't think it's worth trying to guess the correct path; that introduces a lot of problems rather than helping. We already ask users to match the final path rather than the source file.
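
For readers following the URL-versus-path point, this is a rough illustration of the mapping a user has to do by hand. It is not part of this PR. `url_to_pattern` is a hypothetical helper, and it assumes a `/<language>/<version>/` URL prefix plus a pretty-URL build where a trailing slash corresponds to a directory's `index.html`. Other builders lay out files differently, which is the guessing problem described above.

```python
from urllib.parse import urlsplit


def url_to_pattern(url, prefix_segments=2):
    """Hypothetical helper: derive the built file path for a served URL.

    Assumes the first `prefix_segments` path segments are the language
    and version (e.g. /en/latest/) and are not part of the file path.
    """
    path = urlsplit(url).path
    segments = [segment for segment in path.split('/') if segment]
    rest = segments[prefix_segments:]
    # A URL ending in "/" is assumed to be served from a directory's index.html.
    if path.endswith('/') or not rest:
        rest.append('index.html')
    return '/'.join(rest)


print(url_to_pattern('/en/latest/404/'))
print(url_to_pattern('/en/latest/search.html'))
```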

humitos · 2020-07-23T14:09:04Z

readthedocs/config/tests/test_config.py

+        ('/foo/bar', 'foo/bar'),
+        ('///foo//bar', 'foo/bar'),
+        ('///foo//bar/', 'foo/bar'),
+        ('/foo/bar/../', 'foo'),
+        ('/foo*', 'foo*'),
+        ('/foo/bar/*', 'foo/bar/*'),
+        ('/foo/bar?/*', 'foo/bar?/*'),


I would make a decision here and support only one case: with or without a leading /, but not both. Having two values for the same config may only confuse users: /foo/bar will work and foo/bar too. Both will have the same effect, but as a user I would expect different things.

I think it makes more sense to use relative paths here, considering that /en/latest/ is not taken into account here. What do you think?

I don't find that confusing; it would be annoying to fail the build just because you included or omitted the leading /.
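
The normalization exercised by the test cases in this hunk can be sketched as follows. This is one possible implementation that satisfies those input/output pairs, not necessarily the code in the PR; `normalize_pattern` is a hypothetical name.

```python
import posixpath


def normalize_pattern(value):
    """Normalize a path pattern to a relative form without a leading slash.

    posixpath.normpath collapses repeated slashes, resolves `..` and drops
    a trailing slash; glob characters such as `*` and `?` are left untouched.
    """
    normalized = posixpath.normpath(value)
    return normalized.lstrip('/')


# Input/output pairs copied from the test cases in this hunk.
cases = [
    ('/foo/bar', 'foo/bar'),
    ('///foo//bar', 'foo/bar'),
    ('///foo//bar/', 'foo/bar'),
    ('/foo/bar/../', 'foo'),
    ('/foo*', 'foo*'),
    ('/foo/bar/*', 'foo/bar/*'),
    ('/foo/bar?/*', 'foo/bar?/*'),
]
for raw, expected in cases:
    print(raw, '->', normalize_pattern(raw), '(expected:', expected + ')')
```

Because both `/foo/bar` and `foo/bar` normalize to the same value, accepting either form costs nothing at match time, which is the trade-off being weighed here.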

humitos · 2020-07-23T14:12:00Z

readthedocs/config/config.py

+                '404.html',
+                '404/index.html',
+            ]
+            search_ignore = self.pop_config('search.ignore', ignore_default)


I don't think users will want results from 404 or search pages.

With the current behavior, adding just one page to be ignored means adding 5 entries in the config file: your page and the defaults.

humitos · 2020-07-23T14:15:43Z

docs/config-file/v2.rst

+Don't index files matching a pattern.
+This is, you won't see search results from these files.


Reading the code, I guess that the ignored page will affect only the version that has this config file, right? If that's correct, we should probably communicate this in the documentation.

All options from the configuration file are per-version.

Are we mentioning that somewhere? That's my point.

We do in https://docs.readthedocs.io/en/stable/config-file/index.html, but I can see it being helpful to mention it again at the top of this file.

readthedocs/projects/tasks.py

stale · 2020-09-06T23:23:07Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ericholscher · 2020-09-07T17:00:49Z

We should try and get this merged, it looks useful.

humitos · 2020-09-08T09:52:55Z

@ericholscher you may want to give your opinion at #7308 (comment) and #7308 (comment). Other than that, I think this is ready to be merged.

stsewd · 2020-09-08T15:55:44Z

We already use the same for search rankings, so I think we should use the same for ignoring files.

humitos · 2020-09-09T15:20:48Z

OK. That makes sense, there is no need to have two different ways to express similar things.

Search: allow ignoring files from indexing

b66e817

Closes #5247 Ref #7217

stsewd commented Jul 21, 2020

stsewd requested a review from a team July 21, 2020 21:54

humitos reviewed Jul 22, 2020

humitos reviewed Jul 23, 2020

stale bot added the Status: stale Issue will be considered inactive soon label Sep 6, 2020

ericholscher added the Accepted Accepted issue on our roadmap label Sep 7, 2020

stale bot removed the Status: stale Issue will be considered inactive soon label Sep 7, 2020

humitos approved these changes Sep 9, 2020

stsewd added 3 commits September 9, 2020 15:46

Merge branch 'master' into search-ignore

0a8b99c

Update migration

27717cb

Update tests

8498779

stsewd merged commit 57273b8 into master Sep 9, 2020

stsewd deleted the search-ignore branch September 9, 2020 21:40

stsewd mentioned this pull request Oct 27, 2020

Search related settings in the configuration file #7217

Closed
