Skip to content

Investigate issue with Distinct operations (find, exists, count, delete) #2525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gregturn opened this issue May 7, 2022 · 1 comment
Closed
Assignees
Labels
for: team-attention An issue we need to discuss as a team to make progress

Comments

@gregturn
Copy link
Contributor

gregturn commented May 7, 2022

When applying a Distinct keyword in Spring Data JPA, it seems to lead to wrong results.

For example, if someone defines a UserRepository with a findDistinctByLastname(String lastname), what is supposed to happen? Should it filter out ALL entries of the table based upon lastname and then do a DISTINCT? Because this can have different results.

Imagine having a data set with:

{
    firstname: "Frodo", lastname: "Baggins", role: "ring bearer"
    firstname: "Bilbo", lastname: "Baggins", role: "burglar"
    firstname: "Samwise", lastname: "Gamgee", role: "gardener"
}

select distinct u.id from User u where u.lastname = Baggins -> you'll get two rows, Frodo and Bilbo.
select distinct u.lastname from User u where u.lastname = ?1-> you'll get one row, one for Baggins.

By default, u.id are the columns selected when doing any form of custom finder with a Distinct applied. This turns out to be correct in the sense that there are no duplicates. But in this situation, the distinct didn't contribute anything. The second query actually doesn't work, because the results are List<String>, and Spring Data doesn't know how to map this onto List<User> without a converter.

This also impacts COUNT operations. For example, a countDistinctByLastname(String lastname) currently ends up yields select distinct count(distinct u.id) from User u where u.lastname = ?1. Now this is bug (see #1380), since it should really be just select count(distinct u.id) from User u where u.lastname = ?1.

Based on the earlier fact, this may surprise you in that countDistinctByLastname("Baggins") will produce a value of 2, since there were 2 rows that had the value of Baggins. But if you were able to to select count(distinct u.lastname) from User u where u.lastname = ?1, the result would be 1.

Since countByLastname("Baggins") produces the same result of 2, you may be wondering what the distinct does. In this case, nothing. To get an appreciable difference, you have to suppress things like primary keys.

In fact, anything involving DISTINCT almost needs to be hand-written. For example, do you want to count how many distinct last names there are?

select u.lastname, count(u.lastname) from User u

Do you want to count how distinct lastnames there are, using a LIKE parameter?

select u.lastname, count(u.lastname) from User u where u.lastname like '%$1%'

Neither of these queries will map onto the repository's original User type.

There is a chance that the query does some left outer joins that result in duplicate entries of the same User type, like "all users enrolled in a class", where if the User is linked to multiple Class entries. But again, is this something expressed through a oin with a DISTINCT applied?

select distinct u from User u JOIN u.classes c

Or are you better off doing some correlated subquery?

select u from User u where exists (
    select 'found' from Class c
    where c in u.classes
)

All of this reveals that our current support for DISTINCT is incomplete and needs to be fleshed out so we can properly support it.

@gregturn gregturn self-assigned this May 7, 2022
@gregturn gregturn added this to the 3.0 M4 (2022.0.0) milestone May 7, 2022
@gregturn gregturn added the for: team-attention An issue we need to discuss as a team to make progress label May 7, 2022
@gregturn
Copy link
Contributor Author

gregturn commented May 9, 2022

In light of a team discussion, we can only from queries as best we can.

We have fixed one distinct issue in #1380. We also updated ref docs to properly share that writing queries with distinct clauses can be tricky and that users should put careful thought in order to get proper results.

@gregturn gregturn closed this as completed May 9, 2022
@gregturn gregturn removed this from the 3.0 M4 (2022.0.0) milestone May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
for: team-attention An issue we need to discuss as a team to make progress
Projects
None yet
Development

No branches or pull requests

1 participant