Investigate issue with Distinct operations (find, exists, count, delete) #2525

gregturn · 2022-05-07T16:34:38Z

When applying a Distinct keyword in Spring Data JPA, it seems to lead to wrong results.

For example, if someone defines a UserRepository with a findDistinctByLastname(String lastname), what is supposed to happen? Should it filter out ALL entries of the table based upon lastname and then do a DISTINCT? Because this can have different results.

Imagine having a data set with:

{
    firstname: "Frodo", lastname: "Baggins", role: "ring bearer"
    firstname: "Bilbo", lastname: "Baggins", role: "burglar"
    firstname: "Samwise", lastname: "Gamgee", role: "gardener"
}

select distinct u.id from User u where u.lastname = Baggins -> you'll get two rows, Frodo and Bilbo.
select distinct u.lastname from User u where u.lastname = ?1-> you'll get one row, one for Baggins.

By default, u.id are the columns selected when doing any form of custom finder with a Distinct applied. This turns out to be correct in the sense that there are no duplicates. But in this situation, the distinct didn't contribute anything. The second query actually doesn't work, because the results are List<String>, and Spring Data doesn't know how to map this onto List<User> without a converter.

This also impacts COUNT operations. For example, a countDistinctByLastname(String lastname) currently ends up yields select distinct count(distinct u.id) from User u where u.lastname = ?1. Now this is bug (see #1380), since it should really be just select count(distinct u.id) from User u where u.lastname = ?1.

Based on the earlier fact, this may surprise you in that countDistinctByLastname("Baggins") will produce a value of 2, since there were 2 rows that had the value of Baggins. But if you were able to to select count(distinct u.lastname) from User u where u.lastname = ?1, the result would be 1.

Since countByLastname("Baggins") produces the same result of 2, you may be wondering what the distinct does. In this case, nothing. To get an appreciable difference, you have to suppress things like primary keys.

In fact, anything involving DISTINCT almost needs to be hand-written. For example, do you want to count how many distinct last names there are?

select u.lastname, count(u.lastname) from User u

Do you want to count how distinct lastnames there are, using a LIKE parameter?

select u.lastname, count(u.lastname) from User u where u.lastname like '%$1%'

Neither of these queries will map onto the repository's original User type.

There is a chance that the query does some left outer joins that result in duplicate entries of the same User type, like "all users enrolled in a class", where if the User is linked to multiple Class entries. But again, is this something expressed through a oin with a DISTINCT applied?

select distinct u from User u JOIN u.classes c

Or are you better off doing some correlated subquery?

select u from User u where exists (
    select 'found' from Class c
    where c in u.classes
)

All of this reveals that our current support for DISTINCT is incomplete and needs to be fleshed out so we can properly support it.

The text was updated successfully, but these errors were encountered:

gregturn · 2022-05-09T15:00:02Z

In light of a team discussion, we can only from queries as best we can.

We have fixed one distinct issue in #1380. We also updated ref docs to properly share that writing queries with distinct clauses can be tricky and that users should put careful thought in order to get proper results.

gregturn self-assigned this May 7, 2022

gregturn added this to the 3.0 M4 (2022.0.0) milestone May 7, 2022

gregturn added the for: team-attention An issue we need to discuss as a team to make progress label May 7, 2022

gregturn mentioned this issue May 7, 2022

The repository countDistinctBy** method generates the wrong sql query [DATAJPA-1036] #1380

Closed

gregturn closed this as completed May 9, 2022

gregturn removed this from the 3.0 M4 (2022.0.0) milestone May 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate issue with Distinct operations (find, exists, count, delete) #2525

Investigate issue with Distinct operations (find, exists, count, delete) #2525

gregturn commented May 7, 2022 •

edited by schauder

Loading

gregturn commented May 9, 2022

Investigate issue with Distinct operations (find, exists, count, delete) #2525

Investigate issue with Distinct operations (find, exists, count, delete) #2525

Comments

gregturn commented May 7, 2022 • edited by schauder Loading

gregturn commented May 9, 2022

gregturn commented May 7, 2022 •

edited by schauder

Loading