Can't distinguish high match weights for clustering #3002

Splink can't distinguish between match weights above roughly 53 when clustering, so any record pair above this point is treated as having an effectively infinite match weight. This is a precision issue: clustering works in terms of match probability, and from around this point the probability rounds to 1.0.

The exact figure may depend on the backend, for example if it implicitly uses a higher-precision (or lower-precision!) type.
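The rounding can be reproduced outside Splink with plain 64-bit floats. The snippet below is only a sketch: `match_weight_to_probability` is a hypothetical helper written for this illustration, and it assumes the conversion is done as `bf / (1 + bf)` in double precision, which may not be exactly what a given backend does.

```python
def match_weight_to_probability(match_weight: float) -> float:
    # Hypothetical helper: match weight -> Bayes factor -> probability,
    # evaluated entirely in 64-bit floating point.
    bayes_factor = 2.0 ** match_weight
    return bayes_factor / (1.0 + bayes_factor)


for mw in (50, 52, 53, 54, 100):
    p = match_weight_to_probability(mw)
    # Report whether the probability has collapsed to exactly 1.0.
    print(f"match weight {mw:>3}: probability == 1.0 is {p == 1.0}")
```

Under these assumptions, weights of 52 and below still give a probability strictly less than 1.0, while 53 and above all give exactly 1.0, so any threshold in that range is indistinguishable after conversion.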

Example problem (Splink 5 syntax, but the issue predates it):

```py
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator
from splink.internals.misc import threshold_args_to_match_prob

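target_mw = 53


def u_prob_for_mw(mw: float) -> float:
    # With m = 0.5 and a prior of 0.5, the match weight is log2(m / u),
    # so u = 0.5 / 2**mw yields a match weight of exactly mw.
    return 0.5 / (2**mw)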

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch(
            "first_name"
        ).configure(
            m_probabilities=[0.5, 0.5],
            u_probabilities=[u_prob_for_mw(target_mw), 1 - u_prob_for_mw(target_mw)],
        ),
    ],
    probability_two_random_records_match=0.5,
)

data = [
    {
        "unique_id": 1,
        "first_name": "Andy",
        "surname": None,
    },
    {
        "unique_id": 2,
        "first_name": "Andy",
        "surname": None,
    }
]

db_api = DuckDBAPI()
sdf = db_api.register(data)
linker = Linker(sdf, settings)

df_e = linker.inference.predict()
df_e.as_duckdbpyrelation().show()  # match weight of only edge is 53
df_c = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_e,
    threshold_match_weight=100,
)
df_c.as_duckdbpyrelation().show()  # nevertheless they are clustered together

print(
    threshold_args_to_match_prob(
        threshold_match_probability=None,
        threshold_match_weight=target_mw,
    )
)  # not quite 1 here in Python, but the match probability is 1.0 in the backend
```

I encountered this in a testing setup - I would not normally cluster at a very high threshold, but was surprised when the logic nevertheless failed.

I think we could probably avoid this issue by using match weight as the basis for filtering edges, converting any probability threshold to a match weight rather than the other way round.
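As a rough illustration of that idea (not Splink's actual implementation), the filter could compare in match-weight space and only convert when the caller supplies a probability. The function names and the list-of-dicts edge format here are hypothetical, chosen just for this sketch.

```python
import math


def probability_to_match_weight(probability: float) -> float:
    # Hypothetical helper: log2 of the odds, with the extremes mapped to +/- infinity.
    if probability >= 1.0:
        return math.inf
    if probability <= 0.0:
        return -math.inf
    return math.log2(probability / (1.0 - probability))


def filter_edges(edges, threshold_match_weight=None, threshold_match_probability=None):
    # Prefer an explicit match weight; otherwise convert the probability
    # threshold to a weight, so the comparison always happens on weights.
    if threshold_match_weight is None:
        if threshold_match_probability is None:
            raise ValueError("Provide a match weight or a match probability threshold")
        threshold_match_weight = probability_to_match_weight(threshold_match_probability)
    return [e for e in edges if e["match_weight"] >= threshold_match_weight]


edges = [{"unique_id_l": 1, "unique_id_r": 2, "match_weight": 53.0}]
print(filter_edges(edges, threshold_match_weight=100))       # edge is dropped
print(filter_edges(edges, threshold_match_probability=0.5))  # edge is kept
```

Because the comparison never passes through a probability close to 1.0, a threshold of 100 stays distinct from an edge weight of 53 in this sketch.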
