Can't distinguish high match weights for clustering #3002

@ADBond

Splink can't distinguish between match weights above around 53 when clustering, so any record pair above this point is treated as having an effectively infinite match weight. This is a precision issue: in clustering we work in terms of match probability, and from around this point the probability rounds to 1.0.

The actual figure may depend on the backend, for example if it were to implicitly use a higher-precision (or lower!) type.
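To illustrate the rounding, here is a minimal sketch in plain Python, assuming 64-bit floats and the usual conversion probability = 2^w / (1 + 2^w). The helper names `match_weight_to_probability` and `first_saturating_weight` are made up for this example and are not Splink functions.

```python
def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 Bayes factor) to a match probability."""
    bayes_factor = 2.0 ** match_weight
    return bayes_factor / (1.0 + bayes_factor)


def first_saturating_weight(max_weight: int = 200) -> int:
    """Return the smallest integer match weight whose probability equals 1.0."""
    for mw in range(1, max_weight + 1):
        if match_weight_to_probability(mw) == 1.0:
            return mw
    raise ValueError("no saturation found up to max_weight")


# Print a few probabilities either side of the point where precision runs out
for mw in (40, 52, 53, 54, 100):
    print(mw, repr(match_weight_to_probability(mw)))

print("first weight that rounds to 1.0:", first_saturating_weight())
```

Once two different weights map to the same probability of 1.0, a threshold expressed as a probability can no longer separate them, which is the behaviour described above.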

Example problem (Splink 5 syntax, but the issue predates it):

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator
from splink.internals.misc import threshold_args_to_match_prob

target_mw = 53

def u_prob_for_mw(mw: float) -> float:
    # With m = 0.5 the Bayes factor is m / u = 2**mw, i.e. a match weight of mw
    return 0.5 / (2**mw)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch(
            "first_name"
        ).configure(
            m_probabilities=[0.5, 0.5],
            u_probabilities=[u_prob_for_mw(target_mw), 1 - u_prob_for_mw(target_mw)],
        ),
    ],
    probability_two_random_records_match=0.5,
)

data = [
    {
        "unique_id": 1,
        "first_name": "Andy",
        "surname": None,
    },
    {
        "unique_id": 2,
        "first_name": "Andy",
        "surname": None,
    }
]

db_api = DuckDBAPI()
sdf = db_api.register(data)
linker = Linker(sdf, settings)

df_e = linker.inference.predict()
df_e.as_duckdbpyrelation().show()  # match weight of only edge is 53
df_c = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_e,
    threshold_match_weight=100,
)
df_c.as_duckdbpyrelation().show()  # nevertheless they are clustered together

print(
    threshold_args_to_match_prob(
        threshold_match_probability=None,
        threshold_match_weight=target_mw,
    )
)  # in Python this is not quite 1, but in the backend the match_prob is 1.0

I encountered this in a testing setup - I would not normally cluster at a very high threshold, but was surprised when the logic nevertheless failed.

I think we could probably switch to using match weight as the basis for filtering edges, converting any probability threshold to a match weight rather than the other way round, and we would avoid this issue.
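A rough sketch of that idea, in plain Python rather than Splink's actual SQL-based implementation. The function names `threshold_to_match_weight` and `filter_edges`, and the edge tuples, are hypothetical and only illustrate comparing in match-weight space:

```python
import math
from typing import List, Optional, Tuple

Edge = Tuple[int, int, float]  # (unique_id_l, unique_id_r, match_weight)


def threshold_to_match_weight(
    threshold_match_probability: Optional[float] = None,
    threshold_match_weight: Optional[float] = None,
) -> float:
    """Normalise either kind of threshold to a match weight (hypothetical helper)."""
    if threshold_match_weight is not None:
        return float(threshold_match_weight)
    if threshold_match_probability is None:
        return -math.inf  # no threshold supplied: keep every edge
    p = threshold_match_probability
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return math.log2(p / (1.0 - p))


def filter_edges(edges: List[Edge], threshold_weight: float) -> List[Edge]:
    """Keep only edges whose match weight meets the threshold."""
    return [edge for edge in edges if edge[2] >= threshold_weight]


# Illustrative edges: weights 53 and 120 are indistinguishable as probabilities
edges = [(1, 2, 53.0), (2, 3, 120.0), (3, 4, -2.0)]
threshold = threshold_to_match_weight(threshold_match_weight=100)
print(filter_edges(edges, threshold))
```

Because the comparison happens on match weights, an edge at 53 stays below a threshold of 100, whereas both values become 1.0 once converted to probabilities.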

Labels: bug (Something isn't working)
