Splink can't distinguish between match weights above around 53 when clustering: any record pairs above this point are treated as if their match weight were effectively infinite. This is a precision issue - clustering works in terms of match probability, and from around this point the probability rounds to exactly 1.0 in a 64-bit float.
The exact cutoff may depend on the backend - for example, if it implicitly uses a higher-precision (or lower-precision!) floating-point type.
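The saturation point can be reproduced without Splink at all. This is a minimal sketch, assuming the backend computes probability as 2^mw / (1 + 2^mw) in its native float type (the helper here is illustrative, not Splink code):

```python
import numpy as np  # used only to emulate a 32-bit-float backend


def mw_to_prob(mw, f=float):
    # Illustrative only: match probability from match weight (prior of 0.5),
    # evaluated in the float type `f` - float64 by default.
    bayes_factor = f(2.0) ** f(mw)
    return bayes_factor / (f(1.0) + bayes_factor)


# float64: 1.0 + 2.0**53 rounds to 2.0**53, so probability saturates at mw 53
print(mw_to_prob(52))         # still distinguishable from 1.0
print(mw_to_prob(53) == 1.0)  # True

# a float32 backend would saturate much earlier, at mw 24
print(mw_to_prob(24, np.float32) == 1.0)  # True
```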
Example problem (Splink 5 syntax, but the issue predates it):
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator
from splink.internals.misc import threshold_args_to_match_prob

target_mw = 53


def u_prob_for_mw(mw: float) -> float:
    return 0.5 / (2**mw)


settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch(
            "first_name"
        ).configure(
            m_probabilities=[0.5, 0.5],
            u_probabilities=[u_prob_for_mw(target_mw), 1 - u_prob_for_mw(target_mw)],
        ),
    ],
    probability_two_random_records_match=0.5,
)

data = [
    {
        "unique_id": 1,
        "first_name": "Andy",
        "surname": None,
    },
    {
        "unique_id": 2,
        "first_name": "Andy",
        "surname": None,
    },
]

db_api = DuckDBAPI()
sdf = db_api.register(data)
linker = Linker(sdf, settings)

df_e = linker.inference.predict()
df_e.as_duckdbpyrelation().show()  # match weight of the only edge is 53

df_c = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_e,
    threshold_match_weight=100,
)
df_c.as_duckdbpyrelation().show()  # nevertheless the records are clustered together

print(
    threshold_args_to_match_prob(
        threshold_match_probability=None,
        threshold_match_weight=target_mw,
    )
)  # actually not quite 1, but here the match_prob is 1.0 in the backend
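The effect on the threshold comparison can be isolated in a few lines. This is a sketch, not Splink's actual SQL, assuming clustering keeps an edge when edge_probability >= threshold_probability in float64:

```python
def mw_to_prob(mw: float) -> float:
    # Illustrative conversion only: probability = 2^mw / (1 + 2^mw),
    # evaluated in float64 as a backend would.
    bayes_factor = 2.0 ** mw
    return bayes_factor / (1.0 + bayes_factor)


edge_p = mw_to_prob(53.0)        # the only edge in the example above
threshold_p = mw_to_prob(100.0)  # the clustering threshold
# Both collapse to exactly 1.0, so the filter keeps the edge:
print(edge_p, threshold_p, edge_p >= threshold_p)  # 1.0 1.0 True
```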
I encountered this in a testing setup - I would not normally cluster at a very high threshold, but was surprised when the logic nevertheless failed.
I think we could switch to using match weight as the basis for filtering edges, converting thresholds from probabilities rather than the other way round. Match weights stay finite and distinguishable well past the point where probabilities round to 1.0, so this would avoid the issue.
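A sketch of what that conversion could look like. The helper name threshold_args_to_match_weight is hypothetical (a mirror image of the existing threshold_args_to_match_prob, not a function that exists in Splink):

```python
import math


def threshold_args_to_match_weight(threshold_match_probability=None,
                                   threshold_match_weight=None):
    # Hypothetical counterpart to threshold_args_to_match_prob: normalise
    # whichever threshold was supplied to a match weight, so edges can be
    # filtered on the match_weight column directly.
    if threshold_match_weight is not None:
        return float(threshold_match_weight)
    if threshold_match_probability is not None:
        p = threshold_match_probability
        return math.log2(p / (1.0 - p))  # match weight is the log2 odds
    return None  # no thresholding requested


# Filtering on weight now behaves as expected for the example above:
edge_mw = 53.0
print(edge_mw >= threshold_args_to_match_weight(threshold_match_weight=100))  # False
```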