[BUG] AthenaAPI cannot detect the supplied database if there are too many databases #3001

@robert-harris-tapad

What happens?

Hi,

I'm running splink 4.0.16 on a SageMaker Studio JupyterLab instance on AWS.

I am able to connect to Athena via the API. However, I cannot connect to the database (that exists!) that I prefer to use. This is the relevant snippet:

import boto3
from splink.backends.athena import AthenaAPI

REGION = "us-west-2"
S3_OUTPUT_LOCATION = "MYS3OUTPUTLOCATION"

boto3_session = boto3.Session(region_name=REGION)
aws_filepath = S3_OUTPUT_LOCATION
# bucket, database and filepath are set earlier in the notebook cell (not shown here)
db_api = AthenaAPI(
    boto3_session,
    output_bucket=bucket,
    output_database=database,
    output_filepath=filepath,
)

But upon execution, I get the following traceback


InvalidAWSBucketOrDatabase Traceback (most recent call last)
Cell In[4], line 24
22 boto3_session = boto3.Session(region_name="us-west-2")
23 aws_filepath = S3_OUTPUT_LOCATION
---> 24 db_api = AthenaAPI(
25 boto3_session,
26 output_bucket=bucket,
27 output_database=database,
28 output_filepath=filepath,
29 )
30 import numpy as np

File /opt/conda/lib/python3.12/site-packages/splink/internals/athena/database_api.py:41, in AthenaAPI.__init__(self, boto3_session, output_database, output_bucket, output_filepath)
37 raise ValueError("Please enter a valid boto3 session object.")
39 self.sql_dialect = "presto"
---> 41 _verify_athena_inputs(output_database, output_bucket, boto3_session)
42 self.boto3_session = boto3_session
43 self.output_schema = output_database

File /opt/conda/lib/python3.12/site-packages/splink/internals/athena/athena_helpers/athena_utils.py:31, in _verify_athena_inputs(database, bucket, boto3_session)
29 database_bucket_txt = " and ".join(errors)
30 do_does_grammar = ["does", "it"] if len(errors) == 1 else ["do", "them"]
---> 31 raise InvalidAWSBucketOrDatabase(
32 athena_warning_text(database_bucket_txt, do_does_grammar)
33 )

InvalidAWSBucketOrDatabase:
The supplied database '[database]' that you have requested to write to does not currently exist.

Create it either directly from within AWS, or by using 'awswrangler.athena.create_athena_bucket' for buckets or 'awswrangler.catalog.create_database' for databases using the awswrangler API.

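It looks like the code checks whether the database exists in _verify_athena_inputs (splink/internals/athena/athena_helpers/athena_utils.py, as shown in the traceback).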

When I manually run
wr.catalog.databases(boto3_session=boto3_session).values

I see only 100 databases (the default limit of wr.catalog.databases), and the one I want is not among them. When I change it to

wr.catalog.databases(limit=200,boto3_session=boto3_session).values

I do see the db. So there appears to be a bug in the code that checks for the database.
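
For illustration, here is a minimal sketch of an existence check that does not depend on a row limit. It assumes the Athena databases live in the AWS Glue Data Catalog and walks every page of the Glue `get_databases` paginator. The helper name `database_exists` is hypothetical and is not part of splink or awswrangler. Whether splink's own check should be changed this way is for the maintainers to decide.

```python
from typing import Any


def database_exists(glue_client: Any, database: str) -> bool:
    """Return True if `database` is listed in the Glue Data Catalog.

    Hypothetical helper: it iterates over every page of results, so the
    answer does not depend on a fixed limit such as 100 rows.
    """
    paginator = glue_client.get_paginator("get_databases")
    for page in paginator.paginate():
        for entry in page.get("DatabaseList", []):
            if entry.get("Name") == database:
                return True
    return False


# Usage, assuming an existing boto3 session as in the snippet above:
#   glue_client = boto3_session.client("glue")
#   database_exists(glue_client, "my_database")
```

Passing the client in, rather than creating it inside the function, keeps the check easy to exercise with a stub client when no AWS credentials are available.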

To Reproduce

See in the description

OS:

AWS Sagemaker studio instance

Splink version:

4.0.16

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Labels: bug (Something isn't working)
