Changes from 13 commits
90 changes: 45 additions & 45 deletions .github/workflows/tests.yml
@@ -31,52 +31,52 @@ jobs:
       matrix:
         os: [ubuntu-latest]
         python-version: ["3.9"]
-        pytest_args: [tests]
+        pytest_args: [tests/benchmarks/test_dataframe.py]
         runtime-version: [upstream, latest, "0.0.4", "0.1.0"]
-        include:
-          # Run stability tests on Python 3.8
-          - pytest_args: tests/stability
-            python-version: "3.8"
-            runtime-version: upstream
-            os: ubuntu-latest
-          - pytest_args: tests/stability
-            python-version: "3.8"
-            runtime-version: latest
-            os: ubuntu-latest
-          - pytest_args: tests/stability
-            python-version: "3.8"
-            runtime-version: "0.0.4"
-            os: ubuntu-latest
-          - pytest_args: tests/stability
-            python-version: "3.8"
-            runtime-version: "0.1.0"
-            os: ubuntu-latest
-          # Run stability tests on Python 3.10
-          - pytest_args: tests/stability
-            python-version: "3.10"
-            runtime-version: upstream
-            os: ubuntu-latest
-          - pytest_args: tests/stability
-            python-version: "3.10"
-            runtime-version: latest
-            os: ubuntu-latest
-          - pytest_args: tests/stability
-            python-version: "3.10"
-            runtime-version: "0.0.4"
-            os: ubuntu-latest
-          - pytest_args: tests/stability
-            python-version: "3.10"
-            runtime-version: "0.1.0"
-            os: ubuntu-latest
-          # Run stability tests on Python Windows and MacOS (latest py39 only)
-          - pytest_args: tests/stability
-            python-version: "3.9"
-            runtime-version: latest
-            os: windows-latest
-          - pytest_args: tests/stability
-            python-version: "3.9"
-            runtime-version: latest
-            os: macos-latest
+        # include:
+        #   # Run stability tests on Python 3.8
+        #   - pytest_args: tests/stability
+        #     python-version: "3.8"
+        #     runtime-version: upstream
+        #     os: ubuntu-latest
+        #   - pytest_args: tests/stability
+        #     python-version: "3.8"
+        #     runtime-version: latest
+        #     os: ubuntu-latest
+        #   - pytest_args: tests/stability
+        #     python-version: "3.8"
+        #     runtime-version: "0.0.4"
+        #     os: ubuntu-latest
+        #   - pytest_args: tests/stability
+        #     python-version: "3.8"
+        #     runtime-version: "0.1.0"
+        #     os: ubuntu-latest
+        #   # Run stability tests on Python 3.10
+        #   - pytest_args: tests/stability
+        #     python-version: "3.10"
+        #     runtime-version: upstream
+        #     os: ubuntu-latest
+        #   - pytest_args: tests/stability
+        #     python-version: "3.10"
+        #     runtime-version: latest
+        #     os: ubuntu-latest
+        #   - pytest_args: tests/stability
+        #     python-version: "3.10"
+        #     runtime-version: "0.0.4"
+        #     os: ubuntu-latest
+        #   - pytest_args: tests/stability
+        #     python-version: "3.10"
+        #     runtime-version: "0.1.0"
+        #     os: ubuntu-latest
+        #   # Run stability tests on Python Windows and MacOS (latest py39 only)
+        #   - pytest_args: tests/stability
+        #     python-version: "3.9"
+        #     runtime-version: latest
+        #     os: windows-latest
+        #   - pytest_args: tests/stability
+        #     python-version: "3.9"
+        #     runtime-version: latest
+        #     os: macos-latest

     steps:
       - name: Checkout
Binary file added tests/benchmark.db
Binary file not shown.
Binary file added tests/benchmarks/benchmark.db
Binary file not shown.
25 changes: 25 additions & 0 deletions tests/benchmarks/test_dataframe.py
@@ -1,3 +1,7 @@
+from time import time
+
+import numpy as np
+from dask.datasets import timeseries
 from dask.sizeof import sizeof
 from dask.utils import format_bytes

@@ -58,3 +62,24 @@ def test_shuffle(small_client):
     shuf = df.shuffle(0, shuffle="tasks")
     result = shuf.size
     wait(result, small_client, 20 * 60)
+
+
+def test_ddf_isin(small_client):
+    """
+    Checks the efficiency of serializing a large list for filtering
+    a dask dataframe, and filtering the dataframe by column
+    based on that list
+    """
+    start = time()
+    n = 10_000_000
+    rs = np.random.RandomState(42)
+    ddf = timeseries(end="2000-05-01", dtypes={"A": float, "B": int}, seed=42)
+    ddf.A = ddf.A.mul(1e7)
Contributor

It might be worth a comment here on why we need these next two lines. Is it a cardinality issue?

Contributor Author

Added comments.
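
For readers following the cardinality question above, here is a minimal sketch of what the scale-then-cast step does. It assumes the float column is roughly uniform on [-1, 1); the array `floats` is a hypothetical stand-in for column `A`, not the actual dask data, and the sample size is arbitrary.

```python
import numpy as np

# Assumption: values are roughly uniform on [-1, 1); this array only
# stands in for the float column "A" used in the test.
rs = np.random.RandomState(42)
floats = rs.uniform(-1.0, 1.0, size=100_000)

# Casting directly truncates toward zero, so nearly every value becomes 0.
direct = floats.astype(int)

# Scaling first spreads the values over roughly +/- 1e7 before truncation.
scaled = (floats * 1e7).astype(int)

direct_unique = np.unique(direct).size
scaled_unique = np.unique(scaled).size
print(f"unique values without scaling: {direct_unique}")
print(f"unique values with scaling:    {scaled_unique}")
```

Under that assumption, skipping the multiplication would leave an integer column with almost no distinct values, so an `isin` filter against a large list would have little to match.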

+    ddf.A = ddf.A.astype(int).persist()
+    a_column_unique_values = np.arange(1, n // 10)
Contributor

Nitpick: it looks like we only use n once. Do we need to create a variable (line 71)? Is this a number that could potentially change, or did we choose it arbitrarily?

Contributor Author

Cleared up by aligning 1e7 to N. Yes, the value could change.
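
As a rough illustration of how a single constant can drive the filter list, the sketch below rebuilds the list outside of dask. The smaller `N` is a hypothetical value chosen so the snippet runs quickly, and the names `candidates`, `filter_values` and `n_distinct` are illustrative only.

```python
import numpy as np

# Hypothetical, smaller N so the sketch runs quickly; the test uses 10_000_000.
N = 100_000

rs = np.random.RandomState(42)
# Candidate values 1 .. N // 10 - 1, mirroring np.arange(1, n // 10).
candidates = np.arange(1, N // 10)

# RandomState.choice samples with replacement by default, so the resulting
# list can contain repeated values.
filter_values = sorted(rs.choice(candidates, len(candidates) // 2).tolist())

n_distinct = len(set(filter_values))
print(f"{len(filter_values)} values drawn, {n_distinct} distinct")
```

If distinct filter values were wanted, passing `replace=False` to `choice` would be one option; the diff as shown keeps the default.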

+    filter_values_list = sorted(
+        rs.choice(a_column_unique_values, len(a_column_unique_values) // 2).tolist()
+    )
+    tmp_ddf = ddf.loc[ddf["A"].isin(filter_values_list)]
+    wait(tmp_ddf, small_client, 20 * 60)
+    print(f"Total time to run test_isin: {time() - start} seconds")
hayesgb marked this conversation as resolved.
Outdated
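
The docstring of test_ddf_isin mentions the cost of serializing a large list. A rough way to get a feel for that cost, using only the standard library, is sketched below. It assumes pickle is an acceptable proxy; dask.distributed has its own serialization layer, so the real payload may differ. The names `rng`, `payload` and `size_mib` are illustrative.

```python
import pickle
import random

# Stand-in for filter_values_list: about half a million sorted Python ints.
rng = random.Random(42)
candidates = range(1, 1_000_000)
filter_values = sorted(rng.choices(candidates, k=len(candidates) // 2))

# Assumption: pickle is only a rough proxy for what the scheduler and
# workers actually exchange when the list is embedded in the task graph.
payload = pickle.dumps(filter_values, protocol=pickle.HIGHEST_PROTOCOL)
size_mib = len(payload) / 2**20
print(f"{len(filter_values)} values -> {size_mib:.2f} MiB pickled")
```

A list of this size is large enough that shipping it with the graph is a measurable part of the benchmark, which is what the test aims to exercise.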