Test ddf isin with large list #414
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,3 +1,7 @@ |
| | | from time import time |
| | | |
| | | import numpy as np |
| | | from dask.datasets import timeseries |
| | | from dask.sizeof import sizeof |
| | | from dask.utils import format_bytes |
| | | |
| | | @@ -58,3 +62,24 @@ def test_shuffle(small_client): |
| | | shuf = df.shuffle(0, shuffle="tasks") |
| | | result = shuf.size |
| | | wait(result, small_client, 20 * 60) |
| | | |
| | | |
| | | def test_ddf_isin(small_client): |
| | | """ |
| | | Checks the efficiency of serializing a large list for filtering |
| | | a dask dataframe, and filtering the dataframe by column |
| | | based on that list |
| | | """ |
| | | start = time() |
| | | n = 10_000_000 |
| | | rs = np.random.RandomState(42) |
| | | ddf = timeseries(end="2000-05-01", dtypes={"A": float, "B": int}, seed=42) |
| | | ddf.A = ddf.A.mul(1e7) |
| | | ddf.A = ddf.A.astype(int).persist() |
| | | a_column_unique_values = np.arange(1, n // 10) |
Contributor
nitpick, it looks like we only use
Contributor, Author
Cleared up by aligning
| | | filter_values_list = sorted( |
| | | rs.choice(a_column_unique_values, len(a_column_unique_values) // 2).tolist() |
| | | ) |
| | | tmp_ddf = ddf.loc[ddf["A"].isin(filter_values_list)] |
| | | wait(tmp_ddf, small_client, 20 * 60) |
| | | print(f"Total time to run test_isin: {time() - start} seconds") |
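One detail of the filter list that may be worth keeping in mind: `RandomState.choice` samples with replacement unless `replace=False` is passed, so `filter_values_list` can contain repeated values, and its number of distinct entries can be smaller than its length. The following is a minimal sketch of that effect at a much smaller scale. It uses the standard library's `random.Random.choices` as a stand-in for the NumPy call, and the names `unique_values`, `filter_values`, and `distinct` are illustrative only, not part of the PR.

```python
import random

# Small stand-in for a_column_unique_values = np.arange(1, n // 10)
unique_values = list(range(1, 1_000))

# random.Random.choices draws WITH replacement, mirroring the default
# behaviour of RandomState.choice (replace=True)
rng = random.Random(42)
filter_values = sorted(rng.choices(unique_values, k=len(unique_values) // 2))

# Duplicates collapse when the values are put in a set
distinct = set(filter_values)
print(f"list length: {len(filter_values)}, distinct values: {len(distinct)}")
```

If a duplicate-free filter list were wanted, sampling without replacement (for example `replace=False` in NumPy) would be one option. Whether that matters for what this benchmark is meant to measure is a judgment call for the PR authors.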
hayesgb marked this conversation as resolved.
Outdated
It might be worth a comment here on why we need these next two lines. Is it a cardinality issue?
Added comments.
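On the timing in `test_ddf_isin`: the test takes the difference of two `time()` readings, which follow the system clock and can be skewed if that clock is adjusted mid-run. A possible alternative, sketched here as an assumption rather than anything the PR does, is `time.perf_counter`, which is monotonic and intended for measuring intervals. The `timed` helper and its `report` parameter are hypothetical names introduced for illustration.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, report=print):
    """Hypothetical helper: record and report the time spent in the with-block."""
    result = {}
    start = perf_counter()
    try:
        yield result
    finally:
        # perf_counter is monotonic, so the difference is never negative
        result["seconds"] = perf_counter() - start
        report(f"Total time to run {label}: {result['seconds']:.2f} seconds")


with timed("test_ddf_isin") as timing:
    # Stand-in workload; in the test this would be building the filtered
    # dataframe and waiting on it
    sum(range(100_000))
```

The elapsed value stays available in `timing["seconds"]` after the block, so it could be asserted on or logged elsewhere instead of only printed.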