Welcome to Software Development on Codidact!
Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.
Post History
Answer After re-encountering this problem and searching for a solution with a less generic "pandas check if column pair is found in other df" (forgetting I had previously encountered this proble...
Answer
#3: Post edited
- # Answer
- After re-encountering this problem and searching for a solution with a
- less generic "pandas check if column pair is found in other df"
- (forgetting I had previously encountered this problem altogether), I
- found some inspiration that led me to a solid solution.
- It is fastest, and most correct to form a column containing tuple values
- of the value pair in question, and use `isin()` to find the matches.
- This column can be later dropped after use.
- ```py
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
- ```
- ```txt
- Die Cell Current Resistance Die, Cell
- 0 1.0 0.0 100.0 104155.0 (1, 0)
- 1 1.0 0.0 200.0 2542.0 (1, 0)
- 2 1.0 0.0 300.0 206302.0 (1, 0)
- 3 NaN NaN NaN NaN NaN
- 4 NaN NaN NaN NaN NaN
- 5 NaN NaN NaN NaN NaN
- 6 1.0 2.0 100.0 141502.0 (1, 2)
- 7 1.0 2.0 200.0 40422.0 (1, 2)
- 8 1.0 2.0 300.0 31066.0 (1, 2)
- 9 1.0 3.0 100.0 108234.0 (1, 3)
- 10 1.0 3.0 200.0 140520.0 (1, 3)
- 11 1.0 3.0 300.0 43586.0 (1, 3)
- 12 1.0 4.0 100.0 138305.0 (1, 4)
- 13 1.0 4.0 200.0 88725.0 (1, 4)
- 14 1.0 4.0 300.0 239517.0 (1, 4)
- 15 NaN NaN NaN NaN NaN
- 16 NaN NaN NaN NaN NaN
- 17 NaN NaN NaN NaN NaN
- 18 1.0 6.0 100.0 125984.0 (1, 6)
- 19 1.0 6.0 200.0 37036.0 (1, 6)
- 20 1.0 6.0 300.0 179742.0 (1, 6)
- 21 NaN NaN NaN NaN NaN
- 22 NaN NaN NaN NaN NaN
- 23 NaN NaN NaN NaN NaN
- 24 NaN NaN NaN NaN NaN
- 25 NaN NaN NaN NaN NaN
- 26 NaN NaN NaN NaN NaN
- 27 1.0 9.0 100.0 186133.0 (1, 9)
- 28 1.0 9.0 200.0 863.0 (1, 9)
- 29 1.0 9.0 300.0 235060.0 (1, 9)
- 30 2.0 0.0 100.0 217692.0 (2, 0)
- 31 2.0 0.0 200.0 192709.0 (2, 0)
- 32 2.0 0.0 300.0 44718.0 (2, 0)
- 33 NaN NaN NaN NaN NaN
- 34 NaN NaN NaN NaN NaN
- 35 NaN NaN NaN NaN NaN
- 36 2.0 2.0 100.0 144522.0 (2, 2)
- 37 2.0 2.0 200.0 184146.0 (2, 2)
- 38 2.0 2.0 300.0 58155.0 (2, 2)
- 39 2.0 3.0 100.0 130899.0 (2, 3)
- 40 2.0 3.0 200.0 177347.0 (2, 3)
- 41 2.0 3.0 300.0 206209.0 (2, 3)
- 42 2.0 4.0 100.0 201781.0 (2, 4)
- 43 2.0 4.0 200.0 58077.0 (2, 4)
- 44 2.0 4.0 300.0 218298.0 (2, 4)
- 45 NaN NaN NaN NaN NaN
- 46 NaN NaN NaN NaN NaN
- 47 NaN NaN NaN NaN NaN
- 48 NaN NaN NaN NaN NaN
- 49 NaN NaN NaN NaN NaN
- 50 NaN NaN NaN NaN NaN
- 51 2.0 7.0 100.0 239666.0 (2, 7)
- 52 2.0 7.0 200.0 10384.0 (2, 7)
- 53 2.0 7.0 300.0 41034.0 (2, 7)
- 54 2.0 8.0 100.0 245824.0 (2, 8)
- 55 2.0 8.0 200.0 208052.0 (2, 8)
- 56 2.0 8.0 300.0 37568.0 (2, 8)
- 57 NaN NaN NaN NaN NaN
- 58 NaN NaN NaN NaN NaN
- 59 NaN NaN NaN NaN NaN
- ```
- # Timings
- I determined it to be fastest by testing the following methods:
- ```py
- import time
- import random
- import numpy as np
- import pandas as pd
- from itertools import product
- random.seed(12345)
- # dies = [1, 2]
- # cells = list(range(10))
- # currents = [100, 200, 300]
- dies = list(range(10))
- cells = list(range(1000))
- currents = list(range(100, 800 + 100, 100))
- dcc = list(product(dies, cells, currents))
- resistances = random.choices(range(250000 + 1), k=len(dcc))
- df = pd.DataFrame(dcc, columns=["Die", "Cell", "Current"])
- df["Resistance"] = resistances
- b100 = df[(df["Current"] == 100) & (df["Resistance"] < 100000)]
- print(f"df:\n{df}\n")
- print(f"b100:\n{b100}\n")
- # --
- df_og = df.copy()
- b100_og = b100.copy()
- # --
- start = time.perf_counter()
- rm_dies = b100["Die"].to_list()
- rm_cells = b100["Cell"].to_list()
- for die, cell in zip(rm_dies, rm_cells):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- rm_dies = b100["Die"].to_list()
- rm_cells = b100["Cell"].to_list()
- for die, cell in zip(rm_dies, rm_cells):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_no_dup_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- for die, cell in zip(b100["Die"], b100["Cell"]):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_no_dup_zip_time = stop - start
- # --
- # INCORRECT
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- df = df.set_index(["Die", "Cell", "Current"])
- df.loc[(b100["Die"], b100["Cell"], slice(None)), :] = np.nan
- stop = time.perf_counter()
- slice_time = stop - start
- # INCORRECT
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
- stop = time.perf_counter()
- tuple_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(
- df[["Die", "Cell"]].apply(tuple, axis="columns").isin(
- b100[["Die", "Cell"]].apply(tuple, axis="columns")))
- stop = time.perf_counter()
- tuple_fly_time = stop - start
- # --
- print(f"Loop time: {loop_time:.2f} seconds")
- print(f"Loop no dup time: {loop_no_dup_time:.2f} seconds")
- print(f"Loop no dup zip time: {loop_no_dup_zip_time:.2f} seconds")
- print(f"Slice time: {slice_time:.2f} seconds")
- print(f"Tuple time: {tuple_time:.2f} seconds")
- print(f"Tuple fly time: {tuple_fly_time:.2f} seconds")
- ```
- ```txt
Loop time: 14.78 secondsLoop no dup time: 14.60 secondsLoop no dup zip time: 15.54 seconds- Slice time: 0.16 seconds
Tuple time: 0.04 secondsTuple fly time: 0.31 seconds- Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
- ```
- ## Notes
- Slicing is a close second on speed, but produces an incorrect result.
- Same behavior as (mentioned in [Tried](https://software.codidact.com/posts/288068)):
- ```py
- df = df.mask((df["Die"].isin(b100["Die"])) & (df["Cell"].isin(b100["Cell"])))
- ```
- The use of `drop_duplicates()` is technically not needed, since `b100`
- is created at a single current, with no duplicate die, cell pairs.
- Despite this, it is left in place for my own convenience.
- ---
- Inspired by:
- https://stackoverflow.com/a/16068497
- https://stackoverflow.com/a/70355304
- # Answer
- After re-encountering this problem and searching for a solution with a
- less generic "pandas check if column pair is found in other df"
- (forgetting I had previously encountered this problem altogether), I
- found some inspiration that led me to a solid solution.
- It is fastest, and most correct to form a column containing tuple values
- of the value pair in question, and use `isin()` to find the matches.
- This column can be later dropped after use.
- ```py
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
- ```
- ```txt
- Die Cell Current Resistance Die, Cell
- 0 1.0 0.0 100.0 104155.0 (1, 0)
- 1 1.0 0.0 200.0 2542.0 (1, 0)
- 2 1.0 0.0 300.0 206302.0 (1, 0)
- 3 NaN NaN NaN NaN NaN
- 4 NaN NaN NaN NaN NaN
- 5 NaN NaN NaN NaN NaN
- 6 1.0 2.0 100.0 141502.0 (1, 2)
- 7 1.0 2.0 200.0 40422.0 (1, 2)
- 8 1.0 2.0 300.0 31066.0 (1, 2)
- 9 1.0 3.0 100.0 108234.0 (1, 3)
- 10 1.0 3.0 200.0 140520.0 (1, 3)
- 11 1.0 3.0 300.0 43586.0 (1, 3)
- 12 1.0 4.0 100.0 138305.0 (1, 4)
- 13 1.0 4.0 200.0 88725.0 (1, 4)
- 14 1.0 4.0 300.0 239517.0 (1, 4)
- 15 NaN NaN NaN NaN NaN
- 16 NaN NaN NaN NaN NaN
- 17 NaN NaN NaN NaN NaN
- 18 1.0 6.0 100.0 125984.0 (1, 6)
- 19 1.0 6.0 200.0 37036.0 (1, 6)
- 20 1.0 6.0 300.0 179742.0 (1, 6)
- 21 NaN NaN NaN NaN NaN
- 22 NaN NaN NaN NaN NaN
- 23 NaN NaN NaN NaN NaN
- 24 NaN NaN NaN NaN NaN
- 25 NaN NaN NaN NaN NaN
- 26 NaN NaN NaN NaN NaN
- 27 1.0 9.0 100.0 186133.0 (1, 9)
- 28 1.0 9.0 200.0 863.0 (1, 9)
- 29 1.0 9.0 300.0 235060.0 (1, 9)
- 30 2.0 0.0 100.0 217692.0 (2, 0)
- 31 2.0 0.0 200.0 192709.0 (2, 0)
- 32 2.0 0.0 300.0 44718.0 (2, 0)
- 33 NaN NaN NaN NaN NaN
- 34 NaN NaN NaN NaN NaN
- 35 NaN NaN NaN NaN NaN
- 36 2.0 2.0 100.0 144522.0 (2, 2)
- 37 2.0 2.0 200.0 184146.0 (2, 2)
- 38 2.0 2.0 300.0 58155.0 (2, 2)
- 39 2.0 3.0 100.0 130899.0 (2, 3)
- 40 2.0 3.0 200.0 177347.0 (2, 3)
- 41 2.0 3.0 300.0 206209.0 (2, 3)
- 42 2.0 4.0 100.0 201781.0 (2, 4)
- 43 2.0 4.0 200.0 58077.0 (2, 4)
- 44 2.0 4.0 300.0 218298.0 (2, 4)
- 45 NaN NaN NaN NaN NaN
- 46 NaN NaN NaN NaN NaN
- 47 NaN NaN NaN NaN NaN
- 48 NaN NaN NaN NaN NaN
- 49 NaN NaN NaN NaN NaN
- 50 NaN NaN NaN NaN NaN
- 51 2.0 7.0 100.0 239666.0 (2, 7)
- 52 2.0 7.0 200.0 10384.0 (2, 7)
- 53 2.0 7.0 300.0 41034.0 (2, 7)
- 54 2.0 8.0 100.0 245824.0 (2, 8)
- 55 2.0 8.0 200.0 208052.0 (2, 8)
- 56 2.0 8.0 300.0 37568.0 (2, 8)
- 57 NaN NaN NaN NaN NaN
- 58 NaN NaN NaN NaN NaN
- 59 NaN NaN NaN NaN NaN
- ```
- # Timings
- I determined it to be fastest by testing the following methods:
- ```py
- import time
- import random
- import numpy as np
- import pandas as pd
- from itertools import product
- random.seed(12345)
- # dies = [1, 2]
- # cells = list(range(10))
- # currents = [100, 200, 300]
- dies = list(range(10))
- cells = list(range(1000))
- currents = list(range(100, 800 + 100, 100))
- dcc = list(product(dies, cells, currents))
- resistances = random.choices(range(250000 + 1), k=len(dcc))
- df = pd.DataFrame(dcc, columns=["Die", "Cell", "Current"])
- df["Resistance"] = resistances
- b100 = df[(df["Current"] == 100) & (df["Resistance"] < 100000)]
- print(f"df:\n{df}\n")
- print(f"b100:\n{b100}\n")
- # --
- df_og = df.copy()
- b100_og = b100.copy()
- # --
- start = time.perf_counter()
- rm_dies = b100["Die"].to_list()
- rm_cells = b100["Cell"].to_list()
- for die, cell in zip(rm_dies, rm_cells):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- rm_dies = b100["Die"].to_list()
- rm_cells = b100["Cell"].to_list()
- for die, cell in zip(rm_dies, rm_cells):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_no_dup_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- for die, cell in zip(b100["Die"], b100["Cell"]):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_no_dup_zip_time = stop - start
- # --
- # INCORRECT
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- df = df.set_index(["Die", "Cell", "Current"])
- df.loc[(b100["Die"], b100["Cell"], slice(None)), :] = np.nan
- stop = time.perf_counter()
- slice_time = stop - start
- # INCORRECT
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
- stop = time.perf_counter()
- tuple_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(
- df[["Die", "Cell"]].apply(tuple, axis="columns").isin(
- b100[["Die", "Cell"]].apply(tuple, axis="columns")))
- stop = time.perf_counter()
- tuple_fly_time = stop - start
- # --
- print(f"Loop time: {loop_time:.2f} seconds")
- print(f"Loop no dup time: {loop_no_dup_time:.2f} seconds")
- print(f"Loop no dup zip time: {loop_no_dup_zip_time:.2f} seconds")
- print(f"Slice time: {slice_time:.2f} seconds")
- print(f"Tuple time: {tuple_time:.2f} seconds")
- print(f"Tuple fly time: {tuple_fly_time:.2f} seconds")
- ```
- ```txt
- Loop time: 14.01 seconds
- Loop no dup time: 14.66 seconds
- Loop no dup zip time: 16.66 seconds
- Slice time: 0.16 seconds
- Tuple time: 0.03 seconds
- Tuple fly time: 0.28 seconds
- Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
- ```
- ## Notes
- Slicing is a close second on speed, but produces an incorrect result.
- Same behavior as (mentioned in [Tried](https://software.codidact.com/posts/288068)):
- ```py
- df = df.mask((df["Die"].isin(b100["Die"])) & (df["Cell"].isin(b100["Cell"])))
- ```
- The use of `drop_duplicates()` is technically not needed, since `b100`
- is created at a single current, with no duplicate die, cell pairs.
- Despite this, it is left in place for my own convenience.
- ---
- Inspired by:
- https://stackoverflow.com/a/16068497
- https://stackoverflow.com/a/70355304
#2: Post edited
- # Answer
- After re-encountering this problem and searching for a solution with a
- less generic "pandas check if column pair is found in other df"
- (forgetting I had previously encountered this problem altogether), I
- found some inspiration that led me to a solid solution.
- It is fastest, and most correct to form a column containing tuple values
of the value pair in question, and using `isin()` to find the matches.- This column can be later dropped after use.
- ```py
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
- ```
- ```txt
- Die Cell Current Resistance Die, Cell
- 0 1.0 0.0 100.0 104155.0 (1, 0)
- 1 1.0 0.0 200.0 2542.0 (1, 0)
- 2 1.0 0.0 300.0 206302.0 (1, 0)
- 3 NaN NaN NaN NaN NaN
- 4 NaN NaN NaN NaN NaN
- 5 NaN NaN NaN NaN NaN
- 6 1.0 2.0 100.0 141502.0 (1, 2)
- 7 1.0 2.0 200.0 40422.0 (1, 2)
- 8 1.0 2.0 300.0 31066.0 (1, 2)
- 9 1.0 3.0 100.0 108234.0 (1, 3)
- 10 1.0 3.0 200.0 140520.0 (1, 3)
- 11 1.0 3.0 300.0 43586.0 (1, 3)
- 12 1.0 4.0 100.0 138305.0 (1, 4)
- 13 1.0 4.0 200.0 88725.0 (1, 4)
- 14 1.0 4.0 300.0 239517.0 (1, 4)
- 15 NaN NaN NaN NaN NaN
- 16 NaN NaN NaN NaN NaN
- 17 NaN NaN NaN NaN NaN
- 18 1.0 6.0 100.0 125984.0 (1, 6)
- 19 1.0 6.0 200.0 37036.0 (1, 6)
- 20 1.0 6.0 300.0 179742.0 (1, 6)
- 21 NaN NaN NaN NaN NaN
- 22 NaN NaN NaN NaN NaN
- 23 NaN NaN NaN NaN NaN
- 24 NaN NaN NaN NaN NaN
- 25 NaN NaN NaN NaN NaN
- 26 NaN NaN NaN NaN NaN
- 27 1.0 9.0 100.0 186133.0 (1, 9)
- 28 1.0 9.0 200.0 863.0 (1, 9)
- 29 1.0 9.0 300.0 235060.0 (1, 9)
- 30 2.0 0.0 100.0 217692.0 (2, 0)
- 31 2.0 0.0 200.0 192709.0 (2, 0)
- 32 2.0 0.0 300.0 44718.0 (2, 0)
- 33 NaN NaN NaN NaN NaN
- 34 NaN NaN NaN NaN NaN
- 35 NaN NaN NaN NaN NaN
- 36 2.0 2.0 100.0 144522.0 (2, 2)
- 37 2.0 2.0 200.0 184146.0 (2, 2)
- 38 2.0 2.0 300.0 58155.0 (2, 2)
- 39 2.0 3.0 100.0 130899.0 (2, 3)
- 40 2.0 3.0 200.0 177347.0 (2, 3)
- 41 2.0 3.0 300.0 206209.0 (2, 3)
- 42 2.0 4.0 100.0 201781.0 (2, 4)
- 43 2.0 4.0 200.0 58077.0 (2, 4)
- 44 2.0 4.0 300.0 218298.0 (2, 4)
- 45 NaN NaN NaN NaN NaN
- 46 NaN NaN NaN NaN NaN
- 47 NaN NaN NaN NaN NaN
- 48 NaN NaN NaN NaN NaN
- 49 NaN NaN NaN NaN NaN
- 50 NaN NaN NaN NaN NaN
- 51 2.0 7.0 100.0 239666.0 (2, 7)
- 52 2.0 7.0 200.0 10384.0 (2, 7)
- 53 2.0 7.0 300.0 41034.0 (2, 7)
- 54 2.0 8.0 100.0 245824.0 (2, 8)
- 55 2.0 8.0 200.0 208052.0 (2, 8)
- 56 2.0 8.0 300.0 37568.0 (2, 8)
- 57 NaN NaN NaN NaN NaN
- 58 NaN NaN NaN NaN NaN
- 59 NaN NaN NaN NaN NaN
- ```
- # Timings
- I determined it to be fastest by testing the following methods:
- ```py
- import time
- import random
- import numpy as np
- import pandas as pd
- from itertools import product
- random.seed(12345)
- # dies = [1, 2]
- # cells = list(range(10))
- # currents = [100, 200, 300]
- dies = list(range(10))
- cells = list(range(1000))
- currents = list(range(100, 800 + 100, 100))
- dcc = list(product(dies, cells, currents))
- resistances = random.choices(range(250000 + 1), k=len(dcc))
- df = pd.DataFrame(dcc, columns=["Die", "Cell", "Current"])
- df["Resistance"] = resistances
- b100 = df[(df["Current"] == 100) & (df["Resistance"] < 100000)]
- print(f"df:\n{df}\n")
- print(f"b100:\n{b100}\n")
- # --
- df_og = df.copy()
- b100_og = b100.copy()
- # --
- start = time.perf_counter()
- rm_dies = b100["Die"].to_list()
- rm_cells = b100["Cell"].to_list()
- for die, cell in zip(rm_dies, rm_cells):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- rm_dies = b100["Die"].to_list()
- rm_cells = b100["Cell"].to_list()
- for die, cell in zip(rm_dies, rm_cells):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_no_dup_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- for die, cell in zip(b100["Die"], b100["Cell"]):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_no_dup_zip_time = stop - start
- # --
- # INCORRECT
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- df = df.set_index(["Die", "Cell", "Current"])
- df.loc[(b100["Die"], b100["Cell"], slice(None)), :] = np.nan
- stop = time.perf_counter()
- slice_time = stop - start
- # INCORRECT
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
- stop = time.perf_counter()
- tuple_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(
- df[["Die", "Cell"]].apply(tuple, axis="columns").isin(
- b100[["Die", "Cell"]].apply(tuple, axis="columns")))
- stop = time.perf_counter()
- tuple_fly_time = stop - start
- # --
- print(f"Loop time: {loop_time:.2f} seconds")
- print(f"Loop no dup time: {loop_no_dup_time:.2f} seconds")
- print(f"Loop no dup zip time: {loop_no_dup_zip_time:.2f} seconds")
- print(f"Slice time: {slice_time:.2f} seconds")
- print(f"Tuple time: {tuple_time:.2f} seconds")
- print(f"Tuple fly time: {tuple_fly_time:.2f} seconds")
- ```
- ```txt
- Loop time: 14.78 seconds
- Loop no dup time: 14.60 seconds
- Loop no dup zip time: 15.54 seconds
- Slice time: 0.16 seconds
- Tuple time: 0.04 seconds
- Tuple fly time: 0.31 seconds
- Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
- ```
- ## Notes
- Slicing is a close second on speed, but produces an incorrect result.
Same behavior as:- ```py
- df = df.mask((df["Die"].isin(b100["Die"])) & (df["Cell"].isin(b100["Cell"])))
- ```
- The use of `drop_duplicates()` is technically not needed, since `b100`
- is created at a single current, with no duplicate die, cell pairs.
- Despite this, it is left in place for my own convenience.
- ---
- Inspired by:
- https://stackoverflow.com/a/16068497
- https://stackoverflow.com/a/70355304
- # Answer
- After re-encountering this problem and searching for a solution with a
- less generic "pandas check if column pair is found in other df"
- (forgetting I had previously encountered this problem altogether), I
- found some inspiration that led me to a solid solution.
- It is fastest, and most correct to form a column containing tuple values
- of the value pair in question, and use `isin()` to find the matches.
- This column can be later dropped after use.
- ```py
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
- ```
- ```txt
- Die Cell Current Resistance Die, Cell
- 0 1.0 0.0 100.0 104155.0 (1, 0)
- 1 1.0 0.0 200.0 2542.0 (1, 0)
- 2 1.0 0.0 300.0 206302.0 (1, 0)
- 3 NaN NaN NaN NaN NaN
- 4 NaN NaN NaN NaN NaN
- 5 NaN NaN NaN NaN NaN
- 6 1.0 2.0 100.0 141502.0 (1, 2)
- 7 1.0 2.0 200.0 40422.0 (1, 2)
- 8 1.0 2.0 300.0 31066.0 (1, 2)
- 9 1.0 3.0 100.0 108234.0 (1, 3)
- 10 1.0 3.0 200.0 140520.0 (1, 3)
- 11 1.0 3.0 300.0 43586.0 (1, 3)
- 12 1.0 4.0 100.0 138305.0 (1, 4)
- 13 1.0 4.0 200.0 88725.0 (1, 4)
- 14 1.0 4.0 300.0 239517.0 (1, 4)
- 15 NaN NaN NaN NaN NaN
- 16 NaN NaN NaN NaN NaN
- 17 NaN NaN NaN NaN NaN
- 18 1.0 6.0 100.0 125984.0 (1, 6)
- 19 1.0 6.0 200.0 37036.0 (1, 6)
- 20 1.0 6.0 300.0 179742.0 (1, 6)
- 21 NaN NaN NaN NaN NaN
- 22 NaN NaN NaN NaN NaN
- 23 NaN NaN NaN NaN NaN
- 24 NaN NaN NaN NaN NaN
- 25 NaN NaN NaN NaN NaN
- 26 NaN NaN NaN NaN NaN
- 27 1.0 9.0 100.0 186133.0 (1, 9)
- 28 1.0 9.0 200.0 863.0 (1, 9)
- 29 1.0 9.0 300.0 235060.0 (1, 9)
- 30 2.0 0.0 100.0 217692.0 (2, 0)
- 31 2.0 0.0 200.0 192709.0 (2, 0)
- 32 2.0 0.0 300.0 44718.0 (2, 0)
- 33 NaN NaN NaN NaN NaN
- 34 NaN NaN NaN NaN NaN
- 35 NaN NaN NaN NaN NaN
- 36 2.0 2.0 100.0 144522.0 (2, 2)
- 37 2.0 2.0 200.0 184146.0 (2, 2)
- 38 2.0 2.0 300.0 58155.0 (2, 2)
- 39 2.0 3.0 100.0 130899.0 (2, 3)
- 40 2.0 3.0 200.0 177347.0 (2, 3)
- 41 2.0 3.0 300.0 206209.0 (2, 3)
- 42 2.0 4.0 100.0 201781.0 (2, 4)
- 43 2.0 4.0 200.0 58077.0 (2, 4)
- 44 2.0 4.0 300.0 218298.0 (2, 4)
- 45 NaN NaN NaN NaN NaN
- 46 NaN NaN NaN NaN NaN
- 47 NaN NaN NaN NaN NaN
- 48 NaN NaN NaN NaN NaN
- 49 NaN NaN NaN NaN NaN
- 50 NaN NaN NaN NaN NaN
- 51 2.0 7.0 100.0 239666.0 (2, 7)
- 52 2.0 7.0 200.0 10384.0 (2, 7)
- 53 2.0 7.0 300.0 41034.0 (2, 7)
- 54 2.0 8.0 100.0 245824.0 (2, 8)
- 55 2.0 8.0 200.0 208052.0 (2, 8)
- 56 2.0 8.0 300.0 37568.0 (2, 8)
- 57 NaN NaN NaN NaN NaN
- 58 NaN NaN NaN NaN NaN
- 59 NaN NaN NaN NaN NaN
- ```
- # Timings
- I determined it to be fastest by testing the following methods:
- ```py
- import time
- import random
- import numpy as np
- import pandas as pd
- from itertools import product
- random.seed(12345)
- # dies = [1, 2]
- # cells = list(range(10))
- # currents = [100, 200, 300]
- dies = list(range(10))
- cells = list(range(1000))
- currents = list(range(100, 800 + 100, 100))
- dcc = list(product(dies, cells, currents))
- resistances = random.choices(range(250000 + 1), k=len(dcc))
- df = pd.DataFrame(dcc, columns=["Die", "Cell", "Current"])
- df["Resistance"] = resistances
- b100 = df[(df["Current"] == 100) & (df["Resistance"] < 100000)]
- print(f"df:\n{df}\n")
- print(f"b100:\n{b100}\n")
- # --
- df_og = df.copy()
- b100_og = b100.copy()
- # --
- start = time.perf_counter()
- rm_dies = b100["Die"].to_list()
- rm_cells = b100["Cell"].to_list()
- for die, cell in zip(rm_dies, rm_cells):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- rm_dies = b100["Die"].to_list()
- rm_cells = b100["Cell"].to_list()
- for die, cell in zip(rm_dies, rm_cells):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_no_dup_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- for die, cell in zip(b100["Die"], b100["Cell"]):
- df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
- stop = time.perf_counter()
- loop_no_dup_zip_time = stop - start
- # --
- # INCORRECT
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- b100 = b100.drop_duplicates(["Die", "Cell"])
- df = df.set_index(["Die", "Cell", "Current"])
- df.loc[(b100["Die"], b100["Cell"], slice(None)), :] = np.nan
- stop = time.perf_counter()
- slice_time = stop - start
- # INCORRECT
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
- stop = time.perf_counter()
- tuple_time = stop - start
- # --
- df = df_og.copy()
- b100 = b100_og.copy()
- start = time.perf_counter()
- df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
- b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
- df = df.mask(
- df[["Die", "Cell"]].apply(tuple, axis="columns").isin(
- b100[["Die", "Cell"]].apply(tuple, axis="columns")))
- stop = time.perf_counter()
- tuple_fly_time = stop - start
- # --
- print(f"Loop time: {loop_time:.2f} seconds")
- print(f"Loop no dup time: {loop_no_dup_time:.2f} seconds")
- print(f"Loop no dup zip time: {loop_no_dup_zip_time:.2f} seconds")
- print(f"Slice time: {slice_time:.2f} seconds")
- print(f"Tuple time: {tuple_time:.2f} seconds")
- print(f"Tuple fly time: {tuple_fly_time:.2f} seconds")
- ```
- ```txt
- Loop time: 14.78 seconds
- Loop no dup time: 14.60 seconds
- Loop no dup zip time: 15.54 seconds
- Slice time: 0.16 seconds
- Tuple time: 0.04 seconds
- Tuple fly time: 0.31 seconds
- Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
- ```
- ## Notes
- Slicing is a close second on speed, but produces an incorrect result.
- Same behavior as (mentioned in [Tried](https://software.codidact.com/posts/288068)):
- ```py
- df = df.mask((df["Die"].isin(b100["Die"])) & (df["Cell"].isin(b100["Cell"])))
- ```
- The use of `drop_duplicates()` is technically not needed, since `b100`
- is created at a single current, with no duplicate die, cell pairs.
- Despite this, it is left in place for my own convenience.
- ---
- Inspired by:
- https://stackoverflow.com/a/16068497
- https://stackoverflow.com/a/70355304
#1: Initial revision
# Answer After re-encountering this problem and searching for a solution with a less generic "pandas check if column pair is found in other df" (forgetting I had previously encountered this problem altogether), I found some inspiration that led me to a solid solution. It is fastest, and most correct to form a column containing tuple values of the value pair in question, and using `isin()` to find the matches. This column can be later dropped after use. ```py df["Die, Cell"] = list(zip(df["Die"], df["Cell"])) b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"])) df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"])) ``` ```txt Die Cell Current Resistance Die, Cell 0 1.0 0.0 100.0 104155.0 (1, 0) 1 1.0 0.0 200.0 2542.0 (1, 0) 2 1.0 0.0 300.0 206302.0 (1, 0) 3 NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN NaN 6 1.0 2.0 100.0 141502.0 (1, 2) 7 1.0 2.0 200.0 40422.0 (1, 2) 8 1.0 2.0 300.0 31066.0 (1, 2) 9 1.0 3.0 100.0 108234.0 (1, 3) 10 1.0 3.0 200.0 140520.0 (1, 3) 11 1.0 3.0 300.0 43586.0 (1, 3) 12 1.0 4.0 100.0 138305.0 (1, 4) 13 1.0 4.0 200.0 88725.0 (1, 4) 14 1.0 4.0 300.0 239517.0 (1, 4) 15 NaN NaN NaN NaN NaN 16 NaN NaN NaN NaN NaN 17 NaN NaN NaN NaN NaN 18 1.0 6.0 100.0 125984.0 (1, 6) 19 1.0 6.0 200.0 37036.0 (1, 6) 20 1.0 6.0 300.0 179742.0 (1, 6) 21 NaN NaN NaN NaN NaN 22 NaN NaN NaN NaN NaN 23 NaN NaN NaN NaN NaN 24 NaN NaN NaN NaN NaN 25 NaN NaN NaN NaN NaN 26 NaN NaN NaN NaN NaN 27 1.0 9.0 100.0 186133.0 (1, 9) 28 1.0 9.0 200.0 863.0 (1, 9) 29 1.0 9.0 300.0 235060.0 (1, 9) 30 2.0 0.0 100.0 217692.0 (2, 0) 31 2.0 0.0 200.0 192709.0 (2, 0) 32 2.0 0.0 300.0 44718.0 (2, 0) 33 NaN NaN NaN NaN NaN 34 NaN NaN NaN NaN NaN 35 NaN NaN NaN NaN NaN 36 2.0 2.0 100.0 144522.0 (2, 2) 37 2.0 2.0 200.0 184146.0 (2, 2) 38 2.0 2.0 300.0 58155.0 (2, 2) 39 2.0 3.0 100.0 130899.0 (2, 3) 40 2.0 3.0 200.0 177347.0 (2, 3) 41 2.0 3.0 300.0 206209.0 (2, 3) 42 2.0 4.0 100.0 201781.0 (2, 4) 43 2.0 4.0 200.0 58077.0 (2, 4) 44 2.0 4.0 300.0 218298.0 (2, 4) 45 NaN NaN NaN NaN NaN 46 NaN NaN NaN NaN NaN 47 NaN NaN NaN NaN NaN 48 NaN NaN NaN NaN NaN 49 NaN NaN NaN NaN NaN 50 NaN NaN NaN NaN NaN 51 2.0 7.0 100.0 239666.0 (2, 7) 52 2.0 7.0 200.0 10384.0 (2, 7) 53 2.0 7.0 300.0 41034.0 (2, 7) 54 2.0 8.0 100.0 245824.0 (2, 8) 55 2.0 8.0 200.0 208052.0 (2, 8) 56 2.0 8.0 300.0 37568.0 (2, 8) 57 NaN NaN NaN NaN NaN 58 NaN NaN NaN NaN NaN 59 NaN NaN NaN NaN NaN ``` # Timings I determined it to be fastest by testing the following methods: ```py import time import random import numpy as np import pandas as pd from itertools import product random.seed(12345) # dies = [1, 2] # cells = list(range(10)) # currents = [100, 200, 300] dies = list(range(10)) cells = list(range(1000)) currents = list(range(100, 800 + 100, 100)) dcc = list(product(dies, cells, currents)) resistances = random.choices(range(250000 + 1), k=len(dcc)) df = pd.DataFrame(dcc, columns=["Die", "Cell", "Current"]) df["Resistance"] = resistances b100 = df[(df["Current"] == 100) & (df["Resistance"] < 100000)] print(f"df:\n{df}\n") print(f"b100:\n{b100}\n") # -- df_og = df.copy() b100_og = b100.copy() # -- start = time.perf_counter() rm_dies = b100["Die"].to_list() rm_cells = b100["Cell"].to_list() for die, cell in zip(rm_dies, rm_cells): df = df.mask((df["Die"] == die) & (df["Cell"] == cell)) stop = time.perf_counter() loop_time = stop - start # -- df = df_og.copy() b100 = b100_og.copy() start = time.perf_counter() b100 = b100.drop_duplicates(["Die", "Cell"]) rm_dies = b100["Die"].to_list() rm_cells = b100["Cell"].to_list() for die, cell in zip(rm_dies, rm_cells): df = df.mask((df["Die"] == die) & (df["Cell"] == cell)) stop = time.perf_counter() loop_no_dup_time = stop - start # -- df = df_og.copy() b100 = b100_og.copy() start = time.perf_counter() b100 = b100.drop_duplicates(["Die", "Cell"]) for die, cell in zip(b100["Die"], b100["Cell"]): df = df.mask((df["Die"] == die) & (df["Cell"] == cell)) stop = time.perf_counter() loop_no_dup_zip_time = stop - start # -- # INCORRECT df = df_og.copy() b100 = b100_og.copy() start = time.perf_counter() b100 = b100.drop_duplicates(["Die", "Cell"]) df = df.set_index(["Die", "Cell", "Current"]) df.loc[(b100["Die"], b100["Cell"], slice(None)), :] = np.nan stop = time.perf_counter() slice_time = stop - start # INCORRECT # -- df = df_og.copy() b100 = b100_og.copy() start = time.perf_counter() df["Die, Cell"] = list(zip(df["Die"], df["Cell"])) b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"])) df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"])) stop = time.perf_counter() tuple_time = stop - start # -- df = df_og.copy() b100 = b100_og.copy() start = time.perf_counter() df["Die, Cell"] = list(zip(df["Die"], df["Cell"])) b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"])) df = df.mask( df[["Die", "Cell"]].apply(tuple, axis="columns").isin( b100[["Die", "Cell"]].apply(tuple, axis="columns"))) stop = time.perf_counter() tuple_fly_time = stop - start # -- print(f"Loop time: {loop_time:.2f} seconds") print(f"Loop no dup time: {loop_no_dup_time:.2f} seconds") print(f"Loop no dup zip time: {loop_no_dup_zip_time:.2f} seconds") print(f"Slice time: {slice_time:.2f} seconds") print(f"Tuple time: {tuple_time:.2f} seconds") print(f"Tuple fly time: {tuple_fly_time:.2f} seconds") ``` ```txt Loop time: 14.78 seconds Loop no dup time: 14.60 seconds Loop no dup zip time: 15.54 seconds Slice time: 0.16 seconds Tuple time: 0.04 seconds Tuple fly time: 0.31 seconds Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz ``` ## Notes Slicing is a close second on speed, but produces an incorrect result. Same behavior as: ```py df = df.mask((df["Die"].isin(b100["Die"])) & (df["Cell"].isin(b100["Cell"]))) ``` The use of `drop_duplicates()` is technically not needed, since `b100` is created at a single current, with no duplicate die, cell pairs. Despite this, it is left in place for my own convenience. --- Inspired by: https://stackoverflow.com/a/16068497 https://stackoverflow.com/a/70355304