Remove entries by two-column ID everywhere that meet a condition somewhere
MWE
import random
from itertools import product

import pandas as pd

random.seed(12345)

# One row per (die, cell, current) combination, with a random resistance
dies = [1, 2]
cells = list(range(10))
currents = [100, 200, 300]
dcc = list(product(dies, cells, currents))
resistances = random.choices(range(250000 + 1), k=len(dcc))
df = pd.DataFrame(dcc, columns=["Die", "Cell", "Current"])
df["Resistance"] = resistances

# Rows measured at current 100 whose resistance is below 100k
b100 = df[(df["Current"] == 100) & (df["Resistance"] < 100000)]
df:
Die Cell Current Resistance
0 1 0 100 104155
1 1 0 200 2542
2 1 0 300 206302
3 1 1 100 74660
4 1 1 200 92103
5 1 1 300 48415
6 1 2 100 141502
7 1 2 200 40422
8 1 2 300 31066
9 1 3 100 108234
10 1 3 200 140520
11 1 3 300 43586
12 1 4 100 138305
13 1 4 200 88725
14 1 4 300 239517
15 1 5 100 22823
16 1 5 200 244660
17 1 5 300 103030
18 1 6 100 125984
19 1 6 200 37036
20 1 6 300 179742
21 1 7 100 47493
22 1 7 200 85390
23 1 7 300 5880
24 1 8 100 84879
25 1 8 200 241871
26 1 8 300 244700
27 1 9 100 186133
28 1 9 200 863
29 1 9 300 235060
30 2 0 100 217692
31 2 0 200 192709
32 2 0 300 44718
33 2 1 100 24875
34 2 1 200 103633
35 2 1 300 221385
36 2 2 100 144522
37 2 2 200 184146
38 2 2 300 58155
39 2 3 100 130899
40 2 3 200 177347
41 2 3 300 206209
42 2 4 100 201781
43 2 4 200 58077
44 2 4 300 218298
45 2 5 100 54095
46 2 5 200 200475
47 2 5 300 138771
48 2 6 100 46457
49 2 6 200 147152
50 2 6 300 129560
51 2 7 100 239666
52 2 7 200 10384
53 2 7 300 41034
54 2 8 100 245824
55 2 8 200 208052
56 2 8 300 37568
57 2 9 100 57278
58 2 9 200 134785
59 2 9 300 39245
b100:
Die Cell Current Resistance
3 1 1 100 74660
15 1 5 100 22823
21 1 7 100 47493
24 1 8 100 84879
33 2 1 100 24875
45 2 5 100 54095
48 2 6 100 46457
57 2 9 100 57278
Problem
If a cell's resistance is below 100k at a current of 100, I want to remove that cell everywhere within that die. b100 above shows that I want to remove all entries for cells 1, 5, 7, and 8 in die 1, and all entries for cells 1, 5, 6, and 9 in die 2. Cells 6 and 9 should not be removed from die 1, and cells 7 and 8 should not be removed from die 2.
How would I do this programmatically?
Tried
First attempt:
df = df.mask((df["Die"].isin(b100["Die"])) & (df["Cell"].isin(b100["Cell"])))
But this removes every marked cell from both dies, because each isin() is evaluated per column independently: a row matches if its Die appears anywhere in b100 and its Cell appears anywhere in b100, even if that exact pair never occurs together:
Die Cell Current Resistance
0 1.0 0.0 100.0 104155.0
1 1.0 0.0 200.0 2542.0
2 1.0 0.0 300.0 206302.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 1.0 2.0 100.0 141502.0
7 1.0 2.0 200.0 40422.0
8 1.0 2.0 300.0 31066.0
9 1.0 3.0 100.0 108234.0
10 1.0 3.0 200.0 140520.0
11 1.0 3.0 300.0 43586.0
12 1.0 4.0 100.0 138305.0
13 1.0 4.0 200.0 88725.0
14 1.0 4.0 300.0 239517.0
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 NaN NaN NaN NaN
18 NaN NaN NaN NaN
19 NaN NaN NaN NaN
20 NaN NaN NaN NaN
21 NaN NaN NaN NaN
22 NaN NaN NaN NaN
23 NaN NaN NaN NaN
24 NaN NaN NaN NaN
25 NaN NaN NaN NaN
26 NaN NaN NaN NaN
27 NaN NaN NaN NaN
28 NaN NaN NaN NaN
29 NaN NaN NaN NaN
30 2.0 0.0 100.0 217692.0
31 2.0 0.0 200.0 192709.0
32 2.0 0.0 300.0 44718.0
33 NaN NaN NaN NaN
34 NaN NaN NaN NaN
35 NaN NaN NaN NaN
36 2.0 2.0 100.0 144522.0
37 2.0 2.0 200.0 184146.0
38 2.0 2.0 300.0 58155.0
39 2.0 3.0 100.0 130899.0
40 2.0 3.0 200.0 177347.0
41 2.0 3.0 300.0 206209.0
42 2.0 4.0 100.0 201781.0
43 2.0 4.0 200.0 58077.0
44 2.0 4.0 300.0 218298.0
45 NaN NaN NaN NaN
46 NaN NaN NaN NaN
47 NaN NaN NaN NaN
48 NaN NaN NaN NaN
49 NaN NaN NaN NaN
50 NaN NaN NaN NaN
51 NaN NaN NaN NaN
52 NaN NaN NaN NaN
53 NaN NaN NaN NaN
54 NaN NaN NaN NaN
55 NaN NaN NaN NaN
56 NaN NaN NaN NaN
57 NaN NaN NaN NaN
58 NaN NaN NaN NaN
59 NaN NaN NaN NaN
Second attempt:
rm_dies = b100["Die"].to_list()
rm_cells = b100["Cell"].to_list()
for die, cell in zip(rm_dies, rm_cells):
    df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
This works, but it is very slow on large dataframes and is not elegant: each iteration builds a fresh boolean mask over the whole frame, so the cost grows with rows × pairs.
Die Cell Current Resistance
0 1.0 0.0 100.0 104155.0
1 1.0 0.0 200.0 2542.0
2 1.0 0.0 300.0 206302.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 1.0 2.0 100.0 141502.0
7 1.0 2.0 200.0 40422.0
8 1.0 2.0 300.0 31066.0
9 1.0 3.0 100.0 108234.0
10 1.0 3.0 200.0 140520.0
11 1.0 3.0 300.0 43586.0
12 1.0 4.0 100.0 138305.0
13 1.0 4.0 200.0 88725.0
14 1.0 4.0 300.0 239517.0
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 NaN NaN NaN NaN
18 1.0 6.0 100.0 125984.0
19 1.0 6.0 200.0 37036.0
20 1.0 6.0 300.0 179742.0
21 NaN NaN NaN NaN
22 NaN NaN NaN NaN
23 NaN NaN NaN NaN
24 NaN NaN NaN NaN
25 NaN NaN NaN NaN
26 NaN NaN NaN NaN
27 1.0 9.0 100.0 186133.0
28 1.0 9.0 200.0 863.0
29 1.0 9.0 300.0 235060.0
30 2.0 0.0 100.0 217692.0
31 2.0 0.0 200.0 192709.0
32 2.0 0.0 300.0 44718.0
33 NaN NaN NaN NaN
34 NaN NaN NaN NaN
35 NaN NaN NaN NaN
36 2.0 2.0 100.0 144522.0
37 2.0 2.0 200.0 184146.0
38 2.0 2.0 300.0 58155.0
39 2.0 3.0 100.0 130899.0
40 2.0 3.0 200.0 177347.0
41 2.0 3.0 300.0 206209.0
42 2.0 4.0 100.0 201781.0
43 2.0 4.0 200.0 58077.0
44 2.0 4.0 300.0 218298.0
45 NaN NaN NaN NaN
46 NaN NaN NaN NaN
47 NaN NaN NaN NaN
48 NaN NaN NaN NaN
49 NaN NaN NaN NaN
50 NaN NaN NaN NaN
51 2.0 7.0 100.0 239666.0
52 2.0 7.0 200.0 10384.0
53 2.0 7.0 300.0 41034.0
54 2.0 8.0 100.0 245824.0
55 2.0 8.0 200.0 208052.0
56 2.0 8.0 300.0 37568.0
57 NaN NaN NaN NaN
58 NaN NaN NaN NaN
59 NaN NaN NaN NaN
Notes
Ultimately I want to remove these cells from the dataframe entirely; dropna() is omitted from the calls above for illustrative purposes.
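For illustration, the final removal is then a single extra call, since mask() leaves matched rows all-NaN:
# After masking, matched rows are all-NaN; dropna() removes them outright
df = df.dropna().reset_index(drop=True)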
3 answers
This can be solved in two steps:
- Find the rows matching the removal condition
- Do a left anti-join on the composite key (Die, Cell)
To filter out the rows:
# Read in the data
df = pd.read_csv("data.csv")
# Identify rows with low resistance at a current of 100
b100 = df[(df["Resistance"] < 100000) & (df["Current"] == 100)]
# Pull out only columns of interest
bad_cells = b100[["Die", "Cell"]]
We could take unique values of bad_cells here, but I didn't, because it doesn't affect the final outcome. Also, taking the column subset is not strictly necessary, but it avoids extra junk columns later.
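If you do want the deduplicated pairs, a one-line sketch:
# Optional: keep each (Die, Cell) pair only once
bad_cells = b100[["Die", "Cell"]].drop_duplicates()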
Pandas has no explicit anti-join (keep only the rows with no match in the other frame), so I stole one from https://stackoverflow.com/a/55543744/21703684:
# Remove them with an anti-join. indicator=True adds a "_merge" column
# whose value is "both" for rows that matched a pair in bad_cells.
joined = df.merge(bad_cells, how="left", indicator=True)
filtered = joined[joined["_merge"] != "both"].drop("_merge", axis=1)
print(filtered)
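A quick sanity check, sketched with an inner join: any surviving row that still carries a removed pair would show up here, so the merge result should be empty.
# Sanity check: no surviving row should match a removed (Die, Cell) pair
assert filtered.merge(bad_cells, how="inner").empty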
Answer
After re-encountering this problem (forgetting I had encountered it before) and searching for a solution with the less generic query "pandas check if column pair is found in other df", I found some inspiration that led me to a solid solution.
The fastest, and most correct, approach is to form a column of tuples of the value pair in question and use isin() to find the matches. The helper column can be dropped after use.
df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
Die Cell Current Resistance Die, Cell
0 1.0 0.0 100.0 104155.0 (1, 0)
1 1.0 0.0 200.0 2542.0 (1, 0)
2 1.0 0.0 300.0 206302.0 (1, 0)
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 1.0 2.0 100.0 141502.0 (1, 2)
7 1.0 2.0 200.0 40422.0 (1, 2)
8 1.0 2.0 300.0 31066.0 (1, 2)
9 1.0 3.0 100.0 108234.0 (1, 3)
10 1.0 3.0 200.0 140520.0 (1, 3)
11 1.0 3.0 300.0 43586.0 (1, 3)
12 1.0 4.0 100.0 138305.0 (1, 4)
13 1.0 4.0 200.0 88725.0 (1, 4)
14 1.0 4.0 300.0 239517.0 (1, 4)
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN
18 1.0 6.0 100.0 125984.0 (1, 6)
19 1.0 6.0 200.0 37036.0 (1, 6)
20 1.0 6.0 300.0 179742.0 (1, 6)
21 NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN
27 1.0 9.0 100.0 186133.0 (1, 9)
28 1.0 9.0 200.0 863.0 (1, 9)
29 1.0 9.0 300.0 235060.0 (1, 9)
30 2.0 0.0 100.0 217692.0 (2, 0)
31 2.0 0.0 200.0 192709.0 (2, 0)
32 2.0 0.0 300.0 44718.0 (2, 0)
33 NaN NaN NaN NaN NaN
34 NaN NaN NaN NaN NaN
35 NaN NaN NaN NaN NaN
36 2.0 2.0 100.0 144522.0 (2, 2)
37 2.0 2.0 200.0 184146.0 (2, 2)
38 2.0 2.0 300.0 58155.0 (2, 2)
39 2.0 3.0 100.0 130899.0 (2, 3)
40 2.0 3.0 200.0 177347.0 (2, 3)
41 2.0 3.0 300.0 206209.0 (2, 3)
42 2.0 4.0 100.0 201781.0 (2, 4)
43 2.0 4.0 200.0 58077.0 (2, 4)
44 2.0 4.0 300.0 218298.0 (2, 4)
45 NaN NaN NaN NaN NaN
46 NaN NaN NaN NaN NaN
47 NaN NaN NaN NaN NaN
48 NaN NaN NaN NaN NaN
49 NaN NaN NaN NaN NaN
50 NaN NaN NaN NaN NaN
51 2.0 7.0 100.0 239666.0 (2, 7)
52 2.0 7.0 200.0 10384.0 (2, 7)
53 2.0 7.0 300.0 41034.0 (2, 7)
54 2.0 8.0 100.0 245824.0 (2, 8)
55 2.0 8.0 200.0 208052.0 (2, 8)
56 2.0 8.0 300.0 37568.0 (2, 8)
57 NaN NaN NaN NaN NaN
58 NaN NaN NaN NaN NaN
59 NaN NaN NaN NaN NaN
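To finish, drop the blanked rows and the helper column; a minimal cleanup sketch:
# Remove the blanked rows, then the temporary key column
df = df.dropna().drop(columns=["Die, Cell"]).reset_index(drop=True)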
Timings
I determined it to be fastest by testing the following methods:
import time
import random
import numpy as np
import pandas as pd
from itertools import product
random.seed(12345)
# dies = [1, 2]
# cells = list(range(10))
# currents = [100, 200, 300]
dies = list(range(10))
cells = list(range(1000))
currents = list(range(100, 800 + 100, 100))
dcc = list(product(dies, cells, currents))
resistances = random.choices(range(250000 + 1), k=len(dcc))
df = pd.DataFrame(dcc, columns=["Die", "Cell", "Current"])
df["Resistance"] = resistances
b100 = df[(df["Current"] == 100) & (df["Resistance"] < 100000)]
print(f"df:\n{df}\n")
print(f"b100:\n{b100}\n")
# --
df_og = df.copy()
b100_og = b100.copy()
# --
start = time.perf_counter()
rm_dies = b100["Die"].to_list()
rm_cells = b100["Cell"].to_list()
for die, cell in zip(rm_dies, rm_cells):
df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
stop = time.perf_counter()
loop_time = stop - start
# --
df = df_og.copy()
b100 = b100_og.copy()
start = time.perf_counter()
b100 = b100.drop_duplicates(["Die", "Cell"])
rm_dies = b100["Die"].to_list()
rm_cells = b100["Cell"].to_list()
for die, cell in zip(rm_dies, rm_cells):
    df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
stop = time.perf_counter()
loop_no_dup_time = stop - start
# --
df = df_og.copy()
b100 = b100_og.copy()
start = time.perf_counter()
b100 = b100.drop_duplicates(["Die", "Cell"])
for die, cell in zip(b100["Die"], b100["Cell"]):
    df = df.mask((df["Die"] == die) & (df["Cell"] == cell))
stop = time.perf_counter()
loop_no_dup_zip_time = stop - start
# --
# INCORRECT
df = df_og.copy()
b100 = b100_og.copy()
start = time.perf_counter()
b100 = b100.drop_duplicates(["Die", "Cell"])
df = df.set_index(["Die", "Cell", "Current"])
df.loc[(b100["Die"], b100["Cell"], slice(None)), :] = np.nan
stop = time.perf_counter()
slice_time = stop - start
# INCORRECT
# --
df = df_og.copy()
b100 = b100_og.copy()
start = time.perf_counter()
b100 = b100.drop_duplicates(["Die", "Cell"])
df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
df = df.mask(df["Die, Cell"].isin(b100["Die, Cell"]))
stop = time.perf_counter()
tuple_time = stop - start
# --
df = df_og.copy()
b100 = b100_og.copy()
start = time.perf_counter()
b100 = b100.drop_duplicates(["Die", "Cell"])
df["Die, Cell"] = list(zip(df["Die"], df["Cell"]))
b100["Die, Cell"] = list(zip(b100["Die"], b100["Cell"]))
df = df.mask(
    df[["Die", "Cell"]].apply(tuple, axis="columns").isin(
        b100[["Die", "Cell"]].apply(tuple, axis="columns")))
stop = time.perf_counter()
tuple_fly_time = stop - start
# --
print(f"Loop time: {loop_time:.2f} seconds")
print(f"Loop no dup time: {loop_no_dup_time:.2f} seconds")
print(f"Loop no dup zip time: {loop_no_dup_zip_time:.2f} seconds")
print(f"Slice time: {slice_time:.2f} seconds")
print(f"Tuple time: {tuple_time:.2f} seconds")
print(f"Tuple fly time: {tuple_fly_time:.2f} seconds")
Loop time: 14.01 seconds
Loop no dup time: 14.66 seconds
Loop no dup zip time: 16.66 seconds
Slice time: 0.16 seconds
Tuple time: 0.03 seconds
Tuple fly time: 0.28 seconds
(Timings measured on an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz.)
Notes
Slicing is a close second on speed, but produces an incorrect result: passing the Die and Cell values to .loc as separate level indexers selects their full cross product. This is the same behavior as the first attempt mentioned in Tried:
df = df.mask((df["Die"].isin(b100["Die"])) & (df["Cell"].isin(b100["Cell"])))
The use of drop_duplicates() is technically not needed, since b100 is created at a single current and therefore contains no duplicate (die, cell) pairs. Despite this, it is left in place for my own convenience.
Inspired by:
https://stackoverflow.com/a/16068497
https://stackoverflow.com/a/70355304
Answer
You were on the right path breaking it down into two steps. First identify the (Die, Cell) pairs you want to remove. The die has to be part of the key: filtering on Cell alone would also drop cells 6 and 9 from die 1 and cells 7 and 8 from die 2.
# Identify the ('Die', 'Cell') pairs that meet the criteria
pairs_to_remove = df.loc[
    (df['Resistance'] < 100000) & (df['Current'] == 100),
    ['Die', 'Cell']].drop_duplicates()
And then remove them:
# Remove all rows whose ('Die', 'Cell') pair was identified
row_keys = pd.MultiIndex.from_frame(df[['Die', 'Cell']])
filtered_df = df[~row_keys.isin(pd.MultiIndex.from_frame(pairs_to_remove))]
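An alternative sketch of the same idea using groupby, likely slower on large frames but compact: drop every (Die, Cell) group that contains a qualifying row.
# Keep a (Die, Cell) group only if no row in it is below 100k at current 100
filtered_df = df.groupby(['Die', 'Cell']).filter(
    lambda g: not ((g['Current'] == 100) & (g['Resistance'] < 100000)).any())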