Comparing two Excel files with Python based on changes
I have two tables:
Table1:
Name | Description | Amount |
---|---|---|
123 | Description123 | 123 |
456 | Description456 | 456 |
789 | Description789 | 666 |
101 | Description777 | 101 |
133 | Description133 | 133 |
Table2:
Name | Description | Amount |
---|---|---|
456 | Description456 | 456 |
789 | Description789 | 789 |
101 | Description101 | 101 |
123 | Description123 | 123 |
102 | Description102 | 102 |
I need to find the differences between Table1 and Table2. The two Excel files are linked by the Name column.
Expected output: if a row has changed in Table2, the data from Table2 must be used, and any new rows in Table2 must be added to the final result. Rows that are unchanged, or whose Name appears only in Table1 (like 133), must also be kept in the final result.
Expected output:
Name | Description | Amount |
---|---|---|
123 | Description123 | 123 |
456 | Description456 | 456 |
789 | Description789 | 789 |
101 | Description101 | 101 |
102 | Description102 | 102 |
133 | Description133 | 133 |
Thanks in advance!
Edit 1: I'm struggling to find a solution. I understand how to compare the Excel files row by row, but that requires the Name column to be in exactly the same order in both files. I don't know how to do it when the order differs, as in the case above.
1 answer
Here's what I'd do, hope it still helps someone:
import pandas as pd
t1 = [123,456,789,101,133]
t1_descr = ['Description' + str(i) for i in t1]
table1 = pd.DataFrame({'name': t1, 'description': t1_descr, 'amount': [123,456,666,101,133]})
t2 = [456,789,101,123,102]
t2_descr = ['Description' + str(i) for i in t2]
table2 = pd.DataFrame({'name': t2, 'description': t2_descr, 'amount': t2})
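# Outer merge on `name` keeps rows from both tables; `indicator=True` adds a `_merge` column showing where each row came from
df = table1.merge(table2, on=['name'], how='outer', suffixes=('_t1', '_t2'), indicator=True)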
- | name | description_t1 | amount_t1 | description_t2 | amount_t2 | _merge |
---|---|---|---|---|---|---|
0 | 123 | Description123 | 123.0 | Description123 | 123.0 | both |
1 | 456 | Description456 | 456.0 | Description456 | 456.0 | both |
2 | 789 | Description789 | 666.0 | Description789 | 789.0 | both |
3 | 101 | Description101 | 101.0 | Description101 | 101.0 | both |
4 | 133 | Description133 | 133.0 | NaN | NaN | left_only |
5 | 102 | NaN | NaN | Description102 | 102.0 | right_only |
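As an aside, since the question also asks how to find the differences, the same merged frame can be used to list which rows changed, which are new, and which are missing from table2. This is only a sketch on the sample frames built above; the variable names `merged`, `changed`, `added` and `only_in_t1` are my own, and it assumes `name` is unique within each table.

```python
import pandas as pd

# Same sample data as the frames built above
t1 = [123, 456, 789, 101, 133]
table1 = pd.DataFrame({
    "name": t1,
    "description": ["Description" + str(i) for i in t1],
    "amount": [123, 456, 666, 101, 133],
})
t2 = [456, 789, 101, 123, 102]
table2 = pd.DataFrame({
    "name": t2,
    "description": ["Description" + str(i) for i in t2],
    "amount": t2,
})

merged = table1.merge(
    table2, on="name", how="outer", suffixes=("_t1", "_t2"), indicator=True
)

# Rows present in both tables whose description or amount differs
in_both = merged["_merge"] == "both"
differs = (
    (merged["description_t1"] != merged["description_t2"])
    | (merged["amount_t1"] != merged["amount_t2"])
)
changed = merged[in_both & differs]

# Rows that exist in only one of the tables
added = merged[merged["_merge"] == "right_only"]
only_in_t1 = merged[merged["_merge"] == "left_only"]

print(changed[["name", "amount_t1", "amount_t2"]])
```

The `in_both` mask matters here: a `NaN` never compares equal to anything, so without it the rows that exist in only one table would also be reported as changed.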
# If `name` is on both tables, use table2
df2 = df.copy()
df2.loc[df2._merge=='both', 'description'] = df2.loc[df2._merge=='both', 'description_t2']
df2.loc[df2._merge=='both', 'amount'] = df2.loc[df2._merge=='both', 'amount_t2']
# New rows on table2
df2.loc[df2._merge=='right_only', 'description'] = df2.loc[df2._merge=='right_only', 'description_t2']
df2.loc[df2._merge=='right_only', 'amount'] = df2.loc[df2._merge=='right_only', 'amount_t2']
# If `name` not in table2, use table1
df2.loc[df2._merge=='left_only', 'description'] = df2.loc[df2._merge=='left_only', 'description_t1']
df2.loc[df2._merge=='left_only', 'amount'] = df2.loc[df2._merge=='left_only', 'amount_t1']
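# Drop the helper columns; `drop` returns a new DataFrame, so assign it if you want to keep the result
df2.drop(columns=['description_t1', 'amount_t1', 'description_t2', 'amount_t2', '_merge'])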
- | name | description | amount |
---|---|---|---|
0 | 123 | Description123 | 123.0 |
1 | 456 | Description456 | 456.0 |
2 | 789 | Description789 | 789.0 |
3 | 101 | Description101 | 101.0 |
4 | 133 | Description133 | 133.0 |
5 | 102 | Description102 | 102.0 |
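If you don't need the intermediate `_merge` column, a shorter route might be to stack the two tables and keep the last occurrence of each `name`, so that table2 wins whenever a name appears in both. This is a sketch rather than a drop-in solution: it assumes `name` is unique within each table, the frames are rebuilt from the question's tables, and the file names in the commented-out `pd.read_excel` lines are made up.

```python
import pandas as pd

# With real files you would load the sheets instead, e.g.
# (hypothetical file names):
# table1 = pd.read_excel("table1.xlsx")
# table2 = pd.read_excel("table2.xlsx")
table1 = pd.DataFrame({
    "name": [123, 456, 789, 101, 133],
    "description": ["Description123", "Description456", "Description789",
                    "Description777", "Description133"],
    "amount": [123, 456, 666, 101, 133],
})
table2 = pd.DataFrame({
    "name": [456, 789, 101, 123, 102],
    "description": ["Description456", "Description789", "Description101",
                    "Description123", "Description102"],
    "amount": [456, 789, 101, 123, 102],
})

result = (
    pd.concat([table1, table2], ignore_index=True)  # table2 rows come last
    .drop_duplicates(subset="name", keep="last")    # so table2 wins on ties
    .sort_values("name")
    .reset_index(drop=True)
)

print(result)
```

Because no `NaN` values are introduced along the way, `amount` stays an integer column here instead of turning into floats as it does after the outer merge. If you want to write the result back to Excel, `result.to_excel(...)` with `index=False` should do it.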