The Agony of Manual Wide-Table DML Operations
Ever stared at a Snowflake table with 100+ columns and felt your soul leave your body at the thought of manually typing them all in a MERGE
statement? Yeah, me too.
In this blog we’ll discuss how to handle those columns dynamically. I dare say it’s a framework.
Salesforce objects like Account and Opportunity, when integrated with Snowflake, often have 100+ columns. If you need to implement, say, SCD Type 1, where every column shows up in the insert/update pairs, it’s tedious.
Imagine this:
MERGE INTO target_table tgt
USING source_table src
ON tgt.id = src.id
WHEN MATCHED THEN
UPDATE SET
tgt.col1 = src.col1,
tgt.col2 = src.col2,
-- ... (100 more lines of this nonsense)
tgt.col100 = src.col100
WHEN NOT MATCHED THEN
INSERT (col1, col2, ..., col100)
VALUES (src.col1, src.col2, ..., src.col100);
Not only is this excruciatingly painful to write by hand, it’s also risky:
- Manual errors? Oh yes!
- Table structure changed? Good luck updating all those columns.
- Bored to death? Of course!
There must be a way to fetch the columns dynamically, so let’s dive in.
Framework Foundations
Here’s the approach we’ll follow:
- Query INFORMATION_SCHEMA for column names
sql00 = '''select column_name
from information_schema.columns
where table_name = 'OPPORTUNITY'
and table_schema = 'RAW'
order by ordinal_position;'''
sql01 = snowpark_session.sql(sql00).collect()  # execute the SQL query above
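For reference, collect() returns a list of Snowpark Row objects, one per column; the output looks roughly like this (illustrative):
# sql01 -> [Row(COLUMN_NAME='ID'), Row(COLUMN_NAME='NAME'), Row(COLUMN_NAME='AMOUNT'), ...]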
- Dynamically build the MERGE statement using Python
columns = [col[0] for col in sql01]  # extract column names from the Row objects
update_pairs = ",".join([f"tgt.{col} = src.{col}" for col in columns])  # SET clause pairs
insert_pairs = ",".join(columns)  # INSERT column list
insert_values = ",".join([f"src.{col}" for col in columns])  # VALUES list
Once all the required column names are stored in the columns variable, we use it to build update_pairs, insert_pairs, and insert_values, all of which the MERGE statement needs.
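To make that concrete, here’s what those fragments would look like for a hypothetical three-column table:
columns = ['ID', 'NAME', 'AMOUNT']  # hypothetical example
# update_pairs  -> "tgt.ID = src.ID,tgt.NAME = src.NAME,tgt.AMOUNT = src.AMOUNT"
# insert_pairs  -> "ID,NAME,AMOUNT"
# insert_values -> "src.ID,src.NAME,src.AMOUNT"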
This block is also where derived columns come in handy, for example computing a HASH value from the source table’s columns.
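As a minimal sketch, here’s how a derived hash column could be appended to the fragments (HASH_DIFF is a hypothetical column name; HASH is Snowflake’s built-in function):
# Hypothetical: append a derived HASH_DIFF column computed from all source columns
hash_expr = "HASH(" + ",".join(f"src.{col}" for col in columns) + ")"
update_pairs += f", tgt.HASH_DIFF = {hash_expr}"
insert_pairs += ",HASH_DIFF"
insert_values += f",{hash_expr}"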
- Execute the dynamic MERGE
target_table = 'TRANSFORMATION.OPPORTUNITY'
source_table = 'TRANSFORMATION.ST_OPPORTUNITY'
sql02 = f'''MERGE INTO {target_table} tgt
USING (select * from {source_table} where metadata$action = 'INSERT') src
ON tgt.id = src.id
WHEN MATCHED THEN
UPDATE SET {update_pairs}
WHEN NOT MATCHED THEN
INSERT ({insert_pairs}) VALUES ({insert_values});'''
res_df = snowpark_session.sql(sql02).collect()  # run the generated MERGE
It’s better to store the source and target table names in variables than to hard-code them in the SQL string.
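The demo below runs this logic as a stored procedure. Here’s a minimal sketch of how the three steps could be wrapped into a single Snowpark handler (the function name and return value are illustrative):
from snowflake.snowpark import Session

def merge_opportunity(snowpark_session: Session) -> str:
    """Illustrative handler combining the three steps above."""
    # Step 1: fetch column names from INFORMATION_SCHEMA
    rows = snowpark_session.sql('''select column_name
        from information_schema.columns
        where table_name = 'OPPORTUNITY'
        and table_schema = 'RAW'
        order by ordinal_position''').collect()
    columns = [row[0] for row in rows]

    # Step 2: build the column fragments
    update_pairs = ",".join(f"tgt.{col} = src.{col}" for col in columns)
    insert_pairs = ",".join(columns)
    insert_values = ",".join(f"src.{col}" for col in columns)

    # Step 3: run the dynamic MERGE against the stream
    target_table = 'TRANSFORMATION.OPPORTUNITY'
    source_table = 'TRANSFORMATION.ST_OPPORTUNITY'
    merge_sql = f'''MERGE INTO {target_table} tgt
        USING (select * from {source_table} where metadata$action = 'INSERT') src
        ON tgt.id = src.id
        WHEN MATCHED THEN UPDATE SET {update_pairs}
        WHEN NOT MATCHED THEN INSERT ({insert_pairs}) VALUES ({insert_values});'''
    result = snowpark_session.sql(merge_sql).collect()
    return str(result)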
Real-World Example: Salesforce Opportunity Object
To keep things grounded in a real-world scenario, I’ve used the Opportunity object, which has 100+ columns and comes up in most Salesforce-to-Snowflake integrations.
Here are the columns pulled from INFORMATION_SCHEMA:
Total columns: 154
Demo
Sample Opportunity Data:
No records in target table:
Executing the stored procedure:
Stream creation for CDC:
create or replace stream TRANSFORMATION.ST_OPPORTUNITY
on table raw.OPPORTUNITY
show_initial_rows=true;
This stream is what the metadata$action = 'INSERT' filter in the MERGE relies on; show_initial_rows=true makes the stream’s first read return the table’s existing rows as inserts.
Post-execution result:
Updated one record to validate:
Executing the procedure again; the record got updated:
Data in target table:
Why is this helpful?
- No manual column listing → Zero typos
- Automatically adapts if new columns are added
- Saves hours of boring, repetitive work
- Clean, maintainable code (future you will thank you)
A few things to consider:
- Exclude certain columns (like audit fields) by filtering the INFORMATION_SCHEMA query.
- Dynamic column handling can be used in most DML operations. Try it!
- Add a dry-run mode to test before executing (see the sketch after this list).
- Log everything (because debugging blind is painful).
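Here’s a minimal sketch covering those last points; the EXCLUDED set and DRY_RUN flag are hypothetical, and columns / sql02 come from the earlier snippets:
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dynamic_merge")

# Hypothetical audit columns to skip; filter them out right after reading
# INFORMATION_SCHEMA, before building update_pairs / insert_pairs / insert_values
EXCLUDED = {"CREATED_AT", "UPDATED_AT"}
columns = [col for col in columns if col not in EXCLUDED]

DRY_RUN = True  # hypothetical flag: log the generated SQL instead of running it
logger.info("Generated MERGE covering %d columns", len(columns))

if DRY_RUN:
    logger.info("Dry run, SQL not executed:\n%s", sql02)
else:
    res_df = snowpark_session.sql(sql02).collect()
    logger.info("MERGE result: %s", res_df)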
Try it out! And if you’ve got a cooler way to do this, hit me up—I love lazy (efficient) solutions.