AI-Based Data Transformation: A Comparison of LLM-Generated PySpark Code (Using Mistral & Google Gemini Advanced)

AI-Based Data Transformation: A Comparison of LLM-Generated PySpark Code (Using Mistral & Google Gemini Advanced)

Mistral's Le Chat didn't have the functionality to upload CSV files, so I couldn't generate a transformation code.

I could have built a lightweight ingestion process using Mistral API and Langchain. However, that was out of scope for the context of this blog post.

Next, we move on to Google's Large Language Models. Google's Gemini didn't allow CSV file upload, so I had to upgrade my Google Gemini account to an Advanced subscription. Good thing they gave me a 1-month free trial to test this out. Otherwise, I was looking at spending $19.99 per month subscription.

We used the same 3 CSV file formats below to generate code for processing ERCOT's "2-Day Ancillary Services Reports."

  • NP3-959-EX: 48-hour Aggregate AS Offers

  • NP3-960-EX: 48-hour Self-Arranged AS

  • NP3-961-EX: 48-hour Cleared DAM AS

The task is to ingest and transform this data using Azure Synapse Analytics and PySpark to derive meaningful insights.

We have a folder with Ercot's "2-Day Ancillary Services Reports" This report contains all 48 Hour disclosure data related to DAM. The following individual files are included in the report:NP3-959-EX 48-hour Aggregate AS Offers; NP3-960-EX 48-hour Self-Arranged AS; NP3-961-EX 48-hour Cleared DAM AS (previously named 48 Hour Ancillary Services Reports).

We gathered daily data as a zipped file containing multiple CSVs inside the Zipped file for the month of April. I have added all the zipped files in a master folder. So now there is a Master folder with a 30 days zipped file. Each zipped file has multiple CSV with data listed above.

Can you give me step by step procedure to do the following:

1) Ingest the CSV file within zipped folders using PySpark

2) filter outNP3-960-EX 48-hour Self-Arranged ASandNP3-961-EX 48-hour Cleared DAM AS

3) Ingest the file in the SQL database that I recently set up on my Azure.

Google Gemini Advanced failed to perform a simple data transformation coding task that was easily done by GPT-4, Claude 3.5, and Llama-3-70B.

Here is the result that I received from the prompt:

It seems that the PySpark library is not available. We will use pandas to read the csv files and then convert the dataframes to spark dataframes. We can then filter out the dataframes for 2d_Self_Arranged_AS_ECRSM-01-APR-24.csv and 2d_Cleared_DAM_AS_ECRSM-01-APR-24.csv and proceed.

Given that the files 2d_Agg_AS_Offers_ECRSM-01-APR-24.csv, 2d_Cleared_DAM_AS_ECRSM-01-APR-24.csv, and 2d_Self_Arranged_AS_ECRSM-01-APR-24.csv are present in the current directory, we will extract and read these directly.

It seems that the PySpark library is not available. We will use pandas to read the csv files and filter out the dataframes for 2d_Self_Arranged_AS_ECRSM-01-APR-24.csv and 2d_Cleared_DAM_AS_ECRSM-01-APR-24.csv. Since we cannot use PySpark to ingest the data into a SQL database, we will provide general instructions on how to achieve this using common libraries and approaches.

Given that the files 2d_Agg_AS_Offers_ECRSM-01-APR-24.csv, 2d_Cleared_DAM_AS_ECRSM-01-APR-24.csv, and 2d_Self_Arranged_AS_ECRSM-01-APR-24.csv are present in the current directory, we will extract and read these directly.

There appears to be no folder named 'Master_Folder' in the current directory. Please ensure the folder exists and re-run the code.

To ingest the filtered data into a SQL database (since PySpark is not available), you can use libraries like pandas and sqlalchemy. Here's a general approach:

Establish a Connection:

  • Use sqlalchemy to create an engine specifying your database credentials and type (e.g., 'mssql' for Azure SQL).

Define Table Schema (If Not Exists):

  • If the table doesn't exist in your database, define its schema (column names and data types) based on your DataFrame.

Ingest Data:

  • Use the df.to_sql() method in pandas to write the DataFrame into the SQL table. Specify the table name, engine, and optionally, how to handle existing data (append, replace, etc.).

Example Code Snippet:

from sqlalchemy import create_engine

# ... (previous code to read and filter DataFrames)

# Establish connection (replace placeholders with your credentials)
engine = create_engine('mssql+pyodbc://your_username:your_password@your_server_name/your_database_name?driver=ODBC+Driver+17+for+SQL+Server')

# Ingest data (replace 'your_table_name' with the desired table name)
for df in filtered_dfs:
    df.to_sql('your_table_name', engine, if_exists='append', index=False)

Key Points:

  • Replace placeholders with your actual Azure SQL database credentials.

  • Install required libraries (sqlalchemy, pyodbc) if not already installed.

  • Adjust the if_exists parameter in to_sql() based on your preference (e.g., 'replace' to overwrite existing data).

  • Consider batching data insertion for large datasets to optimize performance.

Google's Gemini Advanced truly disappointed me with their results. While all other models gave the proper instructions with code, albeit, with some errors, Google failed to give the results.

I understand that they might not have implemented the Spark library installer in their compiler. Plus code generation might not be Google Gemini Advanced's key use case, but Google has been disappointing us with its AI over the months.

In a nutshell, I am canceling my $19.99 per month subscription.