In modern data pipelines, data often arrives in nested JSON or XML formats. These formats are flexible, supporting hierarchical relationships and arrays, but they can be challenging to handle with traditional relational approaches. Flattening this data efficiently is critical for analytics, reporting, and loading into data warehouses such as Snowflake. In this blog, we explore a dynamic, recursive approach to parsing nested JSON and XML in Spark using a reusable function, `flatten_df_recursive`.

## Real-World Scenario

Imagine working at an e-commerce company:

- **Customer Data** is stored in JSON files with nested structures for addresses, orders, and payment info.
- **Order Details** may contain arrays of items, each with nested product info.
- **Legacy Partners** send XML files for inventory and shipment updates.

Your goal:

- Load this data into Databricks/Spark.
- Flatten all nested structures dynamically (without hardcoding column names).
- Save the flattened output for analytics, ML, or reporting.

Challenges:

- Unknown nesting levels.
- Arrays with nested structs.
- Frequent schema changes.

This is exactly where recursive flattening comes in handy.

## The Recursive Flatten Function

Here’s the core function:

```python
from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.functions import col, explode_outer


def flatten_df_recursive(df):
    """
    Recursively flattens all nested StructType and ArrayType columns
    in a Spark DataFrame. Supports multiple nested levels for JSON/XML data.
    """
    # Track complex fields (StructType or ArrayType)
    complex_fields = [(field.name, field.dataType)
                      for field in df.schema.fields
                      if isinstance(field.dataType, (StructType, ArrayType))]

    while complex_fields:
        col_name, col_type = complex_fields.pop(0)

        # If StructType, expand its fields with aliases
        if isinstance(col_type, StructType):
            expanded_cols = [
                col(f"{col_name}.{nested_field.name}").alias(f"{col_name}_{nested_field.name}")
                for nested_field in col_type.fields
            ]
            df = df.select("*", *expanded_cols).drop(col_name)

        # If ArrayType, explode the array
        elif isinstance(col_type, ArrayType):
            df = df.withColumn(col_name, explode_outer(col(col_name)))

        # Refresh the complex fields list after modifications
        complex_fields = [(field.name, field.dataType)
                          for field in df.schema.fields
                          if isinstance(field.dataType, (StructType, ArrayType))]

    return df
```
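Before applying it to real files, here is a minimal, self-contained sketch of what the function does. The sample record, column names, and expected output below are illustrative assumptions rather than data from the scenario; `spark` refers to an existing SparkSession, as in a Databricks notebook.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in Databricks notebooks

# Hypothetical record: a nested address struct plus an array of order-item structs
sample = [
    Row(
        customer_id=1,
        address=Row(city="Austin", zip="78701"),
        items=[Row(sku="A100", qty=2), Row(sku="B200", qty=1)],
    )
]

df = spark.createDataFrame(sample)
df.printSchema()
# customer_id (long), address (struct<city, zip>), items (array<struct<sku, qty>>)

flat_df = flatten_df_recursive(df)
flat_df.show(truncate=False)
# Expected flat columns: customer_id, address_city, address_zip, items_sku, items_qty,
# with one output row per element of the items array.
```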
""" # Track complex fields (StructType or ArrayType) complex_fields = [(field.name, field.dataType) for field in df.schema.fields if isinstance(field.dataType, (StructType, ArrayType))] while complex_fields: col_name, col_type = complex_fields.pop(0) # If StructType, expand its fields with aliases if isinstance(col_type, StructType): expanded_cols = [ col(f"{col_name}.{nested_field.name}").alias(f"{col_name}_{nested_field.name}") for nested_field in col_type.fields ] df = df.select("*", *expanded_cols).drop(col_name) # If ArrayType, explode the array elif isinstance(col_type, ArrayType): df = df.withColumn(col_name, explode_outer(col(col_name))) # Refresh the complex fields list after modifications complex_fields = [(field.name, field.dataType) for field in df.schema.fields if isinstance(field.dataType, (StructType, ArrayType))] return df Key Features: Key Features: Fully dynamic: Handles any number of nested struct or array levels. No hardcoding: Works even if new fields are added later. Array support: Automatically explodes arrays to rows. XML & JSON friendly: Works with any hierarchical data loaded into Spark. Fully dynamic: Handles any number of nested struct or array levels. Fully dynamic: No hardcoding: Works even if new fields are added later. No hardcoding: Array support: Automatically explodes arrays to rows. Array support: XML & JSON friendly: Works with any hierarchical data loaded into Spark. XML & JSON friendly: Reading Nested JSON in Spark Reading Nested JSON in Spark Read the JSON file with multiline df = spark.read.option("multiline", "true").json("dbfs:/FileStore/temp/orders.json") display(df) df = spark.read.option("multiline", "true").json("dbfs:/FileStore/temp/orders.json") display(df) Passing the JSON Data Frame into flatten_df_recursive will flatten all nested structs and arrays, making the data ready for analytics or reporting. flatten_df_recursive flatten all nested structs and arrays flat_df = flatten_df_recursive(df) display(flat_df) flat_df = flatten_df_recursive(df) display(flat_df) Reading Nested XML in Spark Reading Nested XML in Spark xml_path = "/dbfs/data/nested_orders.xml" df = spark.read.format("xml") \ .option("rowTag", "order") \ .load(xml_path) flat_df = flatten_df_recursive(df) flat_df.show(truncate=False) xml_path = "/dbfs/data/nested_orders.xml" df = spark.read.format("xml") \ .option("rowTag", "order") \ .load(xml_path) flat_df = flatten_df_recursive(df) flat_df.show(truncate=False) “rowTag” indicates the repeating XML element representing one record. Nested elements and arrays are automatically handled by “flatten_df_recursive”. “rowTag” indicates the repeating XML element representing one record. “rowTag” Nested elements and arrays are automatically handled by “flatten_df_recursive”. flatten_df_recursive” Why This Approach Matters Why This Approach Matters Dynamic Schema Handling: No need to rewrite flattening logic if JSON/XML structure changes. Supports Multi-Level Nesting: Works for deeply nested structs and arrays. Scalable: Can process large files on Spark without loading everything in memory. Reusable: Works for any source — JSON, XML, Snowflake, Delta, or Parquet with nested structures. Dynamic Schema Handling: No need to rewrite flattening logic if JSON/XML structure changes. Supports Multi-Level Nesting: Works for deeply nested structs and arrays. Scalable: Can process large files on Spark without loading everything in memory. Reusable: Works for any source — JSON, XML, Snowflake, Delta, or Parquet with nested structures. 
## Summary

Nested JSON and XML are common in modern data pipelines, but traditional flattening approaches break down with deep nesting or schema changes. By using a recursive, dynamic Spark flattening function, you can:

- Automatically flatten any depth of nesting.
- Handle arrays and structs seamlessly.
- Integrate with analytics, ML pipelines, and warehouses like Snowflake.

This approach is especially useful for ETL pipelines, data lake ingestion, and reporting systems where the data structure evolves frequently.

I have attached the entire notebook report for your reference: Notebook Report
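As one concrete example of the warehouse integration mentioned above, a flattened DataFrame can be pushed to Snowflake with the Snowflake Spark connector. This is a hedged sketch, assuming the connector is installed on the cluster; all connection options, credentials, and the target table name are placeholders, and credentials should come from a secrets manager rather than plain text.

```python
# Hypothetical Snowflake connection options
sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

(
    flat_df.write
    .format("snowflake")               # requires the Snowflake Spark connector
    .options(**sf_options)
    .option("dbtable", "ORDERS_FLAT")  # hypothetical target table
    .mode("overwrite")
    .save()
)
```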