A modern lakehouse helps teams move beyond raw storage in data lakes by structuring data for analysis and decision-making. While a data warehouse keeps structured, analytics-ready data and a data lake stores raw data for schema-on-read access and exploratory use cases, a lakehouse combines these paradigms in one unified platform. A lakehouse built on a table format such as Apache Iceberg enables ACID transactions and schema evolution on low-cost object storage.
The Medallion architecture is a popular design pattern for building a lakehouse by organizing data into three layers: Bronze, Silver, and Gold. In this article, we show how you can use Nussknacker Cloud to implement this architecture on AWS S3 using Apache Iceberg and the AWS Glue Catalog.

Medallion architecture
The Medallion architecture is a way of structuring data in a lakehouse so that it becomes progressively cleaner, more reliable, and more valuable as it moves through the pipeline. It is built around three main layers:
- Bronze layer - the entry point of the data lake. It stores raw data exactly as ingested, often straight from streaming platforms or batch systems. The focus here is on completeness and fidelity rather than quality.
- Silver layer - the enriched, conformed layer. Data in Bronze is cleaned, validated, and enriched with relevant data (for example, dimension tables from relational databases). The Silver layer represents standardized data, ready to be used as a source for analysis by Data Scientists.
- Gold layer - the business-level layer. Data from the Silver layer is aggregated, summarized, or modeled to answer specific business questions. Gold datasets are typically consumed by BI tools or ML pipelines.
As with many architectural patterns, there are multiple interpretations of the specifics. The Silver layer, for example, is sometimes limited to light transformations and doesn't include enrichment, but for the sake of simplicity, we'll stick with the definitions above.
This layered approach brings much-needed order to a data lake. It’s simple, easy to understand, and supports built-in recovery - downstream tables can always be recreated from the data in the Bronze layer.
Lakehouse storage with AWS Glue and Iceberg
A lakehouse needs storage that can handle large volumes of diverse data - from raw event streams to curated analytics tables - without sacrificing scalability or cost efficiency. Traditional databases aren’t designed for this kind of workload, which is why most lakehouses rely on object storage such as Amazon S3. It provides durable, virtually unlimited storage at a low cost.
However, object storage alone is not enough. For the structured approach of a lakehouse, we also need a mechanism to discover and manage the structure of the datasets. For this purpose, AWS provides the AWS Glue Catalog – a centralized metadata store that manages schemas and simplifies integration with processing engines.
Still, even with a metadata store, we may miss some features of good old relational databases, like ACID transactions. This is where table formats come in. Apache Iceberg is an open table format that integrates with AWS S3 and the Glue Catalog. It provides not only ACID operations, but also other powerful features, such as schema evolution, partition evolution, and even time travel.
Sales analytics example
Let’s see how medallion architecture can be used to solve a specific problem. Suppose we are building a data pipeline for a retail store. We have two data sources:
- Continuous stream of sales events, each representing a sale with the following fields:
  - price
  - sale_time
  - product_id – id referencing a product in the database
- Static data table containing product data with the following columns:
  - product_id
  - product_name
  - product_category
Even for this simple data model, we can gain some insights by building a dataset containing aggregated sales metrics like sum, average, and count of all sales for every product category in a time window – for example, every hour. This would be business-level data that could be displayed on a BI dashboard and serve as a basis for business decisions. In a real-world scenario, we’d have more data points to build business insights, and they would be much more valuable, but for this scenario, let’s keep it simple.
It makes sense to publish the sales events to a streaming queue, since we can produce and process these events continuously. Product data could be placed in a relational database like PostgreSQL.
For our data lakehouse, we'd like to store the sales events as raw data in an Iceberg table as part of the Bronze layer. For each sales event, we can query the database to enrich it and store the result in the Silver layer. After that, we can aggregate the enriched events by applying sum and average functions to the price data, grouping by product category, and using a 1-hour time window.
You've probably noticed the shift to more technical streaming terms like enrichment, aggregation, and time windows. This is the point where simple data processing isn't enough – the requirements demand a dedicated streaming engine like Apache Flink to manage the complexity and continuous flow. However, integrating Flink with modern table formats like Iceberg involves significant upfront costs in the form of initial configuration and management of the technical dependencies. Nussknacker Cloud effectively abstracts these details, making the deployment process simple and effortless.
Nussknacker Cloud Setup
This section is optional. Its goal is to show how to set up external services and configure a Nussknacker Cloud instance so that you can fully reproduce the scenario on your own.
Nussknacker Cloud leverages Flink to enable stateful operations, like aggregations on your data. Out of the box, it provides components for operating on data, but Sources (components for pulling data into a scenario) and Sinks (components for outputting data from a scenario) generally rely on Integrations. Integrations are available as a tab in the Nussknacker Cloud Admin Panel. When you add an integration, Nussknacker Cloud will automatically add the associated components to the palette in your Nussknacker instance.

For our scenario, we need to integrate with 2 external systems:
- Relational Database integration, like PostgreSQL, for our static products data
- AWS Glue integration for storing output data, serving as a lakehouse
PostgreSQL integration for enrichment
The PostgreSQL integration requires the database to be publicly reachable and secured with a password. If you don't already have one, you can use a free-tier managed PostgreSQL instance, for example the one available on the Aiven platform.

After you save the Postgres integration, your Nussknacker instance should have 5 components prefixed with your integration name. The component we will be using for enrichment is the Lookup component.

The component only sees existing tables, so before we use it, we need to create a products table and populate it. To do that, connect to your database with psql or a database admin tool like DBeaver or pgAdmin and execute the table-creation and insert statements – a sketch follows below.
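Here is a minimal sketch of that setup, wrapped in a Python script using psycopg2 so it stays self-contained; the embedded CREATE TABLE and INSERT statements can just as well be pasted directly into psql or DBeaver. The connection details and sample product rows are assumptions – replace them with your own.

# Minimal sketch: create and populate the products table used for enrichment.
# Connection details and sample rows are illustrative – adjust them to your database.
import psycopg2

conn = psycopg2.connect(
    host="your-postgres-host", port=5432,
    dbname="defaultdb", user="avnadmin", password="your-password",
    sslmode="require",  # managed services such as Aiven typically require SSL
)
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS products (
            product_id       TEXT PRIMARY KEY,
            product_name     TEXT NOT NULL,
            product_category TEXT NOT NULL
        )
    """)
    cur.execute("""
        INSERT INTO products (product_id, product_name, product_category) VALUES
            ('P-001', 'Wireless Mouse',      'electronics'),
            ('P-002', 'Mechanical Keyboard', 'electronics'),
            ('P-003', 'Coffee Beans 1kg',    'groceries')
        ON CONFLICT (product_id) DO NOTHING
    """)
conn.close()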
AWS Glue integration for Iceberg lakehouse storage
The AWS Glue integration allows reading from and writing to Iceberg tables registered in AWS Glue and stored in S3. An Iceberg table is essentially a dataset in the Iceberg format. This format consists of a data folder, which stores the actual data as Parquet, Avro, or ORC files, and a metadata folder, which stores information about schemas, snapshots, and partitions. The metadata layer is what enables such unique features as time travel or schema evolution tracking.
Besides the Iceberg format, the architecture of this whole integration may seem confusing, but it comes down to 4 parts:
- S3 – stores data in Iceberg format
- Glue – tracks Iceberg tables, allows for discovery
- Lake Formation – governs access for Glue and S3 (controls what data users can query inside the lake)
- IAM – authenticates users
The integration connects to Iceberg tables governed by Lake Formation and Glue. For this, the AWS Lake Formation documentation on creating Iceberg tables may be useful; a sketch of creating one of the tables with Athena DDL is also shown after the table below. For these tables to be available in Nussknacker, they need to have a schema, and they need to be accessible to an AWS IAM user.
For our scenario, we will need to create 3 Iceberg tables:
sales_raw:
[
  { "Name": "sale_time", "Type": "timestamp" },
  { "Name": "product_id", "Type": "string" },
  { "Name": "price", "Type": "decimal(10,2)" }
]

sales_enriched:
[
  { "Name": "sale_time", "Type": "timestamp" },
  { "Name": "product_id", "Type": "string" },
  { "Name": "price", "Type": "decimal(10,2)" },
  { "Name": "product_name", "Type": "string" },
  { "Name": "product_category", "Type": "string" }
]

sales_aggregated:
[
  { "Name": "product_category", "Type": "string" },
  { "Name": "total_sales", "Type": "decimal(10,2)" },
  { "Name": "avg_price", "Type": "decimal(10,2)" },
  { "Name": "sales_count", "Type": "bigint" },
  { "Name": "window_start", "Type": "timestamp" },
  { "Name": "window_end", "Type": "timestamp" }
]
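One way to create these tables (besides the Lake Formation console) is with Athena DDL. Below is a minimal sketch for the sales_raw table, submitted through boto3; the database name, S3 locations, and region are assumptions, and the other two tables can be created the same way using the schemas above.

# Sketch: create the sales_raw Iceberg table via Athena DDL submitted with boto3.
# Database, bucket names, and region are assumptions – adjust them to your setup.
import boto3

athena = boto3.client("athena", region_name="eu-central-1")

ddl = """
CREATE TABLE sales_lakehouse.sales_raw (
  sale_time  timestamp,
  product_id string,
  price      decimal(10,2)
)
LOCATION 's3://your-lakehouse-bucket/sales_raw/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "sales_lakehouse"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-query-results/"},
)

Whichever way you create the tables, remember to grant the IAM user used by the integration access to them in Lake Formation.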


In the Nussknacker Cloud Admin Panel, in the AWS Glue integration creation form, you will also need to provide Access Key credentials for an IAM user with the necessary permissions. You can view them by clicking the link in the text under the credential fields in the form. After you save the AWS Glue integration, your Nussknacker instance should have 2 components prefixed with your integration name.
Nussknacker Scenario
After we’ve set up all required integrations, let’s build the scenario. It’s mostly going to mirror the architecture diagram from before:
- HTTP Endpoint source for continuously ingesting sales events
- Iceberg Sink for writing Bronze-layer raw sales records to Iceberg
- Postgres Lookup enricher for fetching product data
- Iceberg Sink for writing Silver-layer enriched sales records to Iceberg
- Aggregate component for stateful computation of real-time sales metrics
- Iceberg Sink for writing Gold-layer sales aggregates to Iceberg
Ingesting a stream of events over HTTP
The HTTP Endpoint component allows for continuous ingestion of events sent to an HTTP endpoint. The body of each POST request sent to the configured endpoint is ingested as an event into the scenario downstream. Providing JSON in the data sample parameter makes referring to event fields easier by enabling suggestions in expressions. For the purposes of later aggregation, the event time is taken from the 'sale_time' field. For a more comprehensive exploration of event time, check out Arek Burdach's Understanding Event Time blog.
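For example, the data sample could look like the following (the field values here are purely illustrative):

{ "price": 49.99, "sale_time": "2024-06-01T12:34:56", "product_id": "P-001" }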

Writing data to Iceberg
This raw stream of sales events will be written to Iceberg as Bronze-layer data. The schema of the Iceberg table expects the event time as a 'LocalDateTime' type. Nussknacker doesn't automatically recognize the 'sale_time' field from the incoming request's JSON body as a date type, so it has to be parsed explicitly. Data validation, cleaning, and parsing like this can be done in a separate Variable node, which also allows the resulting value to be reused in other downstream nodes.


Enrichment from Database
For further processing into the Silver and Gold layers, the stream can be split into branches processed in parallel using the Split component. Nussknacker provides multiple enrichment integrations. If you have an HTTP service you want to use for enrichment, you can use the OpenAPI integration; for that, check out Maciek Cichanowicz's blog “A guide to OpenAPI components”. For the current scenario, a database lookup enricher is a good fit: the Lookup component based on the previously configured Postgres integration can be used to enrich sales events with product data.

The table created and populated in the previous section should be available in the ‘Table’ parameter. This node looks up a record from the table based on the ‘product_id’ field of the incoming event and adds it to the stream. The record is then available under the ‘product_lookup’ output variable, which can be referenced downstream.
Going along with the medallion architecture, enriched records can be written to Iceberg as part of the Silver layer.

Aggregating sales metrics
An aggregate component is essential for turning the enriched sales stream into actionable metrics. Extracting valuable insights from continuous, real-time streams usually requires taking the time aspect into account. Fortunately, Nussknacker excels in this area.

The ‘Aggregate Tumbling’ component performs aggregation in fixed-length time windows that follow one after another without overlapping. The windows are aligned to full time units. With the 1-minute window length shown in the configuration in the image, each window begins at the 0th second of a minute and ends when that minute is over.
Within each window, the records are grouped by ‘product_category’. Within each group, we take:
- sum of prices,
- average of prices,
- count of all events,
- first sale time.
The aggregation emits the result for each window after it closes.
All that’s left is to write this data to Iceberg into the ‘sales_aggregated’ table.

The key variable is the list of values given in the ‘groupBy’ parameter of the aggregation node. In our case, it's a one-element list whose only element is the product category, so it can be accessed at index 0. The window start and end times can be easily calculated, knowing that windows are aligned to full minutes and that our window length is 1 minute.
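For intuition only, here is a small plain-Python sketch of what the window aggregation and the sink mapping compute. This is not how Nussknacker executes it – Flink handles the windowing incrementally – it just illustrates the semantics, with field names following the schemas above and sample events made up for the example.

# Intuition-only sketch of a 1-minute tumbling-window aggregation over enriched sales events.
# Nussknacker/Flink computes this in a streaming fashion; here we only illustrate the semantics.
from collections import defaultdict
from datetime import datetime, timedelta

enriched_events = [
    {"sale_time": datetime(2024, 6, 1, 12, 0, 5),  "price": 49.99, "product_category": "electronics"},
    {"sale_time": datetime(2024, 6, 1, 12, 0, 40), "price": 19.99, "product_category": "electronics"},
    {"sale_time": datetime(2024, 6, 1, 12, 1, 10), "price": 7.50,  "product_category": "groceries"},
]

# Tumbling windows aligned to full minutes: truncate each event time to the minute.
windows = defaultdict(list)
for event in enriched_events:
    window_start = event["sale_time"].replace(second=0, microsecond=0)
    windows[(window_start, event["product_category"])].append(event)

# One output row per (window, category) group, emitted when the window closes.
for (window_start, category), events in sorted(windows.items()):
    prices = [e["price"] for e in events]
    row = {
        "product_category": category,                       # groupBy key – index 0 of the key list
        "total_sales": sum(prices),
        "avg_price": sum(prices) / len(prices),
        "sales_count": len(prices),
        "window_start": window_start,                       # aligned to the full minute
        "window_end": window_start + timedelta(minutes=1),  # start plus the 1-minute window length
    }
    print(row)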
Scenario deployment and checking the results in Athena
Let's deploy the scenario and observe the flow. Once it is running, we'll launch a script that continuously generates data and sends it to the source endpoint.
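A minimal sketch of such a generator is shown below; the endpoint URL, product ids, and price range are assumptions, so adjust them to your HTTP Endpoint source (including any authentication it requires) and to the contents of your products table.

# Sketch: continuously generate random sales events and POST them to the HTTP Endpoint source.
# The URL, product ids, and price range are assumptions – adjust them to your setup,
# and add any authentication headers your endpoint requires.
import json
import random
import time
from datetime import datetime, timezone
from urllib.request import Request, urlopen

ENDPOINT_URL = "https://your-instance.nussknacker.cloud/your-sales-endpoint"  # hypothetical URL
PRODUCT_IDS = ["P-001", "P-002", "P-003"]

while True:
    event = {
        "price": round(random.uniform(5.0, 100.0), 2),
        "sale_time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S"),
        "product_id": random.choice(PRODUCT_IDS),
    }
    request = Request(
        ENDPOINT_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        print(response.status, event)
    time.sleep(1)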

The event flow is visible in real time. Each event can be inspected in detail when entering any node.

The data written to Glue Iceberg tables can be previewed in Athena or displayed in BI tools.
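If you'd rather inspect the results programmatically, a library such as pyiceberg can read the Glue-registered tables directly. Below is a minimal sketch, assuming AWS credentials are available in the environment, the database is named sales_lakehouse, and the pyiceberg[glue], pyarrow, and pandas packages are installed – all of these are assumptions about your setup.

# Sketch: read the aggregated results straight from the Glue catalog with pyiceberg.
# Assumes AWS credentials in the environment and a Glue database named 'sales_lakehouse'.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("glue_catalog", type="glue")
table = catalog.load_table("sales_lakehouse.sales_aggregated")

# Scan the whole table into a pandas DataFrame (fine for a demo-sized dataset).
df = table.scan().to_pandas()
print(df.sort_values("window_start").tail())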

For further processing, you can also read these tables back into Nussknacker and process them there using the ‘Iceberg AWS Glue’ Source component based on the Glue integration.
Summary
Using Nussknacker Cloud with AWS Glue and Iceberg, we built a real-time pipeline that ingests, enriches, and aggregates streaming sales data directly into a lakehouse. Each operation - from data parsing and cleaning, through database enrichment, to stateful aggregation and Iceberg writes - is represented as an intuitive, configurable node.
Nussknacker models pipelines visually, focusing on data flow and transformations rather than syntax. Everything happens without a single line of code, and with built-in testing and live event inspection, mistakes are easy to spot and fix instantly. You can watch data move through the pipeline in real time, making development faster, clearer, and more interactive.
It’s streaming analytics without the complexity - turning what used to take days of coding into minutes of visual design, with Iceberg ensuring every result is stored reliably, versioned, and ready for time-travel queries.