Organizations are rapidly expanding their digital presence, creating opportunities to serve customers better through web applications. AWS WAF logs play a vital role in this expansion by enabling organizations to proactively monitor security, enforce compliance, and strengthen application defense. AWS WAF log analysis is essential across many industries, including banking, retail, and healthcare, each needing to deliver secure digital experiences.
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They are using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead. Apache Iceberg combines enterprise reliability with SQL simplicity when working with security data stored in Amazon Simple Storage Service (Amazon S3), enabling organizations to focus on security insights rather than infrastructure management.
Apache Iceberg enhances security analytics through several key capabilities. It seamlessly integrates with various AWS services and analysis tools while supporting concurrent read-write operations for simultaneous log ingestion and analysis. Its time travel feature enables thorough security forensics and incident investigation, and its schema evolution support allows teams to adapt to emerging security patterns without disrupting existing workflows. These capabilities make Apache Iceberg an ideal choice for building robust security analytics solutions. However, organizations often struggle when building their own solutions to deliver data to Apache Iceberg tables. Common challenges include managing complex extract, transform, and load (ETL) processes, handling schema validation, providing reliable delivery, and maintaining custom code for data transformations. Teams must also build resilient error handling, implement retry logic, and manage scaling infrastructure, all while maintaining data consistency and high availability. These challenges take valuable time away from analyzing security data and deriving insights.
To address these challenges, Amazon Data Firehose provides real-time data delivery to Apache Iceberg tables within seconds. Firehose delivers high reliability across multiple Availability Zones while automatically scaling to match throughput requirements. It is fully managed and requires no infrastructure management or custom code development. Firehose delivers streaming data with configurable buffering options that can be optimized for near-zero latency. It also provides built-in data transformation, compression, and encryption capabilities, along with automatic retry mechanisms to provide reliable data delivery. This makes it an ideal choice for streaming AWS WAF logs directly into a data lake while minimizing operational overhead.
In this post, we demonstrate how to build a scalable AWS WAF log analysis solution using Firehose and Apache Iceberg. Firehose simplifies the entire process—from log ingestion to storage—by allowing you to configure a delivery stream that delivers AWS WAF logs directly to Apache Iceberg tables in Amazon S3. The solution requires no infrastructure setup and you pay only for the data you process.
Solution overview
To implement this solution, you first configure AWS WAF logging, which captures detailed information about the traffic analyzed by your web access control lists (ACLs). Each log entry includes the request timestamp, detailed request information, and the rule matches that were triggered. These logs are continuously streamed to Firehose in real time.
Firehose writes these logs into an Apache Iceberg table, which is stored in Amazon S3. When Firehose delivers data to the S3 table, it uses the AWS Glue Data Catalog to store and manage table metadata. This metadata includes schema information, partition details, and file locations, enabling seamless data discovery and querying across AWS analytics services.
Finally, security teams can analyze data in the Apache Iceberg tables using various AWS services, including Amazon Redshift, Amazon Athena, Amazon EMR, and Amazon SageMaker. For this demonstration, we use Athena to run SQL queries against the security logs.
The following diagram illustrates the solution architecture.
The implementation consists of four steps:
- Deploy the base infrastructure using AWS CloudFormation.
- Create an Apache Iceberg table using an AWS Glue notebook.
- Create a Firehose stream to handle the log data.
- Configure AWS WAF logging to send data to the Apache Iceberg table through the Firehose stream.
You can deploy the required resources into your AWS environment in the US East (N. Virginia) AWS Region using a CloudFormation template. This template creates an S3 bucket for storing AWS WAF logs, an AWS Glue database for the Apache Iceberg tables, and the AWS Identity and Access Management (IAM) roles and policies needed for the solution.
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account with access to the US East (N. Virginia) Region
- AWS WAF configured with a web ACL in the US East (N. Virginia) Region
If you don’t have AWS WAF set up, refer to the AWS WAF Workshop to create a sample web application with AWS WAF.
AWS WAF logs use case-sensitive field names (like httpRequest and webaclId). For successful log ingestion, this solution uses the Apache Iceberg API through an AWS Glue job to create tables, a reliable approach that preserves the exact field names from the AWS WAF logs. Although AWS Glue crawlers and Athena DDL statements offer convenient ways to create Apache Iceberg tables, they convert mixed-case column names to lowercase, which can affect AWS WAF log processing. By using an AWS Glue job with the Apache Iceberg API, the case-sensitivity of column names is preserved, providing proper mapping between AWS WAF log fields and table columns.
Deploy the CloudFormation stack
Complete the following steps to deploy the solution resources with AWS CloudFormation:
- Sign in to the AWS CloudFormation console.
- Choose Launch Stack.
- Choose Next.
- For Stack name, leave the default WAF-Firehose-Iceberg-Stack.
- Under Parameters, specify whether AWS Lake Formation permissions are to be used for the AWS Glue tables.
- Choose Next.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Next.
- Review the deployment and choose Submit.
The stack takes several minutes to deploy. After the deployment is complete, you can review the resources created by navigating to the Resources tab on the CloudFormation stack.
Create an Apache Iceberg table
Before setting up the Firehose delivery stream, you must create the destination Apache Iceberg table in the Data Catalog. This is done using AWS Glue jobs and the Apache Iceberg API, as discussed earlier. Complete the following steps to create an Apache Iceberg table:
- On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
- Choose the Notebook option under Create job.
- Under Options, select Start fresh.
- For IAM role, choose the role WAF-Firehose-Iceberg-Stack-GlueServiceRole-*.
- Choose Create notebook.
- Enter a configuration command in the notebook to configure the Spark session with Apache Iceberg extensions (see the configuration sketch after this list). Be sure to update the configuration for sql.catalog.glue_catalog.warehouse to the S3 bucket created by the CloudFormation template.
- Enter SQL in the AWS Glue notebook to create the Apache Iceberg table (see the DDL sketch after this list).
- Navigate to the Data Catalog and the waf_logs_db database to confirm the firehose_waf_logs table is created.
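The Spark session configuration referenced above can follow this minimal sketch for an AWS Glue notebook, using the %%configure cell magic. The catalog name glue_catalog is an assumption used throughout these examples, and the <S3BucketName> placeholder stands for the bucket created by the CloudFormation template:

```
%%configure
{
  "--datalake-formats": "iceberg",
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://<S3BucketName>/ --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO"
}
```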
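The table creation SQL referenced above can be based on the following sketch, run in the same notebook (for example, through the %%sql cell magic or spark.sql). The column list is abridged and illustrative; AWS WAF logs include additional fields (such as headers, ruleGroupList, and labels) that you can add from the AWS WAF log schema. Note the mixed-case column names, which the Apache Iceberg API preserves:

```sql
-- Abridged, illustrative schema for AWS WAF logs; extend it with the
-- remaining AWS WAF log fields your analysis needs.
CREATE TABLE glue_catalog.waf_logs_db.firehose_waf_logs (
    `timestamp`         bigint,
    formatVersion       int,
    webaclId            string,
    terminatingRuleId   string,
    terminatingRuleType string,
    action              string,
    httpSourceName      string,
    httpSourceId        string,
    httpRequest         struct<
        clientIp:    string,
        country:     string,
        uri:         string,
        args:        string,
        httpVersion: string,
        httpMethod:  string,
        requestId:   string
    >
)
USING iceberg;
```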
Create a Firehose stream
Complete the following steps to create a Firehose stream:
- On the Data Firehose console, choose Create Firehose stream.
- Choose Direct PUT for Source and Apache Iceberg Tables for Destination.
- For Firehose stream name, enter aws-waf-logs-firehose-iceberg-1. (AWS WAF requires the stream name to begin with aws-waf-logs-.)
- In the Destination settings section, enable Inline parsing for routing information. Because we’re sending all records to one table, specify the destination database and table names:
  - For Database expression, enter "waf_logs_db".
  - For Table expression, enter "firehose_waf_logs".
Make sure to include double quotation marks to use the literal value for the database and table name. If you don’t use double quotation marks, Firehose assumes that this is a JSON query expression and will attempt to parse the expression when processing your stream and fail. Firehose can also route to different Apache Iceberg Tables based on the content of the data. For more information, refer to Route incoming records to different Iceberg Tables.
- For S3 backup bucket, enter the S3 bucket created by the CloudFormation template.
- For S3 backup bucket error output prefix, enter error/events-1/.
- Under Advanced settings, select Enable server-side encryption for source records in Firehose stream.
- For Existing IAM roles, choose the role that starts with WAF-Firehose-Iceberg-Stack-FirehoseIAMRole-*, created by the CloudFormation template.
- Choose Create Firehose stream.
Configure AWS WAF logging to the Firehose stream
Complete the following steps to configure AWS WAF to send logs to the Firehose stream:
- On the AWS WAF console, choose Web ACLs in the navigation pane.
- Choose your web ACL.
- On the Logging and metrics tab, choose Enable.
- For Amazon Data Firehose stream, choose the stream aws-waf-logs-firehose-iceberg-1.
- Choose Save.
Query and analyze the logs
You can query the data you’ve written to your Apache Iceberg tables using different processing engines, such as Apache Spark, Apache Flink, or Trino. In this example, we use Athena to query AWS WAF logs data stored in Apache Iceberg tables. Complete the following steps:
- On the Athena console, choose Settings in the top right corner.
- For Location of query result, enter the S3 bucket created by the CloudFormation template: s3://<S3BucketName>/athena/.
- For Expected bucket owner, enter your AWS account ID and choose Save.
- In the query editor, under Tables and views, choose the options menu next to firehose_waf_logs and choose Preview Table.
You should be able to see the AWS WAF logs in the Apache Iceberg tables by using Athena.
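Preview Table simply runs a sampling query against the table, similar to the following sketch:

```sql
SELECT * FROM "waf_logs_db"."firehose_waf_logs" LIMIT 10;
```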
The following are some additional useful example queries:
- Identify potential attack sources by analyzing blocked IP addresses (see the first query sketch below).
- Monitor attack patterns and trends over time (see the second query sketch below).
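The first query below is a sketch that counts blocked requests per client IP address to surface the most active blocked sources; adjust the database and table names if yours differ:

```sql
-- Top 10 source IPs with the most blocked requests
SELECT httpRequest.clientIp AS client_ip,
       count(*)             AS blocked_requests
FROM "waf_logs_db"."firehose_waf_logs"
WHERE action = 'BLOCK'
GROUP BY httpRequest.clientIp
ORDER BY blocked_requests DESC
LIMIT 10;
```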
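The second query sketches hourly trend analysis of blocked requests by the rule that terminated them, assuming the timestamp column holds the AWS WAF epoch-millisecond timestamp:

```sql
-- Hourly count of blocked requests, broken down by terminating rule
SELECT date_trunc('hour', from_unixtime("timestamp" / 1000)) AS event_hour,
       terminatingRuleId,
       count(*) AS blocked_requests
FROM "waf_logs_db"."firehose_waf_logs"
WHERE action = 'BLOCK'
GROUP BY 1, 2
ORDER BY event_hour DESC, blocked_requests DESC;
```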
Apache Iceberg table optimization
Although Firehose enables efficient streaming of AWS WAF logs into Apache Iceberg tables, streaming writes can create many small files, because Firehose delivers data based on its buffering configuration. A large number of small files can degrade query performance, so regular table optimization is recommended.
There are two recommended table optimization approaches:
- Compaction – Data compaction merges small data files to reduce storage usage and improve read performance. Data files are merged and rewritten to remove obsolete data and consolidate fragmented data into larger, more efficient files.
- Storage optimization – You can manage storage overhead by removing older, unnecessary snapshots and their associated underlying files. Additionally, this includes periodically deleting orphan files to maintain efficient storage utilization and optimal query performance.
These optimizations can be implemented using either the Data Catalog or Athena.
Table optimization using the Data Catalog
The Data Catalog provides automatic table optimization features. Within the table optimization feature, you can configure specific optimizers for compaction, snapshot retention, and orphan file deletion. You can manage the optimization schedule and monitor its status from the AWS Glue console.
Table optimization using Athena
Athena supports manual optimization through SQL commands. The OPTIMIZE command rewrites small files into larger files and applies file compaction.
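For example, the following statement (using the table from this post) compacts small data files with bin packing; you can add a WHERE clause to limit the rewrite to specific partitions:

```sql
-- Rewrite small data files produced by streaming writes into larger files
OPTIMIZE waf_logs_db.firehose_waf_logs REWRITE DATA USING BIN_PACK;
```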
The VACUUM command removes old snapshots and cleans up expired data files.
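For example, again using the table from this post:

```sql
-- Expire old snapshots and delete data files that are no longer referenced
VACUUM waf_logs_db.firehose_waf_logs;
```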
You can monitor the table’s optimization status by querying the table metadata.
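One option, sketched below, is to query the Iceberg $files metadata table that Athena exposes; fewer, larger data files after running OPTIMIZE indicate that compaction is taking effect:

```sql
-- Number and size of the data files backing the table
SELECT count(*) AS data_file_count,
       round(avg(file_size_in_bytes) / 1048576.0, 2) AS avg_file_size_mb,
       round(sum(file_size_in_bytes) / 1048576.0, 2) AS total_size_mb
FROM "waf_logs_db"."firehose_waf_logs$files";
```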
Clean up
To avoid future charges, complete the following steps:
- Empty the S3 bucket.
- Delete the CloudFormation stack.
- Delete the Firehose stream.
- Disable AWS WAF logging.
Conclusion
In this post, we demonstrated how to build an AWS WAF log analytics pipeline using Firehose to deliver AWS WAF logs to Apache Iceberg tables on Amazon S3. The solution handles large-scale AWS WAF log processing without requiring complex code or infrastructure management. Although this post focused on Apache Iceberg tables as the destination, Data Firehose also seamlessly integrates with Amazon S3 Tables. To optimize your tables for querying, Amazon S3 Tables continuously performs automatic maintenance operations, such as compaction, snapshot management, and unreferenced file removal. These operations increase table performance by compacting smaller objects into fewer, larger files.
To get started with your own implementation, try the solution in your AWS account and explore the following resources for additional features and best practices:
About the Authors
Charishma Makineni is a Senior Technical Account Manager at AWS. She provides strategic technical guidance for Independent Software Vendors (ISVs) to build and optimize solutions on AWS. She specializes in Big Data and Analytics technologies, helping organizations optimize their data-driven initiatives on AWS.
Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.