Saturday, February 22, 2025

Streamline AWS WAF log analysis with Apache Iceberg and Amazon Data Firehose


Organizations are rapidly expanding their digital presence, creating opportunities to serve customers better through web applications. AWS WAF logs play a crucial role in this growth by enabling organizations to proactively monitor security, maintain compliance, and strengthen application protection. AWS WAF log analysis is essential across many industries, including banking, retail, and healthcare, each needing to deliver secure digital experiences.

To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They're using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead. Apache Iceberg combines enterprise reliability with SQL simplicity when working with security data stored in Amazon Simple Storage Service (Amazon S3), enabling organizations to focus on security insights rather than infrastructure management.

Apache Iceberg enhances security analytics through several key capabilities. It integrates seamlessly with various AWS services and analysis tools while supporting concurrent read-write operations for simultaneous log ingestion and analysis. Its time travel feature enables thorough security forensics and incident investigation, and its schema evolution support allows teams to adapt to emerging security patterns without disrupting existing workflows. These capabilities make Apache Iceberg an ideal choice for building robust security analytics solutions. However, organizations often struggle when building their own solutions to deliver data to Apache Iceberg tables. Challenges include managing complex extract, transform, and load (ETL) processes, handling schema validation, providing reliable delivery, and maintaining custom code for data transformations. Teams must also build resilient error handling, implement retry logic, and manage scaling infrastructure, all while maintaining data consistency and high availability. These challenges take valuable time away from analyzing security data and deriving insights.

To address these challenges, Amazon Data Firehose provides real-time data delivery to Apache Iceberg tables within seconds. Firehose delivers high reliability across multiple Availability Zones while automatically scaling to match throughput requirements. It's fully managed and requires no infrastructure management or custom code development. Firehose delivers streaming data with configurable buffering options that can be optimized for near-zero latency. It also provides built-in data transformation, compression, and encryption capabilities, along with automatic retry mechanisms for reliable data delivery. This makes it an ideal choice for streaming AWS WAF logs directly into a data lake while minimizing operational overhead.

In this post, we demonstrate how to build a scalable AWS WAF log analysis solution using Firehose and Apache Iceberg. Firehose simplifies the entire process, from log ingestion to storage, by allowing you to configure a delivery stream that sends AWS WAF logs directly to Apache Iceberg tables in Amazon S3. The solution requires no infrastructure setup, and you pay only for the data you process.

Solution overview

To implement this solution, you first configure AWS WAF logging to capture web traffic information. This captures detailed information about the traffic analyzed by your web access control lists (web ACLs). Each log entry includes the request timestamp, detailed request information, and the rule matches that were triggered. These logs are continuously streamed to Firehose in real time.
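For reference, each AWS WAF log record is a JSON document whose field names follow the AWS WAF logging format. The abbreviated sample below, expressed as a Python dictionary, is purely illustrative: the values are made up, and several optional fields are omitted.

# Abbreviated, hypothetical AWS WAF log record (values are illustrative only)
sample_waf_log = {
    "timestamp": 1740000000000,  # epoch milliseconds
    "formatVersion": 1,
    "webaclId": "arn:aws:wafv2:us-east-1:<account-id>:regional/webacl/<name>/<id>",
    "terminatingRuleId": "Default_Action",
    "terminatingRuleType": "REGULAR",
    "action": "ALLOW",
    "httpSourceName": "ALB",
    "httpSourceId": "<load-balancer-id>",
    "ruleGroupList": [],
    "httpRequest": {
        "clientIp": "198.51.100.23",
        "country": "US",
        "headers": [{"name": "Host", "value": "example.com"}],
        "uri": "/index.html",
        "args": "",
        "httpVersion": "HTTP/1.1",
        "httpMethod": "GET",
        "requestId": "<request-id>",
    },
}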

Firehose writes these logs into an Apache Iceberg table stored in Amazon S3. When Firehose delivers data to the table, it uses the AWS Glue Data Catalog to store and manage the table metadata. This metadata includes schema information, partition details, and file locations, enabling seamless data discovery and querying across AWS analytics services.

Finally, security teams can analyze the data in the Apache Iceberg tables using various AWS services, including Amazon Redshift, Amazon Athena, Amazon EMR, and Amazon SageMaker. For this demonstration, we use Athena to run SQL queries against the security logs.

The following diagram illustrates the solution architecture.

 

The implementation consists of four steps:

  1. Deploy the base infrastructure using AWS CloudFormation.
  2. Create an Apache Iceberg table using an AWS Glue notebook.
  3. Create a Firehose stream to handle the log data.
  4. Configure AWS WAF logging to send data to the Apache Iceberg table through the Firehose stream.

You can deploy the required resources into your AWS environment in the US East (N. Virginia) AWS Region using a CloudFormation template. This template creates an S3 bucket for storing AWS WAF logs, an AWS Glue database for the Apache Iceberg tables, and the AWS Identity and Access Management (IAM) roles and policies needed for the solution.
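If you prefer to script the deployment rather than use the Launch Stack button, a minimal boto3 sketch is shown below. The template URL is a placeholder (the actual template is the one behind the Launch Stack link), and because the stack creates named IAM roles, the CAPABILITY_NAMED_IAM capability is required.

import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# Placeholder template location; substitute the template used by the Launch Stack link
response = cloudformation.create_stack(
    StackName="WAF-Firehose-Iceberg-Stack",
    TemplateURL="https://<template-bucket>.s3.amazonaws.com/waf-firehose-iceberg.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles with custom names
)
print(response["StackId"])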

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account with access to the US East (N. Virginia) Region
  • AWS WAF configured with a web ACL in the US East (N. Virginia) Region

If you don't have AWS WAF set up, refer to the AWS WAF Workshop to create a sample web application with AWS WAF.

AWS WAF logs use case-sensitive field names (like httpRequest and webaclId). For successful log ingestion, this solution uses the Apache Iceberg API through an AWS Glue job to create the table, which is a reliable approach that preserves the exact field names from the AWS WAF logs. Although AWS Glue crawlers and Athena DDL statements offer convenient ways to create Apache Iceberg tables, they convert mixed-case column names to lowercase, which can affect AWS WAF log processing. By using an AWS Glue job with the Apache Iceberg API, the case sensitivity of column names is preserved, providing correct mapping between AWS WAF log fields and table columns.

Deploy the CloudFormation stack

Complete the following steps to deploy the solution resources with AWS CloudFormation:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack.
  3. Choose Next.
  4. For Stack name, leave the default WAF-Firehose-Iceberg-Stack.
  5. Under Parameters, specify whether AWS Lake Formation permissions are to be used for the AWS Glue tables.
  6. Choose Next.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Next.
  8. Review the deployment and choose Submit.

 

The stack takes a few minutes to deploy. After the deployment is complete, you can review the created resources by navigating to the Resources tab of the CloudFormation stack.

Create an Apache Iceberg table

Before setting up the Firehose delivery stream, you must create the destination Apache Iceberg table in the Data Catalog. This is done using an AWS Glue job and the Apache Iceberg API, as discussed earlier. Complete the following steps to create the Apache Iceberg table:

  1. On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
  2. Choose the Notebook option under Create job.
  3. Under Options, select Start fresh.
  4. For IAM role, choose WAF-Firehose-Iceberg-Stack-GlueServiceRole-*.
  5. Choose Create notebook.

  6. Enter the following configuration command in the notebook to configure the Spark session with Apache Iceberg extensions. Make sure to update the spark.sql.catalog.glue_catalog.warehouse value to point to the S3 bucket created by the CloudFormation template.
%%configure
{
    "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3:///waflogdata --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    "--datalake-formats": "iceberg"
}

  7. Enter the following code in the AWS Glue notebook to create the Apache Iceberg table:
# Note: This code uses AWS Glue version 5.0
# Check the AWS Glue release notes for the latest version and update accordingly:
# https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
# To update: change the %glue_version parameter below to the latest version

%idle_timeout 2880
%glue_version 5.0
%worker_type G.1X
%number_of_workers 5

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.conf import SparkConf

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

spark.sql(""" CREATE TABLE glue_catalog.waf_logs_db.firehose_waf_logs(
  `timestamp` bigint,
  `formatVersion` int,
  `webaclId` string,
  `terminatingRuleId` string,
  `terminatingRuleType` string,
  `action` string,
  `terminatingRuleMatchDetails` array <
                                    struct <
                                        conditiontype: string,
                                        sensitivitylevel: string,
                                        location: string,
                                        matcheddata: array < string >
                                          >
                                     >,
  `httpSourceName` string,
  `httpSourceId` string,
  `ruleGroupList` array <
                      struct <
                          rulegroupid: string,
                          terminatingrule: struct <
                                              ruleid: string,
                                              action: string,
                                              rulematchdetails: array <
                                                                   struct <
                                                                       conditiontype: string,
                                                                       sensitivitylevel: string,
                                                                       location: string,
                                                                       matcheddata: array < string >
                                                                          >
                                                                    >
                                                >,
                          nonterminatingmatchingrules: array <
                                                              struct <
                                                                  ruleid: string,
                                                                  action: string,
                                                                  overriddenaction: string,
                                                                  rulematchdetails: array <
                                                                                       struct <
                                                                                           conditiontype: string,
                                                                                           sensitivitylevel: string,
                                                                                           location: string,
                                                                                           matcheddata: array < string >
                                                                                              >
                                                                   >,
                                                                  challengeresponse: struct <
                                                                            responsecode: string,
                                                                            solvetimestamp: string
                                                                              >,
                                                                  captcharesponse: struct <
                                                                            responsecode: string,
                                                                            solvetimestamp: string
                                                                              >
                                                                    >
                                                             >,
                          excludedrules: string
                            >
                       >,
`rateBasedRuleList` array <
                         struct <
                             ratebasedruleid: string,
                             limitkey: string,
                             maxrateallowed: int
                               >
                          >,
  `nonTerminatingMatchingRules` array <
                                    struct <
                                        ruleid: string,
                                        action: string,
                                        rulematchdetails: array <
                                                             struct <
                                                                 conditiontype: string,
                                                                 sensitivitylevel: string,
                                                                 location: string,
                                                                 matcheddata: array < string >
                                                                    >
                                                             >,
                                        challengeresponse: struct <
                                                            responsecode: string,
                                                            solvetimestamp: string
                                                             >,
                                        captcharesponse: struct <
                                                            responsecode: string,
                                                            solvetimestamp: string
                                                             >
                                          >
                                     >,
  `requestHeadersInserted` array <
                                struct <
                                    name: string,
                                    value: string
                                      >
                                 >,
  `responseCodeSent` string,
  `httpRequest` struct <
                    clientip: string,
                    country: string,
                    headers: array <
                                struct <
                                    name: string,
                                    value: string
                                      >
                                 >,
                    uri: string,
                    args: string,
                    httpversion: string,
                    httpmethod: string,
                    requestid: string
                      >,
  `labels` array <
               struct <
                   name: string
                     >
                >,
  `CaptchaResponse` struct <
                        responsecode: string,
                        solvetimestamp: string,
                        failureReason: string
                          >,
  `ChallengeResponse` struct <
                        responsecode: string,
                        solvetimestamp: string,
                        failureReason: string
                        >,
  `ja3Fingerprint` string,
  `overSizeFields` string,
  `requestBodySize` int,
  `requestBodySizeInspectedByWAF` int
)
USING iceberg
TBLPROPERTIES ("format-version"="2")
""")
job.commit()

  8. Navigate to the Data Catalog and the waf_logs_db database to confirm that the table firehose_waf_logs has been created (an optional programmatic check follows).
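As an optional check, you can confirm the table registration from code. The following sketch assumes the table was created in the waf_logs_db database as shown above; Iceberg tables registered in the Data Catalog typically carry a table_type parameter of ICEBERG.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Confirm that the Iceberg table is registered in the Data Catalog
table = glue.get_table(DatabaseName="waf_logs_db", Name="firehose_waf_logs")
print(table["Table"]["Name"])
print(table["Table"]["Parameters"].get("table_type"))  # typically "ICEBERG"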

Create a Firehose stream

Complete the following steps to create a Firehose stream (a boto3 sketch of the equivalent configuration follows the list):

  1. On the Amazon Data Firehose console, choose Create Firehose stream.
  2. Choose Direct PUT for Source and Apache Iceberg Tables for Destination.
  3. For Firehose stream name, enter aws-waf-logs-firehose-iceberg-1.
  4. In the Destination settings section, enable Inline parsing for routing information. Because we're sending all records to one table, specify the destination database and table names:
    1. For Database expression, enter "waf_logs_db".
    2. For Table expression, enter "firehose_waf_logs".

Make sure to include the double quotation marks so Firehose treats the database and table names as literal values. If you omit the quotation marks, Firehose assumes the value is a JSON query expression, attempts to parse it when processing your stream, and fails. Firehose can also route to different Apache Iceberg tables based on the content of the data. For more information, refer to Route incoming records to different Iceberg tables.

  5. For S3 backup bucket, enter the S3 bucket created by the CloudFormation template.
  6. For S3 backup bucket error output prefix, enter error/events-1/.
  7. Under Advanced settings, select Enable server-side encryption for source records in Firehose stream.
  8. For Existing IAM roles, choose the role that starts with WAF-Firehose-Iceberg-Stack-FirehoseIAMRole-*, created by the CloudFormation template.
  9. Choose Create Firehose stream.
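If you prefer to create the stream programmatically, the following boto3 sketch mirrors the console configuration above. The account ID, role ARN, and bucket ARN are placeholders, and the IcebergDestinationConfiguration field names reflect the Firehose API at the time of writing, so verify them against the current SDK documentation before use.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Placeholders: use the role and bucket created by the CloudFormation stack
role_arn = "arn:aws:iam::<account-id>:role/WAF-Firehose-Iceberg-Stack-FirehoseIAMRole-XXXX"
backup_bucket_arn = "arn:aws:s3:::<your-backup-bucket>"

firehose.create_delivery_stream(
    DeliveryStreamName="aws-waf-logs-firehose-iceberg-1",  # AWS WAF requires the aws-waf-logs- prefix
    DeliveryStreamType="DirectPut",
    IcebergDestinationConfiguration={
        "RoleARN": role_arn,
        "CatalogConfiguration": {
            "CatalogARN": "arn:aws:glue:us-east-1:<account-id>:catalog",
        },
        "DestinationTableConfigurationList": [
            {
                "DestinationDatabaseName": "waf_logs_db",
                "DestinationTableName": "firehose_waf_logs",
            }
        ],
        "S3Configuration": {
            "RoleARN": role_arn,
            "BucketARN": backup_bucket_arn,
            "ErrorOutputPrefix": "error/events-1/",
        },
    },
)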

Configure AWS WAF logging to the Firehose stream

Complete the following steps to configure AWS WAF logging to the Firehose stream (a boto3 alternative follows the list):

  1. On the AWS WAF console, choose Web ACLs in the navigation pane.
  2. Choose your web ACL.
  3. On the Logging and metrics tab, choose Enable.
  4. For Amazon Data Firehose stream, choose the stream aws-waf-logs-firehose-iceberg-1.
  5. Choose Save.
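The same configuration can be applied with the AWS SDK. The sketch below looks up the delivery stream ARN and attaches it to a web ACL; the web ACL ARN is a placeholder for your own ACL.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")
wafv2 = boto3.client("wafv2", region_name="us-east-1")

# Resolve the ARN of the Firehose stream created earlier
stream_arn = firehose.describe_delivery_stream(
    DeliveryStreamName="aws-waf-logs-firehose-iceberg-1"
)["DeliveryStreamDescription"]["DeliveryStreamARN"]

# Placeholder: replace with the ARN of your own web ACL
web_acl_arn = "arn:aws:wafv2:us-east-1:<account-id>:regional/webacl/<name>/<id>"

wafv2.put_logging_configuration(
    LoggingConfiguration={
        "ResourceArn": web_acl_arn,
        "LogDestinationConfigs": [stream_arn],
    }
)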

Question and analyze the logs

You can query the data you've written to your Apache Iceberg tables using different processing engines, such as Apache Spark, Apache Flink, or Trino. In this example, we use Athena to query the AWS WAF log data stored in Apache Iceberg tables. Complete the following steps:

  1. On the Athena console, choose Settings in the top right corner.
  2. For Location of query result, enter the athena/ prefix of the S3 bucket created by the CloudFormation template:

s3://<your-bucket-name>/athena/

  3. Enter the AWS account ID for Expected bucket owner and choose Save.
  4. In the query editor, in Tables and views, choose the options menu next to firehose_waf_logs and choose Preview Table.

You should now be able to see the AWS WAF logs in the Apache Iceberg tables by using Athena.

The following are some additional useful example queries (a sketch for running them programmatically follows the examples):

  • Identify potential attack sources by analyzing blocked IP addresses:
-- Top 10 blocked IP addresses
SELECT httpRequest.clientip, COUNT(*) as block_count
FROM waf_logs_db.firehose_waf_logs
WHERE action = 'BLOCK'
GROUP BY httpRequest.clientip
ORDER BY block_count DESC
LIMIT 10;

  • Monitor attack patterns and trends over time:
-- Rate of blocked requests over time
SELECT DATE_TRUNC('hour', FROM_UNIXTIME(timestamp/1000)) as hour,
       COUNT(*) as request_count
FROM waf_logs_db.firehose_waf_logs
WHERE action = 'BLOCK'
GROUP BY DATE_TRUNC('hour', FROM_UNIXTIME(timestamp/1000))
ORDER BY hour;
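To run queries like these on a schedule or from an application, you can submit them through the Athena API. The following sketch is a minimal example; the query result location is a placeholder for the athena/ prefix of the bucket created by the CloudFormation template.

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

QUERY = """
SELECT httpRequest.clientip, COUNT(*) AS block_count
FROM waf_logs_db.firehose_waf_logs
WHERE action = 'BLOCK'
GROUP BY httpRequest.clientip
ORDER BY block_count DESC
LIMIT 10
"""

# Submit the query; results land under the placeholder output location
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "waf_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://<your-bucket-name>/athena/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query completes, then print the result rows (skipping the header row)
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:
        print([col.get("VarCharValue") for col in row["Data"]])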

Apache Iceberg table optimization

Although Firehose enables efficient streaming of AWS WAF logs into Apache Iceberg tables, the nature of streaming writes can result in many small files being created, because Firehose delivers data based on its buffering configuration. This can lead to suboptimal query performance. To address this, regular table optimization is recommended.

There are two recommended table optimization approaches:

  • Compaction – Data compaction merges small data files to reduce storage usage and improve read performance. Data files are merged and rewritten to remove obsolete data and consolidate fragmented data into larger, more efficient files.
  • Storage optimization – You can manage storage overhead by removing older, unnecessary snapshots and their associated underlying files. This also includes periodically deleting orphan files to maintain efficient storage utilization and optimal query performance.

These optimizations can be performed using either the Data Catalog or Athena.

Table optimization using the Data Catalog

The Data Catalog provides automatic table optimization features. Within the table optimization feature, you can configure specific optimizers for compaction, snapshot retention, and orphan file deletion. The table optimization schedule can be managed and its status monitored from the AWS Glue console.
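As an illustration, the Data Catalog optimizers can also be enabled through the AWS Glue API. The sketch below turns on automatic compaction for the WAF logs table; the role ARN is a placeholder, and the request shape reflects the Glue create_table_optimizer API at the time of writing, so check the current SDK documentation before relying on it.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder role: needs permission to read and rewrite the table's data in Amazon S3
optimizer_role_arn = "arn:aws:iam::<account-id>:role/<glue-optimizer-role>"

# Enable automatic compaction for the Iceberg table in the Data Catalog
glue.create_table_optimizer(
    CatalogId="<account-id>",
    DatabaseName="waf_logs_db",
    TableName="firehose_waf_logs",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": optimizer_role_arn,
        "enabled": True,
    },
)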

Table optimization using Athena

Athena supports manual optimization through SQL commands. The OPTIMIZE command rewrites small files into larger files and applies file compaction:

OPTIMIZE waf_logs_db.firehose_waf_logs REWRITE DATA USING BIN_PACK 

The VACUUM command removes old snapshots and cleans up expired data files:

ALTER TABLE waf_logs_db.firehose_waf_logs SET TBLPROPERTIES (
  'vacuum_max_snapshot_age_seconds'='259200'
)

VACUUM waf_logs_db.firehose_waf_logs

You can monitor the table's optimization status using the following query:

SELECT * FROM "waf_logs_db"."firehose_waf_logs$recordsdata"

Clean up

To avoid future charges, complete the following steps:

  1. Empty the S3 bucket.
  2. Delete the CloudFormation stack.
  3. Delete the Firehose stream.
  4. Disable AWS WAF logging.

Conclusion

In this post, we demonstrated how to build an AWS WAF log analytics pipeline using Firehose to deliver AWS WAF logs to Apache Iceberg tables on Amazon S3. The solution handles large-scale AWS WAF log processing without requiring complex code or infrastructure management. Although this post focused on Apache Iceberg tables as the destination, Data Firehose also integrates seamlessly with Amazon S3 Tables. To optimize your tables for querying, Amazon S3 Tables continuously performs automatic maintenance operations, such as compaction, snapshot management, and unreferenced file removal. These operations enhance table performance by compacting smaller objects into fewer, larger files.

To get started with your own implementation, try the solution in your AWS account and explore the AWS documentation for Amazon Data Firehose and Apache Iceberg for more features and best practices.


About the Authors

Charishma Makineni is a Senior Technical Account Manager at AWS. She provides strategic technical guidance to independent software vendors (ISVs) to build and optimize solutions on AWS. She focuses on big data and analytics technologies, helping organizations optimize their data-driven initiatives on AWS.

Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.
