Organizations are rapidly expanding their digital presence, creating opportunities to serve customers better through web applications. AWS WAF logs play a vital role in this expansion by enabling organizations to proactively monitor security, enforce compliance, and strengthen application protection. AWS WAF log analysis is essential across many industries, including banking, retail, and healthcare, each of which needs to deliver secure digital experiences.
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They're using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead. Apache Iceberg combines enterprise reliability with SQL simplicity when working with security data stored in Amazon Simple Storage Service (Amazon S3), enabling organizations to focus on security insights rather than infrastructure management.
Apache Iceberg enhances security analytics through several key capabilities. It seamlessly integrates with various AWS services and analysis tools while supporting concurrent read-write operations for simultaneous log ingestion and analysis. Its time travel feature enables thorough security forensics and incident investigation, and its schema evolution support allows teams to adapt to emerging security patterns without disrupting existing workflows. These capabilities make Apache Iceberg an ideal choice for building robust security analytics solutions. However, organizations often struggle when building their own solutions to deliver data to Apache Iceberg tables. These challenges include managing complex extract, transform, and load (ETL) processes, handling schema validation, providing reliable delivery, and maintaining custom code for data transformations. Teams must also build resilient error handling, implement retry logic, and manage scaling infrastructure, all while maintaining data consistency and high availability. These challenges take valuable time away from analyzing security data and deriving insights.
To address these challenges, Amazon Data Firehose provides real-time data delivery to Apache Iceberg tables within seconds. Firehose delivers high reliability across multiple Availability Zones while automatically scaling to match throughput requirements. It's fully managed and requires no infrastructure management or custom code development. Firehose delivers streaming data with configurable buffering options that can be optimized for near-zero latency. It also provides built-in data transformation, compression, and encryption capabilities, along with automatic retry mechanisms for reliable data delivery. This makes it an ideal choice for streaming AWS WAF logs directly into a data lake while minimizing operational overhead.
In this post, we demonstrate how to build a scalable AWS WAF log analysis solution using Firehose and Apache Iceberg. Firehose simplifies the entire process, from log ingestion to storage, by allowing you to configure a delivery stream that delivers AWS WAF logs directly to Apache Iceberg tables in Amazon S3. The solution requires no infrastructure setup, and you pay only for the data you process.
Solution overview
To implement this solution, you first configure AWS WAF logging to capture web traffic information. This captures detailed information about the traffic analyzed by your web access control lists (ACLs). Each log entry includes the request timestamp, detailed request information, and the rule matches that were triggered. These logs are continuously streamed to Firehose in real time.
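For reference, each AWS WAF log entry is a JSON record similar to the following abbreviated, illustrative sketch (the values are placeholders, and actual records contain additional fields such as request headers, rule group details, and labels):

```json
{
  "timestamp": 1698814800000,
  "formatVersion": 1,
  "webaclId": "arn:aws:wafv2:us-east-1:123456789012:regional/webacl/example-web-acl/...",
  "terminatingRuleId": "RateLimitRule",
  "terminatingRuleType": "RATE_BASED",
  "action": "BLOCK",
  "httpSourceName": "ALB",
  "httpRequest": {
    "clientIp": "198.51.100.23",
    "country": "US",
    "uri": "/login",
    "httpMethod": "POST"
  }
}
```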
Firehose writes these logs into an Apache Iceberg table stored in Amazon S3. When Firehose delivers data to the table in Amazon S3, it uses the AWS Glue Data Catalog to store and manage table metadata. This metadata includes schema information, partition details, and file locations, enabling seamless data discovery and querying across AWS analytics services.
Finally, security teams can analyze data in the Apache Iceberg tables using various AWS services, including Amazon Redshift, Amazon Athena, Amazon EMR, and Amazon SageMaker. For this demonstration, we use Athena to run SQL queries against the security logs.
The following diagram illustrates the solution architecture.
The implementation consists of four steps:
- Deploy the base infrastructure using AWS CloudFormation.
- Create an Apache Iceberg table using an AWS Glue notebook.
- Create a Firehose stream to handle the log data.
- Configure AWS WAF logging to send data to the Apache Iceberg table through the Firehose stream.
You can deploy the required resources into your AWS environment in the US East (N. Virginia) AWS Region using a CloudFormation template. This template creates an S3 bucket for storing AWS WAF logs, an AWS Glue database for the Apache Iceberg tables, and the AWS Identity and Access Management (IAM) roles and policies needed for the solution.
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account with access to the US East (N. Virginia) Region
- AWS WAF configured with a web ACL in the US East (N. Virginia) Region
If you don't have AWS WAF set up, refer to the AWS WAF Workshop to create a sample web application protected by AWS WAF.
AWS WAF logs use case-sensitive field names (like httpRequest and webaclId). For successful log ingestion, this solution uses the Apache Iceberg API through an AWS Glue job to create the tables, a reliable approach that preserves the exact field names from the AWS WAF logs. Although AWS Glue crawlers and Athena DDL statements offer convenient ways to create Apache Iceberg tables, they convert mixed-case column names to lowercase, which can affect AWS WAF log processing. By using an AWS Glue job with the Apache Iceberg API, the case-sensitivity of column names is preserved, providing accurate mapping between AWS WAF log fields and table columns.
Deploy the CloudFormation stack
Complete the following steps to deploy the solution resources with AWS CloudFormation:
- Sign in to the AWS CloudFormation console.
- Choose Launch Stack.
- Choose Next.
- For Stack name, leave the default WAF-Firehose-Iceberg-Stack.
- Under Parameters, specify whether AWS Lake Formation permissions are to be used for the AWS Glue tables.
- Choose Next.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Next.
- Review the deployment and choose Submit.
The stack takes a few minutes to deploy. After the deployment is complete, you can review the resources it created by navigating to the Resources tab of the CloudFormation stack.
Create an Apache Iceberg table
Before setting up the Firehose delivery stream, you must create the destination Apache Iceberg table in the Data Catalog. As discussed earlier, this is done using an AWS Glue job and the Apache Iceberg API. Complete the following steps to create an Apache Iceberg table:
- On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
- Choose the Notebook option under Create job.
- Under Options, select Start fresh.
- For IAM role, choose WAF-Firehose-Iceberg-Stack-GlueServiceRole-*.
- Choose Create notebook.
- Enter a configuration command in the notebook to configure the Spark session with the Apache Iceberg extensions (a sample configuration cell is shown after this list). Make sure to update the configuration for sql.catalog.glue_catalog.warehouse to point to the S3 bucket created by the CloudFormation template.
- Run a CREATE TABLE statement in the AWS Glue notebook to create the Apache Iceberg table (a sample statement is also shown after this list).
- Navigate to the Data Catalog and the waf_logs_db database to confirm that the table firehose_waf_logs has been created.
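The following is a minimal sketch of the two notebook cells referenced above, assuming an AWS Glue interactive (PySpark) notebook session. The bucket placeholder and the column list are illustrative; the schema shown is an abbreviated subset of the AWS WAF log fields, so extend it to cover the fields you want to retain.

```python
%%configure
{
  "--datalake-formats": "iceberg",
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=s3://<waf-logs-bucket>/iceberg/"
}
```

```python
# Start the Glue Spark session and create the destination Iceberg table.
# Creating the table through the Apache Iceberg API preserves the mixed-case
# AWS WAF field names (httpRequest, webaclId, and so on).
from pyspark.context import SparkContext
from awsglue.context import GlueContext

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# Abbreviated subset of the AWS WAF log schema -- extend as needed.
spark.sql("""
CREATE TABLE IF NOT EXISTS glue_catalog.waf_logs_db.firehose_waf_logs (
    `timestamp` bigint,
    formatVersion int,
    webaclId string,
    terminatingRuleId string,
    terminatingRuleType string,
    action string,
    httpSourceName string,
    httpSourceId string,
    httpRequest struct<
        clientIp: string,
        country: string,
        uri: string,
        args: string,
        httpVersion: string,
        httpMethod: string,
        requestId: string>
)
USING iceberg
""")
```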
Create a Firehose stream
Complete the following steps to create a Firehose stream:
- On the Amazon Data Firehose console, choose Create Firehose stream.
- Choose Direct PUT for Source and Apache Iceberg Tables for Destination.
- For Firehose stream name, enter aws-waf-logs-firehose-iceberg-1.
- In the Destination settings section, enable Inline parsing for routing information. Because we're sending all records to one table, specify the destination database and table names:
- For Database expression, enter "waf_logs_db".
- For Table expression, enter "firehose_waf_logs".
Make sure to include the double quotation marks so that the literal values are used for the database and table names. If you don't use double quotation marks, Firehose assumes the value is a JSON Query expression, attempts to parse it when processing your stream, and fails. Firehose can also route to different Apache Iceberg tables based on the content of the data. For more information, refer to Route incoming records to different Iceberg tables.
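For example, the expressions used in this post are quoted literals, whereas an unquoted JSON Query (jq-style) expression derives the destination from each record. The dynamic example below is purely illustrative and assumes you want to route records to a table named after the httpSourceName field:

```
# Literal destinations (used in this post) -- note the double quotation marks
Database expression: "waf_logs_db"
Table expression:    "firehose_waf_logs"

# Illustrative dynamic routing based on a field in each record
Table expression:    .httpSourceName
```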
- For S3 backup bucket, enter the S3 bucket created by the CloudFormation template.
- For S3 backup bucket error output prefix, enter error/events-1/.
- Under Advanced settings, select Enable server-side encryption for source records in Firehose stream.
- For Existing IAM roles, choose the role that begins with WAF-Firehose-Iceberg-stack-FirehoseIAMRole-*, created by the CloudFormation template.
- Choose Create Firehose stream.
Configure AWS WAF logging to the Firehose stream
Complete the following steps to configure AWS WAF to send logs to the Firehose stream:
- On the AWS WAF console, choose Web ACLs in the navigation pane.
- Choose your web ACL.
- On the Logging and metrics tab, choose Enable.
- For Amazon Data Firehose stream, choose the stream aws-waf-logs-firehose-iceberg-1.
- Choose Save.
Query and analyze the logs
You can query the data you've written to your Apache Iceberg tables using different processing engines, such as Apache Spark, Apache Flink, or Trino. In this example, we use Athena to query the AWS WAF log data stored in Apache Iceberg tables. Complete the following steps:
- On the Athena console, choose Settings in the top right corner.
- For Location of query result, enter the S3 bucket created by the CloudFormation template (the location begins with s3://).
- Enter the AWS account ID for Expected bucket owner and choose Save.
- In the query editor, under Tables and views, choose the options menu next to firehose_waf_logs and choose Preview Table.
You should now be able to see the AWS WAF logs in the Apache Iceberg tables by using Athena.
The following are some additional useful example queries (sketches of both are shown after this list):
- Identify potential attack sources by analyzing blocked IP addresses
- Monitor attack patterns and trends over time
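The sketches below are minimal, assumed versions of those two queries. The column names (action, httpRequest.clientIp, timestamp) follow the AWS WAF log schema and the table definition used earlier, and timestamp is assumed to be epoch milliseconds; adjust them if your schema differs.

```sql
-- Top blocked client IP addresses
SELECT httpRequest.clientIp AS client_ip,
       count(*) AS blocked_requests
FROM waf_logs_db.firehose_waf_logs
WHERE action = 'BLOCK'
GROUP BY httpRequest.clientIp
ORDER BY blocked_requests DESC
LIMIT 10;

-- Hourly request counts by action to track attack trends over time
SELECT date_trunc('hour', from_unixtime("timestamp" / 1000)) AS request_hour,
       action,
       count(*) AS request_count
FROM waf_logs_db.firehose_waf_logs
GROUP BY date_trunc('hour', from_unixtime("timestamp" / 1000)), action
ORDER BY request_hour, action;
```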
Apache Iceberg table optimization
Although Firehose enables efficient streaming of AWS WAF logs into Apache Iceberg tables, the nature of streaming writes can result in many small files being created, because Firehose delivers data based on its buffering configuration. This can lead to suboptimal query performance, so regular table optimization is recommended.
There are two recommended table optimization approaches:
- Compaction – Data compaction merges small data files to reduce storage usage and improve read performance. Data files are merged and rewritten to remove obsolete data and consolidate fragmented data into larger, more efficient files.
- Storage optimization – You can manage storage overhead by removing older, unnecessary snapshots and their associated underlying files. This also includes periodically deleting orphan files to maintain efficient storage utilization and optimal query performance.
These optimizations can be performed using either the Data Catalog or Athena.
Table optimization using the Data Catalog
The Data Catalog provides automated table optimization features. Within the table optimization feature, you can configure specific optimizers for compaction, snapshot retention, and orphan file deletion. The table optimization schedule can be managed, and its status monitored, from the AWS Glue console.
Table optimization using Athena
Athena supports manual optimization through SQL commands. The OPTIMIZE command rewrites small files into larger files and applies file compaction:
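For the table in this post, the command looks like the following minimal sketch:

```sql
OPTIMIZE waf_logs_db.firehose_waf_logs REWRITE DATA USING BIN_PACK;
```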
The VACUUM command removes old snapshots and cleans up expired data files:
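A minimal sketch for this table:

```sql
VACUUM waf_logs_db.firehose_waf_logs;
```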
You can monitor the table's optimization status using a query such as the following:
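One way to do this, assuming Athena's Iceberg metadata tables, is to inspect the $files metadata and track the number and average size of data files:

```sql
-- Fewer, larger files after compaction indicate effective optimization
SELECT count(*) AS data_files,
       round(avg(file_size_in_bytes) / 1024.0 / 1024.0, 2) AS avg_file_size_mb
FROM "waf_logs_db"."firehose_waf_logs$files";
```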
Clean up
To avoid future charges, complete the following steps:
- Empty the S3 bucket.
- Delete the CloudFormation stack.
- Delete the Firehose stream.
- Disable AWS WAF logging.
Conclusion
In this post, we demonstrated how to build an AWS WAF log analytics pipeline using Firehose to deliver AWS WAF logs to Apache Iceberg tables on Amazon S3. The solution handles large-scale AWS WAF log processing without requiring complex code or infrastructure management. Although this post focused on Apache Iceberg tables as the destination, Amazon Data Firehose also integrates seamlessly with Amazon S3 Tables. To optimize your tables for querying, Amazon S3 Tables continually performs automatic maintenance operations, such as compaction, snapshot management, and unreferenced file removal. These operations improve table performance by compacting smaller objects into fewer, larger files.
To get started with your own implementation, try the solution in your AWS account and explore the following resources for additional features and best practices:
About the Authors
Charishma Makineni is a Senior Technical Account Manager at AWS. She provides strategic technical guidance to Independent Software Vendors (ISVs) to build and optimize solutions on AWS. She focuses on Big Data and Analytics technologies, helping organizations optimize their data-driven initiatives on AWS.
Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.