Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion



In this post, we show how to use Amazon Kinesis Data Streams to buffer and aggregate real-time streaming data for delivery into Amazon OpenSearch Service domains and collections using Amazon OpenSearch Ingestion. You can use this approach for a variety of use cases, from real-time log analytics to integrating application messaging data for real-time search. In this post, we focus on the use case of centralizing log aggregation for an organization that has a compliance need to archive and retain its log data.

Kinesis Data Streams is a fully managed, serverless data streaming service that stores and ingests various streaming data in real time at any scale. For log analytics use cases, Kinesis Data Streams enhances log aggregation by decoupling producer and consumer applications, and providing a resilient, scalable buffer to capture and serve log data. This decoupling provides advantages over traditional architectures. As log producers scale up and down, Kinesis Data Streams can be scaled dynamically to consistently buffer log data. This prevents load changes from impacting an OpenSearch Service domain, and provides a resilient store of log data for consumption. It also allows multiple consumers to process log data in real time, providing a persistent store of real-time data for applications to consume. This allows the log analytics pipeline to meet Well-Architected best practices for resilience (REL04-BP02) and cost (COST09-BP02).

OpenSearch Ingestion is a serverless pipeline that provides powerful tools for extracting, transforming, and loading data into an OpenSearch Service domain. OpenSearch Ingestion integrates with many AWS services, and provides ready-made blueprints to accelerate ingesting data for a variety of analytics use cases into OpenSearch Service domains. When paired with Kinesis Data Streams, OpenSearch Ingestion enables sophisticated real-time analytics of data, and helps reduce the undifferentiated heavy lifting of creating a real-time search and analytics architecture.

Solution overview

In this solution, we consider a typical use case for centralized log aggregation for an organization. Organizations might adopt a centralized log aggregation approach for a variety of reasons. Many organizations have compliance and governance requirements that stipulate what data must be logged, and how long log data must be retained and remain searchable for investigations. Other organizations seek to consolidate application and security operations, and provide common observability toolsets and capabilities across their teams.

To meet such requirements, you need to collect data from log sources (producers) in a scalable, resilient, and cost-effective manner. Log sources can vary between application and infrastructure use cases and configurations, as illustrated in the following table.

Log Producer         Example Producer     Example Log Configuration
Application Logs     AWS Lambda           Amazon CloudWatch Logs
Application Agents   Fluent Bit           Amazon OpenSearch Ingestion
AWS Service Logs     AWS WAF              Amazon S3

The following diagram illustrates an example architecture.

You can use Kinesis Data Streams for a variety of these use cases. You can configure Amazon CloudWatch Logs to send data to Kinesis Data Streams using a subscription filter (see Real-time processing of log data with subscriptions). If you send data with Kinesis Data Streams for analytics use cases, you can use OpenSearch Ingestion to create a scalable, extensible pipeline to consume your streaming data and write it to OpenSearch Service indexes. Kinesis Data Streams provides a buffer that can support multiple consumers, configurable retention, and built-in integration with a variety of AWS services. For other use cases where data is stored in Amazon Simple Storage Service (Amazon S3), or where an agent such as Fluent Bit writes data, the agent can write data directly to OpenSearch Ingestion without an intermediate buffer, thanks to OpenSearch Ingestion's built-in persistent buffers and automatic scaling.

Standardizing logging approaches reduces development and operational overhead for organizations. For example, you might standardize on all applications logging to CloudWatch Logs when feasible, and also handle Amazon S3 logs where CloudWatch Logs is unsupported. This reduces the number of use cases that a centralized team needs to handle in its log aggregation approach, and reduces the complexity of the log aggregation solution. For more sophisticated development teams, you might standardize on using Fluent Bit agents to write data directly to OpenSearch Ingestion to lower cost when log data doesn't need to be stored in CloudWatch.

This solution focuses on using CloudWatch Logs as a data source for log aggregation. For the Amazon S3 log use case, see Using an OpenSearch Ingestion pipeline with Amazon S3. For agent-based solutions, see the agent-specific documentation for integration with OpenSearch Ingestion, such as Using an OpenSearch Ingestion pipeline with Fluent Bit.

Prerequisites

Several key pieces of infrastructure are required in this solution to ingest data into OpenSearch Service with OpenSearch Ingestion:

  • A Kinesis data stream to aggregate the log data from CloudWatch
  • An OpenSearch Service domain to store the log data

When creating the Kinesis data stream, we recommend starting with On-Demand mode. This allows Kinesis Data Streams to automatically scale the number of shards needed for your log throughput. After you identify the steady-state workload for your log aggregation use case, we recommend moving to Provisioned mode, using the number of shards identified in On-Demand mode. This can help you optimize long-term cost for high-throughput use cases.

In general, we recommend using one Kinesis data stream for your log aggregation workload. OpenSearch Ingestion supports up to 96 OCUs per pipeline, and 24,000 characters per pipeline definition file (see OpenSearch Ingestion quotas). This means each pipeline can support a Kinesis data stream with up to 96 shards, because each OCU processes one shard. Using one Kinesis data stream simplifies the overall process of aggregating log data into OpenSearch Service, and simplifies the process of creating and managing subscription filters for log groups.

Depending on the scale of your log workloads and the complexity of your OpenSearch Ingestion pipeline logic, you may consider additional Kinesis data streams for your use case. For example, you might consider one stream for each major log type in your production workload. Having log data for different use cases separated into different streams can help reduce the operational complexity of managing OpenSearch Ingestion pipelines, and allows you to scale and deploy changes to each log use case independently when required.

To create a Kinesis data stream, see Create a data stream.
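
If you prefer the AWS CLI, the following is a minimal sketch that creates an On-Demand stream and later switches it to Provisioned mode; the stream name, ARN, and shard count are placeholders for illustration.

  # Create the stream in On-Demand mode so Kinesis sizes shards automatically
  aws kinesis create-stream \
      --stream-name central-logs \
      --stream-mode-details StreamMode=ON_DEMAND

  # Later, once steady-state throughput is known, switch to Provisioned mode
  # and set the shard count you observed while running On-Demand
  aws kinesis update-stream-mode \
      --stream-arn arn:aws:kinesis:us-east-1:123456789012:stream/central-logs \
      --stream-mode-details StreamMode=PROVISIONED
  aws kinesis update-shard-count \
      --stream-name central-logs \
      --target-shard-count 8 \
      --scaling-type UNIFORM_SCALING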

To create an OpenSearch Service domain, see Creating and managing Amazon OpenSearch Service domains.

Configure log subscription filters

You can implement CloudWatch log group subscription filters at the account level or the log group level. In both cases, we recommend creating the subscription filter with a random distribution method to make sure log data is evenly distributed across Kinesis data stream shards.

Account-level subscription filters are applied to all log groups in an account, and can be used to subscribe all log data to a single destination. This works well if you want to store all of your log data in OpenSearch Service using Kinesis Data Streams. There is a limit of one account-level subscription filter per account. Using Kinesis Data Streams as the destination also allows you to have multiple log consumers process the account log data when relevant. To create an account-level subscription filter, see Account-level subscription filters.
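
As a hedged CLI sketch, an account-level subscription filter policy might be created as follows; the policy name, stream ARN, and IAM role (which must trust logs.amazonaws.com and allow kinesis:PutRecord on the stream) are placeholders.

  aws logs put-account-policy \
      --policy-name central-logs-to-kinesis \
      --policy-type SUBSCRIPTION_FILTER_POLICY \
      --policy-document '{
          "DestinationArn": "arn:aws:kinesis:us-east-1:123456789012:stream/central-logs",
          "RoleArn": "arn:aws:iam::123456789012:role/CWLtoKinesisRole",
          "FilterPattern": "",
          "Distribution": "Random"
      }' \
      --scope ALL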

Log group-level subscription filters are applied to each log group. This approach works well if you want to store a subset of your log data in OpenSearch Service using Kinesis Data Streams, or if you want to use multiple different data streams to store and process multiple log types. There is a limit of two log group-level subscription filters per log group. To create a log group-level subscription filter, see Log group-level subscription filters.
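
A minimal sketch of a log group-level subscription filter with random distribution, assuming the same placeholder stream and IAM role as above:

  aws logs put-subscription-filter \
      --log-group-name /aws/lambda/my-function \
      --filter-name lambda-logs-to-kinesis \
      --filter-pattern "" \
      --destination-arn arn:aws:kinesis:us-east-1:123456789012:stream/central-logs \
      --role-arn arn:aws:iam::123456789012:role/CWLtoKinesisRole \
      --distribution Random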

After you create your subscription filter, verify that log data is being sent to your Kinesis data stream. On the Kinesis Data Streams console, choose the link for your stream name.

Choose a shard with Starting position set to Trim horizon, and choose Get records.

You should see records with a unique Partition key column value and a binary Data column. This is because CloudWatch sends records in .gzip format to compress log data.
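
You can also verify from the command line by fetching a record and decompressing it; this sketch assumes the placeholder stream name used earlier and a Linux-style base64 command.

  SHARD_ITERATOR=$(aws kinesis get-shard-iterator \
      --stream-name central-logs \
      --shard-id shardId-000000000000 \
      --shard-iterator-type TRIM_HORIZON \
      --query ShardIterator --output text)

  # The Data field is base64-encoded gzip; decode and decompress to see the
  # CloudWatch subscription filter payload (owner, logGroup, logStream, logEvents)
  aws kinesis get-records --shard-iterator "$SHARD_ITERATOR" --limit 1 \
      --query 'Records[0].Data' --output text | base64 --decode | gunzip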

Configure an OpenSearch Ingestion pipeline

Now that you’ve a Kinesis information stream and CloudWatch subscription filters to ship information to the information stream, you’ll be able to configure your OpenSearch Ingestion pipeline to course of your log information. To start, you create an AWS Identification and Entry Administration (IAM) function that enables learn entry to the Kinesis information stream and skim/write entry to the OpenSearch area. To create your pipeline, your supervisor function that’s used to create the pipeline would require iam:PassRole permissions to the pipeline function created on this step.

  1. Create an IAM role with the following permissions to read from your Kinesis data stream and access your OpenSearch domain:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "allowReadFromStream",
                "Effect": "Allow",
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:DescribeStreamConsumer",
                    "kinesis:DescribeStreamSummary",
                    "kinesis:GetRecords",
                    "kinesis:GetShardIterator",
                    "kinesis:ListShards",
                    "kinesis:ListStreams",
                    "kinesis:ListStreamConsumers",
                    "kinesis:RegisterStreamConsumer",
                    "kinesis:SubscribeToShard"
                ],
                "Resource": [
                    "arn:aws:kinesis:{{region}}:{{account-id}}:stream/{{stream-name}}"
                ]
            },
            {
                "Sid": "allowAccessToOS",
                "Effect": "Allow",
                "Action": [
                    "es:DescribeDomain",
                    "es:ESHttp*"
                ],
                "Resource": [
                    "arn:aws:es:{region}:{account-id}:domain/{domain-name}",
                    "arn:aws:es:{region}:{account-id}:domain/{domain-name}/*"
                ]
            }
        ]
    }

  2. Give your role a trust policy that allows access from osis-pipelines.amazonaws.com:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "osis-pipelines.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {
                        "aws:SourceAccount": "{account-id}"
                    },
                    "ArnLike": {
                        "aws:SourceArn": "arn:aws:osis:{region}:{account-id}:pipeline/*"
                    }
                }
            }
        ]
    }
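
If you're scripting the setup, the role can be created from the two documents above roughly as follows; the role name and local file names are placeholders for illustration.

  # Create the pipeline role with the trust policy, then attach the permissions
  # policy (both saved locally from the JSON documents above)
  aws iam create-role \
      --role-name OSIPipelineRole \
      --assume-role-policy-document file://pipeline-trust-policy.json
  aws iam put-role-policy \
      --role-name OSIPipelineRole \
      --policy-name osi-kinesis-opensearch-access \
      --policy-document file://pipeline-permissions.json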

For a pipeline to write data to a domain, the domain must have a domain-level access policy that allows the pipeline role to access it. If your domain uses fine-grained access control, the IAM role must also be mapped to a backend role in the OpenSearch Service security plugin that allows it to create and write to indexes.
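
For example, with fine-grained access control enabled, you could map the pipeline role to a backend role through the security plugin REST API. The following sketch maps it to the built-in all_access role; the endpoint, master-user credentials, and role ARN are placeholders, and in practice you would usually scope a narrower role.

  # Note: PUT replaces any existing mapping for the target role; use PATCH to append.
  curl -s -u 'master-user:master-password' \
      -X PUT "https://OPENSEARCH_ENDPOINT/_plugins/_security/api/rolesmapping/all_access" \
      -H 'Content-Type: application/json' \
      -d '{
            "backend_roles": ["arn:aws:iam::123456789012:role/OSIPipelineRole"]
          }'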

  1. After you create your pipeline role, on the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create pipeline.
  3. Search for Kinesis in the blueprints, select the Kinesis Data Streams blueprint, and choose Select blueprint.
  4. Under Pipeline settings, enter a name for your pipeline, and set Max capacity for the pipeline to be equal to the number of shards in your Kinesis data stream.

For those who’re utilizing On-Demand mode for the information stream, select a capability equal to the present variety of shards within the stream. This use case doesn’t require a persistent buffer, as a result of Kinesis Knowledge Streams offers a persistent buffer for the log information, and OpenSearch Ingestion tracks its place within the Kinesis information stream over time, stopping information loss on restarts.

  1. Under Pipeline configuration, update the pipeline source settings to use your Kinesis data stream name and pipeline IAM role Amazon Resource Name (ARN).

For full configuration information, see the OpenSearch Ingestion documentation for the Kinesis Data Streams source. For most configurations, you can use the default values. By default, the pipeline writes batches of 100 documents every 1 second, and subscribes to the Kinesis data stream from the latest position in the stream using enhanced fan-out, checkpointing its position in the stream every 2 minutes. You can modify this behavior to tune how frequently the consumer checkpoints, where it starts in the stream, and whether to use polling to reduce costs compared with enhanced fan-out.

  source:
    kinesis-data-streams:
      acknowledgments: true
      codec:
        # The JSON codec supports parsing nested CloudWatch events into
        # individual log entries that will be written as documents to
        # OpenSearch
        json:
          key_name: "logEvents"
          # These keys contain the metadata sent by CloudWatch subscription filters
          # along with the individual log events:
          # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#DestinationKinesisExample
          include_keys: ['owner', 'logGroup', 'logStream' ]
      streams:
        # Update to use the Kinesis data stream name used in your subscription filters:
        - stream_name: "KINESIS_STREAM_NAME"
          # Customize the initial position if you don't want OSI to consume the entire stream:
          initial_position: "EARLIEST"
          # Compression will always be gzip for CloudWatch, but will vary for other sources:
          compression: "gzip"
      aws:
        # Provide the role ARN with access to KDS. This role should have a trust relationship with osis-pipelines.amazonaws.com
        # This must be the same role used below in the sink configuration.
        sts_role_arn: "PIPELINE_ROLE_ARN"
        # Provide the Region of the data stream.
        region: "REGION"

  1. Update the pipeline sink settings to include your OpenSearch domain endpoint URL and pipeline IAM role ARN.

The IAM role ARN must be the same for both the OpenSearch Service sink definition and the Kinesis Data Streams source definition. You can control which indexes data gets written to using the index definition in the sink. For example, you can use metadata about the Kinesis data stream name to index by data stream (${getMetadata("kinesis_stream_name")}), or you can use document fields to index data depending on the CloudWatch log group or other document data (${path/to/field/in/document}). In this example, we use three document-level fields (data_stream.type, data_stream.dataset, and data_stream.namespace) to index our documents, and create these fields in our pipeline processor logic in the next section:

  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "OPENSEARCH_ENDPOINT" ]
        # Route log data to different target indexes depending on the log context:
        index: "ss4o_${data_stream/type}-${data_stream/dataset}-${data_stream/namespace}"
        aws:
          # Provide a role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          # This role must be the same as the role used above for Kinesis.
          sts_role_arn: "PIPELINE_ROLE_ARN"
          # Provide the Region of the domain.
          region: "REGION"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: false

Lastly, you’ll be able to replace the pipeline configuration to incorporate processor definitions to rework your log information earlier than writing paperwork to the OpenSearch area. For instance, this use case adopts Easy Schema for Observability (SS4O) and makes use of the OpenSearch Ingestion pipeline to create the specified schema for SS4O. This consists of including frequent fields to affiliate metadata with the listed paperwork, in addition to parsing the log information to make information extra searchable. This use case additionally makes use of the log group title to establish totally different log varieties as datasets, and makes use of this data to jot down paperwork to totally different indexes relying on their use circumstances.

  1. Rename the CloudWatch event timestamp to mark the observed timestamp when the log was generated using the rename_keys processor, and add the current timestamp as the processed timestamp when OpenSearch Ingestion handled the record using the date processor:
      #  Processor logic is used to change how log data is parsed for OpenSearch.
      processor:
        - rename_keys:
            entries:
            # Include the CloudWatch timestamp as the observed timestamp - the time the log
            # was generated and sent to CloudWatch:
            - from_key: "timestamp"
              to_key: "observed_timestamp"
        - date:
            # Include the current timestamp at which OSI processed the log event:
            from_time_received: true
            destination: "processed_timestamp"

  2. Use the add_entries processor to include metadata about the processed document, including the log group, log stream, account ID, AWS Region, Kinesis data stream information, and dataset metadata:
        - add_entries:
            entries:
            # Support SS4O common log fields (https://opensearch.org/docs/latest/observing-your-data/ss4o/)
            - key: "cloud/provider"
              value: "aws"
            - key: "cloud/account/id"
              format: "${owner}"
            - key: "cloud/region"
              value: "us-west-2"
            - key: "aws/cloudwatch/log_group"
              format: "${logGroup}"
            - key: "aws/cloudwatch/log_stream"
              format: "${logStream}"
            # Include default values for the data_stream:
            - key: "data_stream/namespace"
              value: "default"
            - key: "data_stream/type"
              value: "logs"
            - key: "data_stream/dataset"
              value: "general"
            # Include metadata about the source Kinesis message that contained this log event:
            - key: "aws/kinesis/stream_name"
              value_expression: 'getMetadata("stream_name")'
            - key: "aws/kinesis/partition_key"
              value_expression: 'getMetadata("partition_key")'
            - key: "aws/kinesis/sequence_number"
              value_expression: 'getMetadata("sequence_number")'
            - key: "aws/kinesis/sub_sequence_number"
              value_expression: 'getMetadata("sub_sequence_number")'

  3. Use conditional expression syntax to update the data_stream.dataset fields depending on the log source, to control which index the document is written to, and use the delete_entries processor to delete the original CloudWatch document fields that were renamed:
        - add_entries:
            entries:
            # Update the data_stream fields based on the log event context - in this case
            # classifying the log events by their source (CloudTrail or Lambda).
            # Additional logic could be added to classify the logs by business or application context:
            - key: "data_stream/dataset"
              value: "cloudtrail"
              add_when: 'contains(/logGroup, "cloudtrail") or contains(/logGroup, "CloudTrail")'
              overwrite_if_key_exists: true
            - key: "data_stream/dataset"
              value: "lambda"
              add_when: 'contains(/logGroup, "/aws/lambda/")'
              overwrite_if_key_exists: true
            - key: "data_stream/dataset"
              value: "apache"
              add_when: 'contains(/logGroup, "/apache/")'
              overwrite_if_key_exists: true
        # Remove the default CloudWatch fields, as we re-mapped them to SS4O fields:
        - delete_entries:
            with_keys:
              - "logGroup"
              - "logStream"
              - "owner"

  4. Parse the log message fields to allow structured and JSON data to be more searchable in the OpenSearch indexes using the grok and parse_json processors.

Grok processors use pattern matching to parse data from structured text fields. For examples of built-in Grok patterns, see the java-grok patterns and the Data Prepper grok patterns.

    # Use the Grok parser to parse non-JSON Apache logs
    - grok:
        grok_when: '/data_stream/dataset == "apache"'
        match:
          message: ['%{COMMONAPACHELOG_DATATYPED}']
        target_key: "http"
    # Attempt to parse the log data as JSON to support field-level searches in the OpenSearch index:
    - parse_json:
        # Parse the root message object into aws.cloudtrail to match the SS4O standard for SS4O logs
        source: "message"
        destination: "aws/cloudtrail"
        parse_when: '/data_stream/dataset == "cloudtrail"'
        tags_on_failure: ["json_parse_fail"]
    - parse_json:
        # Parse the root message object as JSON when possible for Lambda function logs - could also set up Grok support
        # for Lambda function logs to capture non-JSON logging function data as searchable fields
        source: "message"
        destination: "aws/lambda"
        parse_when: '/data_stream/dataset == "lambda"'
        tags_on_failure: ["json_parse_fail"]
    - parse_json:
        # Parse the root message object as JSON when possible for general logs
        source: "message"
        destination: "body"
        parse_when: '/data_stream/dataset == "general"'
        tags_on_failure: ["json_parse_fail"]

When it's all put together, your pipeline configuration will look like the following code:

version: "2"
kinesis-pipeline:
  source:
    kinesis-data-streams:
      acknowledgments: true
      codec:
        # The JSON codec supports parsing nested CloudWatch events into
        # individual log entries that will be written as documents to
        # OpenSearch
        json:
          key_name: "logEvents"
          # These keys contain the metadata sent by CloudWatch subscription filters
          # along with the individual log events:
          # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#DestinationKinesisExample
          include_keys: ['owner', 'logGroup', 'logStream' ]
      streams:
        # Update to use the Kinesis data stream name used in your subscription filters:
        - stream_name: "KINESIS_STREAM_NAME"
          # Customize the initial position if you don't want OSI to consume the entire stream:
          initial_position: "EARLIEST"
          # Compression will always be gzip for CloudWatch, but will vary for other sources:
          compression: "gzip"
      aws:
        # Provide the role ARN with access to KDS. This role should have a trust relationship with osis-pipelines.amazonaws.com
        # This must be the same role used below in the sink configuration.
        sts_role_arn: "PIPELINE_ROLE_ARN"
        # Provide the Region of the data stream.
        region: "REGION"

  #  Processor logic is used to change how log data is parsed for OpenSearch.
  processor:
    - rename_keys:
        entries:
        # Include the CloudWatch timestamp as the observed timestamp - the time the log
        # was generated and sent to CloudWatch:
        - from_key: "timestamp"
          to_key: "observed_timestamp"
    - date:
        # Include the current timestamp at which OSI processed the log event:
        from_time_received: true
        destination: "processed_timestamp"
    - add_entries:
        entries:
        # Support SS4O common log fields (https://opensearch.org/docs/latest/observing-your-data/ss4o/)
        - key: "cloud/provider"
          value: "aws"
        - key: "cloud/account/id"
          format: "${owner}"
        - key: "cloud/region"
          value: "us-west-2"
        - key: "aws/cloudwatch/log_group"
          format: "${logGroup}"
        - key: "aws/cloudwatch/log_stream"
          format: "${logStream}"
        # Include default values for the data_stream:
        - key: "data_stream/namespace"
          value: "default"
        - key: "data_stream/type"
          value: "logs"
        - key: "data_stream/dataset"
          value: "general"
        # Include metadata about the source Kinesis message that contained this log event:
        - key: "aws/kinesis/stream_name"
          value_expression: 'getMetadata("stream_name")'
        - key: "aws/kinesis/partition_key"
          value_expression: 'getMetadata("partition_key")'
        - key: "aws/kinesis/sequence_number"
          value_expression: 'getMetadata("sequence_number")'
        - key: "aws/kinesis/sub_sequence_number"
          value_expression: 'getMetadata("sub_sequence_number")'
    - add_entries:
        entries:
        # Update the data_stream fields based on the log event context - in this case
        # classifying the log events by their source (CloudTrail or Lambda).
        # Additional logic could be added to classify the logs by business or application context:
        - key: "data_stream/dataset"
          value: "cloudtrail"
          add_when: 'contains(/logGroup, "cloudtrail") or contains(/logGroup, "CloudTrail")'
          overwrite_if_key_exists: true
        - key: "data_stream/dataset"
          value: "lambda"
          add_when: 'contains(/logGroup, "/aws/lambda/")'
          overwrite_if_key_exists: true
        - key: "data_stream/dataset"
          value: "apache"
          add_when: 'contains(/logGroup, "/apache/")'
          overwrite_if_key_exists: true
    # Remove the default CloudWatch fields, as we re-mapped them to SS4O fields:
    - delete_entries:
        with_keys:
          - "logGroup"
          - "logStream"
          - "owner"
    # Use the Grok parser to parse non-JSON Apache logs
    - grok:
        grok_when: '/data_stream/dataset == "apache"'
        match:
          message: ['%{COMMONAPACHELOG_DATATYPED}']
        target_key: "http"
    # Attempt to parse the log data as JSON to support field-level searches in the OpenSearch index:
    - parse_json:
        # Parse the root message object into aws.cloudtrail to match the SS4O standard for SS4O logs
        source: "message"
        destination: "aws/cloudtrail"
        parse_when: '/data_stream/dataset == "cloudtrail"'
        tags_on_failure: ["json_parse_fail"]
    - parse_json:
        # Parse the root message object as JSON when possible for Lambda function logs - could also set up Grok support
        # for Lambda function logs to capture non-JSON logging function data as searchable fields
        source: "message"
        destination: "aws/lambda"
        parse_when: '/data_stream/dataset == "lambda"'
        tags_on_failure: ["json_parse_fail"]
    - parse_json:
        # Parse the root message object as JSON when possible for general logs
        source: "message"
        destination: "body"
        parse_when: '/data_stream/dataset == "general"'
        tags_on_failure: ["json_parse_fail"]

  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "OPENSEARCH_ENDPOINT" ]
        # Route log data to different target indexes depending on the log context:
        index: "ss4o_${data_stream/type}-${data_stream/dataset}-${data_stream/namespace}"
        aws:
          # Provide a role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          # This role must be the same as the role used above for Kinesis.
          sts_role_arn: "PIPELINE_ROLE_ARN"
          # Provide the Region of the domain.
          region: "REGION"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: false

  1. When your configuration is complete, choose Validate pipeline to check your pipeline syntax for errors.
  2. In the Pipeline role section, optionally enter a suffix to create a unique service role that will be used to start your pipeline run.
  3. In the Network section, select VPC access.

For a Kinesis Data Streams source, you don't need to select a virtual private cloud (VPC), subnets, or security groups. OpenSearch Ingestion only requires these attributes for HTTP data sources that are located inside a VPC. For Kinesis Data Streams, OpenSearch Ingestion uses AWS PrivateLink to read from Kinesis Data Streams and write to OpenSearch domains or serverless collections.

  1. Optionally, enable CloudWatch logging for your pipeline.
  2. Choose Next to review and create your pipeline.

If you're using account-level subscription filters for CloudWatch Logs in the account where OpenSearch Ingestion is running, this log group should be excluded from the account-level subscription. This is because OpenSearch Ingestion pipeline logs could cause a recursive loop with the subscription filter, which could lead to high volumes of log data ingestion and cost.

  1. In the Review and create section, choose Create pipeline.

When your pipeline enters the Active state, you'll see logs begin to populate in your OpenSearch domain or serverless collection.
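
One way to confirm documents are arriving is to list the SS4O indexes the pipeline creates. This sketch uses master-user credentials against a placeholder domain endpoint; with fine-grained access control or IAM-only domains, adjust authentication to match your setup.

  curl -s -u 'master-user:master-password' \
      "https://OPENSEARCH_ENDPOINT/_cat/indices/ss4o_logs-*?v"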

Monitor the solution

To maintain the health of the log ingestion pipeline, there are several key areas to monitor (an example alarm command follows the list):

  • Kinesis Data Streams metrics – You should monitor the following metrics:
    • FailedRecords – Indicates an issue with CloudWatch subscription filters writing to the Kinesis data stream. Reach out to AWS Support if this metric stays at a non-zero level for a sustained period.
    • ThrottledRecords – Indicates your Kinesis data stream needs more shards to accommodate the log volume from CloudWatch.
    • ReadProvisionedThroughputExceeded – Indicates your Kinesis data stream has more consumers consuming read throughput than the shard limits supply, and you may need to move to an enhanced fan-out consumer strategy.
    • WriteProvisionedThroughputExceeded – Indicates your Kinesis data stream needs more shards to accommodate the log volume from CloudWatch, or that your log volume is being unevenly distributed across your shards. Make sure the subscription filter distribution strategy is set to random, and consider enabling enhanced shard-level monitoring on the data stream to identify hot shards.
    • RateExceeded – Indicates that a consumer is incorrectly configured for the stream, and there may be an issue with your OpenSearch Ingestion pipeline causing it to subscribe too often. Check your consumer strategy for the Kinesis data stream.
    • MillisBehindLatest – Indicates the enhanced fan-out consumer isn't keeping up with the load in the data stream. Check the OpenSearch Ingestion pipeline OCU configuration and make sure there are sufficient OCUs to accommodate the Kinesis data stream shards.
    • IteratorAgeMilliseconds – Indicates the polling consumer isn't keeping up with the load in the data stream. Check the OpenSearch Ingestion pipeline OCU configuration and make sure there are sufficient OCUs to accommodate the Kinesis data stream shards, and check the polling strategy for the consumer.
  • CloudWatch subscription filter metrics – You should monitor the following metrics:
    • DeliveryErrors – Indicates an issue with the CloudWatch subscription filter delivering data to the Kinesis data stream. Check the data stream metrics.
    • DeliveryThrottling – Indicates insufficient capacity in the Kinesis data stream. Check the data stream metrics.
  • OpenSearch Ingestion metrics – For recommended monitoring for OpenSearch Ingestion, see Recommended CloudWatch alarms.
  • OpenSearch Service metrics – For recommended monitoring for OpenSearch Service, see Recommended CloudWatch alarms for Amazon OpenSearch Service.
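
As a hedged example, the following creates an alarm on the consumer lag of an enhanced fan-out consumer; the stream name, consumer name, threshold, and SNS topic are placeholders, and the dimensions assume the standard AWS/Kinesis namespace.

  aws cloudwatch put-metric-alarm \
      --alarm-name central-logs-consumer-lag \
      --namespace AWS/Kinesis \
      --metric-name SubscribeToShardEvent.MillisBehindLatest \
      --dimensions Name=StreamName,Value=central-logs Name=ConsumerName,Value=osi-pipeline-consumer \
      --statistic Maximum \
      --period 300 \
      --evaluation-periods 3 \
      --threshold 300000 \
      --comparison-operator GreaterThanThreshold \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts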

Clean up

Make sure you clean up unwanted AWS resources created while following this post in order to prevent additional billing for these resources. Follow these steps to clean up your AWS account (a CLI sketch follows the list):

  1. Delete your Kinesis data stream.
  2. Delete your OpenSearch Service domain.
  3. Use the DeleteAccountPolicy API to remove your account-level CloudWatch subscription filter.
  4. Delete your log group-level CloudWatch subscription filter:
    1. On the CloudWatch console, select the desired log group.
    2. On the Actions menu, choose Subscription Filters and Delete all subscription filter(s).
  5. Delete the OpenSearch Ingestion pipeline.
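
Assuming the placeholder names used earlier in this post (including a hypothetical domain name), the cleanup can be scripted roughly as follows:

  aws osis delete-pipeline --pipeline-name kinesis-pipeline
  aws logs delete-account-policy \
      --policy-name central-logs-to-kinesis \
      --policy-type SUBSCRIPTION_FILTER_POLICY
  aws logs delete-subscription-filter \
      --log-group-name /aws/lambda/my-function \
      --filter-name lambda-logs-to-kinesis
  aws kinesis delete-stream --stream-name central-logs
  aws opensearch delete-domain --domain-name my-log-domain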

Conclusion

In this post, you learned how to create a serverless ingestion pipeline to deliver CloudWatch logs in real time to an OpenSearch domain or serverless collection using OpenSearch Ingestion. You can use this approach for a variety of real-time data ingestion use cases, and add it to existing workloads that use Kinesis Data Streams for real-time data analytics.

You can also explore other use cases for OpenSearch Ingestion and Kinesis Data Streams beyond log aggregation.

To continue improving your log analytics use cases in OpenSearch, consider using some of the pre-built dashboards available through Integrations in OpenSearch Dashboards.


About the authors

M Mehrtens has been working in distributed systems engineering throughout their career, working as a Software Engineer, Architect, and Data Engineer. In the past, M has supported and built systems to process terabytes of streaming data at low latency, run enterprise Machine Learning pipelines, and created systems to share data across teams seamlessly with varying data toolsets and software stacks. At AWS, they are a Sr. Solutions Architect supporting US Federal Financial customers.

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-centered technologies, and is based out of Seattle, Washington.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.
