How to Fix a Yellow Cluster Status in Wazuh Indexer

A healthy Wazuh deployment depends heavily on the stability of the Wazuh Indexer cluster.

When administrators log in to the dashboard and discover a yellow cluster status accompanied by unassigned shards, it often raises concerns about data availability, alert processing, and overall platform reliability.

The Wazuh Indexer, which is built on the OpenSearch search and analytics engine, stores security events, alerts, inventory data, vulnerability information, and other operational records generated throughout the Wazuh ecosystem.

Because the indexer serves as the data layer for the entire platform, any cluster health issue can have downstream effects on detection capabilities and dashboard functionality.

A yellow cluster status indicates that all primary shards are available, but one or more replica shards cannot be assigned to cluster nodes.

While the system remains operational, the cluster is running without full redundancy.

This means that if another node failure occurs before the replica shards are successfully allocated, some data may become unavailable.

Unassigned shards can also create several operational problems, including:

  • Reduced fault tolerance
  • Increased recovery times during failures
  • Slower search and query performance
  • Delayed alert visibility in the dashboard
  • Indexing bottlenecks during peak workloads
  • Increased risk of cluster instability

In many environments, yellow cluster health appears after infrastructure changes, node failures, storage limitations, version upgrades, or incorrect cluster settings.

Administrators frequently encounter the issue after building a new cluster, replacing hardware, expanding storage, or performing maintenance activities.

In this guide, you’ll learn:

  • What a yellow cluster status means in Wazuh Indexer
  • How shard allocation works internally
  • Why shards become unassigned
  • The most common root causes behind yellow cluster health
  • How to diagnose allocation failures using cluster APIs
  • Step-by-step methods to restore a healthy green cluster state
  • Best practices to prevent future shard allocation problems

If you’re already dealing with memory-related Indexer issues, you may also find our guide on How to Tune OpenSearch Heap Size to Stop Wazuh High Memory Crashes useful, since JVM pressure is a frequent contributor to shard allocation failures.


Understanding Yellow Cluster Status in Wazuh Indexer

 

What Is Wazuh Indexer?

Wazuh Indexer is the component responsible for storing, indexing, and searching security data generated throughout the Wazuh platform.

Built on OpenSearch, it acts as the centralized data repository where alerts, agent information, vulnerability data, inventory records, and audit logs are stored for fast retrieval and analysis.

Without the indexer, the Wazuh Dashboard would have no searchable data and many core security monitoring features would cease to function.

Role of the Indexer in the Wazuh Architecture

A typical Wazuh deployment consists of three major components:

  1. Wazuh Manager
  2. Wazuh Indexer
  3. Wazuh Dashboard

The workflow generally looks like this:

Endpoints
     │
     ▼
Wazuh Agents
     │
     ▼
Wazuh Manager
     │
     ▼
Wazuh Indexer
     │
     ▼
Wazuh Dashboard

The manager receives and analyzes security events from agents.

Once processed, the resulting alerts are forwarded to the indexer where they are stored inside OpenSearch indices.

The dashboard then queries the indexer whenever users perform searches, view alerts, investigate incidents, or generate reports.

Relationship Between Wazuh Manager, Indexer, and Dashboard

Each component depends on the others:

Component         Primary Function
Wazuh Manager     Event analysis and rule processing
Wazuh Indexer     Data storage and search
Wazuh Dashboard   Visualization and management interface

If the indexer develops shard allocation issues, the dashboard may begin showing:

  • Missing data
  • Delayed search results
  • Incomplete visualizations
  • Dashboard loading problems

In severe cases, you may encounter errors similar to those discussed in Wazuh Dashboard Not Loading? Complete Troubleshooting Guide and Troubleshooting “No Matching Indices Found” Error in Wazuh Dashboard.

What Does a Yellow Cluster Status Mean?

OpenSearch clusters report health using three primary states:

Status   Meaning
Green    All primary and replica shards are assigned
Yellow   All primary shards assigned, one or more replica shards unassigned
Red      One or more primary shards are unassigned

A yellow status indicates that the cluster remains operational because all primary data is still available.

However, redundancy is compromised.

If a node containing a primary shard fails while its replica remains unassigned, data availability may be affected until recovery procedures are completed.

OpenSearch officially defines yellow health as a state where primary shards are active but replica allocation has not been fully completed.
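The three states can be derived mechanically from a shard listing. The sketch below is a hypothetical helper (the name cluster_color is an assumption, not an OpenSearch tool) that reads shard rows in "index shard prirep state" order, where prirep is p for a primary and r for a replica, and applies the rules from the table above. It is a simplification: it only looks at UNASSIGNED shards and ignores shards that are still initializing.

```shell
# cluster_color: hypothetical helper, not part of OpenSearch.
# Reads lines of "index shard prirep state" on stdin (the column order of
# _cat/shards?h=index,shard,prirep,state) and prints the implied health color.
cluster_color() {
  awk '
    $4 == "UNASSIGNED" && $3 == "p" { red = 1 }     # a primary is missing
    $4 == "UNASSIGNED" && $3 == "r" { yellow = 1 }  # only a replica is missing
    END {
      if (red) print "red"
      else if (yellow) print "yellow"
      else print "green"
    }
  '
}

# Sample rows standing in for real API output.
cluster_color <<'EOF'
wazuh-alerts-4.x-2026.06 0 p STARTED
wazuh-alerts-4.x-2026.06 0 r UNASSIGNED
EOF
```

With the sample rows, the primary is started and only the replica is unassigned, so the helper reports yellow.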

Primary Shards vs Replica Shards

Every OpenSearch index is divided into shards.

Primary Shards

Primary shards store the original indexed data.

Every document written to an index is first stored within a primary shard.

Replica Shards

Replica shards are copies of primary shards.

They provide:

  • High availability
  • Improved fault tolerance
  • Additional search capacity
  • Faster recovery during failures

For example:

Index: wazuh-alerts

Primary Shard 1 → Node A
Replica Shard 1 → Node B

If Node A fails, OpenSearch promotes the replica on Node B to primary, and the data stays available.

Why Yellow Status Usually Indicates Replica Allocation Problems

In most Wazuh deployments, yellow status occurs because replica shards cannot be assigned.

Common reasons include:

  • Single-node clusters
  • Offline cluster members
  • Disk space limitations
  • Allocation restrictions
  • Resource exhaustion

Since primary shards remain healthy, the cluster stays functional, but administrators should still resolve the issue to restore redundancy.

What Are Unassigned Shards?

An unassigned shard is a shard that OpenSearch knows should exist but cannot currently place on any node.

The shard remains part of cluster metadata but has no active location.

How Shards Become Unassigned

Common scenarios include:

  • Node crashes
  • Server reboots
  • Storage failures
  • Disk watermark violations
  • Replica allocation conflicts
  • Cluster reconfiguration errors
  • Index corruption

When one of these events occurs, OpenSearch attempts to relocate the shard automatically.

If relocation fails, the shard becomes unassigned.

Impact on Search Performance and Fault Tolerance

Unassigned shards affect cluster behavior in several ways.

Reduced fault tolerance

The biggest concern is the loss of redundancy. If a primary shard fails before a replica becomes available, recovery becomes much more difficult.

Longer recovery times

Clusters with unassigned replicas typically require more time to rebalance after node failures.

Reduced search scalability

Replica shards help distribute query workloads across multiple nodes. Missing replicas force all searches onto fewer resources.

Potential indexing delays

In heavily loaded clusters, allocation issues can increase indexing pressure and slow ingestion rates.

OpenSearch operational guidance generally recommends keeping all shards assigned to maximize resilience and cluster performance.


Common Causes of Unassigned Shards in Wazuh Indexer

 

Single-Node Wazuh Deployments with Replica Shards

The most common cause of yellow cluster health is a single-node deployment configured with replica shards.

OpenSearch intentionally prevents replica shards from being placed on the same node as their primary shard.

For example:

Node 1
 ├─ Primary Shard 1
 └─ Replica Shard 1 ❌

Because both copies would reside on the same server, redundancy would provide no protection against node failure.

As a result, replica shards remain unassigned and the cluster reports yellow status.

Why Replicas Cannot Be Allocated on the Same Node

This behavior is by design.

OpenSearch's allocation rules never place two copies of the same shard on one node, because a replica that shares a node with its primary adds no fault tolerance.

A replica must reside on a different node than its primary counterpart.

Common Default Configuration Issues

Administrators often encounter yellow status immediately after:

  • Installing a new single-node Wazuh environment
  • Creating new indices with default replica settings
  • Migrating from a clustered environment to a standalone server

In these cases, reducing the replica count to zero may be appropriate.

Node Failures or Offline Nodes

In multi-node clusters, yellow health frequently occurs when one or more nodes become unavailable.

Cluster Member Outages

Possible causes include:

  • Hardware failures
  • Power outages
  • VM crashes
  • Operating system failures

When a node disappears, OpenSearch attempts to reassign affected shards elsewhere.

If reassignment cannot occur, shards become unassigned.

Network Interruptions

Temporary connectivity problems can cause nodes to leave the cluster unexpectedly.

Examples include:

  • Firewall rule changes
  • DNS failures
  • Routing issues
  • VLAN misconfigurations

Node Communication Problems

OpenSearch relies on continuous cluster communication.

Issues with certificates, transport-layer connectivity, or node discovery can prevent healthy shard allocation.

If you’re troubleshooting certificate-related cluster communication failures, see How to Fix Wazuh Certificate Errors.

Disk Watermark Thresholds Reached

OpenSearch includes protective mechanisms that prevent shard allocation when storage becomes critically low.

This behavior helps avoid data corruption and catastrophic disk exhaustion.

Low Disk Watermark

At the low watermark threshold, OpenSearch begins avoiding allocation to nodes with limited free space.

High Disk Watermark

At the high watermark threshold, OpenSearch actively relocates shards away from overloaded nodes.

Flood-Stage Watermark

At flood-stage levels, indices may become read-only to prevent further storage consumption.

This frequently causes both indexing and allocation problems.

Cluster Allocation Settings Preventing Assignment

Misconfigured cluster settings can unintentionally block shard placement.

Disabled Shard Allocation

Administrators sometimes disable allocation during maintenance operations and forget to re-enable it afterward.

A typical example is the following cluster setting:

cluster.routing.allocation.enable: none

This immediately prevents shard assignment.

Allocation Filtering Rules

Node filters can restrict where shards are allowed to reside.

Improper filters may eliminate all eligible nodes.

Cluster Routing Restrictions

Routing constraints based on node attributes can also prevent successful allocation.

Insufficient Cluster Resources

Resource exhaustion is another common cause of yellow cluster status.

Memory Pressure

When JVM memory usage becomes excessive, nodes may become unstable, drop out of the cluster, or fail shard recoveries.

This is one reason why proper heap sizing is critical.

See How to Tune OpenSearch Heap Size to Stop Wazuh High Memory Crashes.

CPU Bottlenecks

High CPU utilization can slow cluster operations, delay rebalancing, and extend shard recovery times.

If system-wide resource pressure exists, review Why Is Wazuh Using High CPU? Troubleshooting Guide.

JVM Heap Limitations

Insufficient heap allocation can cause:

  • Long garbage collection pauses
  • Slow cluster state updates
  • Failed shard recoveries
  • Node instability

Corrupted or Missing Index Data

Occasionally, OpenSearch cannot allocate a shard because the underlying data is damaged.

Unexpected Shutdowns

Abrupt power loss or forced process termination can leave shard data in an inconsistent state.

Filesystem Corruption

Storage corruption may prevent OpenSearch from reading required shard files.

Storage Failures

Failing SSDs, RAID problems, and network storage interruptions can all trigger unassigned shard scenarios.

Version Mismatches Between Cluster Nodes

Mixed software versions can create shard allocation problems after upgrades.

Mixed-Version Clusters

Running different Indexer/OpenSearch versions across nodes may lead to compatibility issues.

While temporary mixed-version states are often supported during rolling upgrades, prolonged version inconsistencies can cause cluster instability.

Upgrade-Related Allocation Issues

Administrators sometimes encounter yellow status after:

  • Partial upgrades
  • Failed upgrades
  • Interrupted migrations
  • Incomplete cluster restarts

When troubleshooting post-upgrade issues, verify that all cluster nodes are running compatible versions and that shard migrations completed successfully.


How to Check Cluster Health

Before attempting to fix unassigned shards, you should first verify the current health of your Wazuh Indexer cluster.

OpenSearch provides several APIs that reveal cluster status, shard allocation information, node availability, and resource utilization.

Checking cluster health helps you determine:

  • Whether the cluster is green, yellow, or red
  • How many shards are affected
  • Whether node failures are involved
  • If the issue is isolated or cluster-wide
  • Which troubleshooting steps should be performed next

Verify Current Cluster Status

The quickest way to check cluster health is using the Cluster Health API:

curl -k -u admin:password \
"https://localhost:9200/_cluster/health?pretty"

Example output:

{
  "cluster_name": "wazuh-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "active_primary_shards": 120,
  "active_shards": 240,
  "unassigned_shards": 5
}

This command provides a high-level overview of the cluster’s current condition.

If your cluster status is yellow, pay special attention to the number of unassigned shards and available nodes.

Key Fields to Review

status

The status field indicates overall cluster health.

Possible values include:

Status   Meaning
Green    All primary and replica shards assigned
Yellow   Replica shards unassigned
Red      Primary shards unavailable

A yellow cluster is generally less urgent than a red cluster because data remains accessible through primary shards.

active_shards

This field shows the total number of active shards currently assigned within the cluster.

A sudden decrease may indicate node failures or allocation issues.

unassigned_shards

This is one of the most important metrics when troubleshooting yellow cluster health.

Example:

"unassigned_shards": 12

Any value greater than zero indicates allocation problems that require investigation.

number_of_nodes

This field displays the number of nodes currently participating in the cluster.

Compare this value against your expected cluster size.

For example:

"number_of_nodes": 2

If you expect three nodes but only see two, a node may be offline or unable to communicate with the cluster.
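For routine checks, the fields above can be condensed into a single line. The following is a minimal sketch, assuming the health JSON is piped in on stdin; the helper name health_summary is hypothetical, and it uses sed rather than a JSON parser so it runs on a minimal system (a tool such as jq would be more robust where available).

```shell
# health_summary: hypothetical helper that condenses a _cluster/health JSON
# document (read from stdin) into one line.
health_summary() {
  json=$(cat)
  # Print the value of one key. The leading quote in the pattern keeps
  # "unassigned_shards" from matching "delayed_unassigned_shards".
  field() {
    printf '%s\n' "$json" |
      sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\{0,1\}\([A-Za-z0-9_]*\).*/\1/p' |
      head -n 1
  }
  printf 'status=%s nodes=%s unassigned=%s\n' \
    "$(field status)" "$(field number_of_nodes)" "$(field unassigned_shards)"
}

# Sample document mirroring the example output shown earlier.
health_summary <<'EOF'
{
  "cluster_name": "wazuh-cluster",
  "status": "yellow",
  "number_of_nodes": 3,
  "unassigned_shards": 5
}
EOF
```

Against a live cluster, the same helper could be fed by piping the Cluster Health API response from curl -s into it.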

Check Cluster Health from Wazuh Dashboard

Administrators can also review cluster health through the Wazuh Dashboard.

This approach is useful when API access is unavailable or when performing routine health checks.

Navigating to Indexer Monitoring Views

Depending on your Wazuh version, navigate to:

Dashboard
 └── Index Management
      └── Indices

or

Dashboard
 └── Dev Tools

Some deployments may also expose OpenSearch monitoring dashboards that provide:

  • Cluster health status
  • Node availability
  • Shard allocation statistics
  • JVM utilization
  • Disk usage metrics

Identifying Warning Indicators

Common indicators of shard allocation problems include:

  • Yellow cluster health warnings
  • Unassigned shard alerts
  • Missing index replicas
  • Increased cluster recovery activity
  • Delayed dashboard searches
  • Incomplete visualizations

If dashboard searches are returning incomplete results, you may also encounter symptoms discussed in Troubleshooting “No Matching Indices Found” Error in Wazuh Dashboard.

Similarly, severe allocation issues can eventually contribute to dashboard availability problems covered in Wazuh Dashboard Not Loading? Complete Troubleshooting Guide.


Identify Which Shards Are Unassigned

Once you’ve confirmed that the cluster is yellow, the next step is determining exactly which shards are affected.

The Cluster Health API only reports the number of unassigned shards.

To identify the specific indices involved, you’ll need to inspect shard allocation details.

List Unassigned Shards

Use the Cat Shards API:

curl -k -u admin:password \
"https://localhost:9200/_cat/shards?v"

Example output:

index                      shard prirep state      node
wazuh-alerts-4.x-2026.06   0     p      STARTED    node-1
wazuh-alerts-4.x-2026.06   0     r      UNASSIGNED

This output provides a detailed view of every shard in the cluster.

Understanding the Output

Several fields are particularly useful when investigating allocation problems.

Index Name

The index column identifies which index owns the shard.

Example:

wazuh-alerts-4.x-2026.06

This helps determine whether the problem affects:

  • Alert indices
  • Vulnerability indices
  • Inventory data
  • Custom indices

Shard Number

The shard field identifies the specific shard ID.

Example:

0
1
2

Multiple unassigned shards from the same index often indicate a broader allocation problem.

Primary vs Replica

The prirep column indicates whether the shard is primary or replica.

Values include:

p = Primary
r = Replica

Most yellow cluster issues involve replica shards.

Example:

r

If primary shards become unassigned, the cluster status typically changes to red.

Current State

The state field displays the shard’s current status.

Examples include:

STARTED
INITIALIZING
RELOCATING
UNASSIGNED

Any shard marked as UNASSIGNED requires further investigation.

Display Only Unassigned Shards

Large clusters may contain hundreds or thousands of shards.

To simplify analysis, display only the key fields and filter the output for unassigned shards:

curl -s -k -u admin:password \
"https://localhost:9200/_cat/shards?h=index,shard,prirep,state,node" | grep UNASSIGNED

Example output:

wazuh-alerts-4.x-2026.06 0 r UNASSIGNED
wazuh-monitoring-2026.06 1 r UNASSIGNED

This allows you to quickly identify:

  • Which indices are affected
  • Whether the shard is primary or replica
  • The scale of the allocation problem
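When many shards are affected, a per-index tally is easier to read than the raw list. The sketch below assumes shard rows in "index shard prirep state node" order on stdin; the helper name unassigned_by_index and the sample rows are illustrative only.

```shell
# unassigned_by_index: hypothetical helper. Reads _cat/shards rows in
# "index shard prirep state [node]" order on stdin and prints, for every index
# that has unassigned shards, how many primaries and replicas are affected.
unassigned_by_index() {
  awk '
    $4 == "UNASSIGNED" {
      seen[$1] = 1
      if ($3 == "p") p[$1]++; else r[$1]++
    }
    END {
      for (i in seen) printf "%s primaries=%d replicas=%d\n", i, p[i], r[i]
    }
  ' | sort
}

# Sample rows standing in for real API output.
unassigned_by_index <<'EOF'
wazuh-alerts-4.x-2026.06 0 p STARTED node-1
wazuh-alerts-4.x-2026.06 0 r UNASSIGNED
wazuh-monitoring-2026.06 1 r UNASSIGNED
EOF
```

Any index that reports a non-zero primaries count points to a red cluster rather than a yellow one and should be handled first.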

At this stage, you’ve identified what is unassigned.

The next step is determining why OpenSearch refuses to allocate those shards.


Use the Cluster Allocation Explain API

The Cluster Allocation Explain API is one of the most valuable troubleshooting tools available in OpenSearch.

Rather than guessing why shards remain unassigned, this API provides detailed allocation decisions generated by the cluster’s allocation engine.

In many cases, the explanation immediately identifies the root cause.

Determine Why a Shard Cannot Be Assigned

Run the following command:

curl -k -u admin:password \
-X GET "https://localhost:9200/_cluster/allocation/explain?pretty"

Example response:

{
  "index": "wazuh-alerts-4.x-2026.06",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "can_allocate": "no"
}

The output may also include detailed node-by-node allocation decisions that explain why OpenSearch rejected candidate nodes.
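Called without a request body, the API explains an unassigned shard of its own choosing. To ask about a specific shard, the request can carry a body naming the index, the shard number, and whether the primary or a replica is meant. The helper below is a hypothetical sketch that only builds that body; the index name is taken from the example above.

```shell
# explain_body: hypothetical helper that prints the JSON request body for
# _cluster/allocation/explain. Arguments: index name, shard number, and
# "true" for the primary copy or "false" for a replica.
explain_body() {
  printf '{"index": "%s", "shard": %d, "primary": %s}\n' "$1" "$2" "$3"
}

explain_body wazuh-alerts-4.x-2026.06 0 false

# Assumed usage against a live cluster (host and credentials as in the
# command above):
#   curl -k -u admin:password -X GET \
#     "https://localhost:9200/_cluster/allocation/explain?pretty" \
#     -H 'Content-Type: application/json' \
#     -d "$(explain_body wazuh-alerts-4.x-2026.06 0 false)"
```

Targeting a shard this way is useful when several shards are unassigned for different reasons.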

Both the OpenSearch and Elasticsearch documentation point to this API as a starting point when investigating shard allocation failures.

How to Interpret Allocation Explanations

The most important information appears in the allocation decision section.

NO Decisions

A NO decision means OpenSearch found no eligible node for shard placement.

Example causes include:

  • Disk thresholds exceeded
  • Allocation disabled
  • Node attribute restrictions
  • Missing nodes
  • Replica conflicts

Example:

"can_allocate": "no"

These situations typically require configuration changes before allocation can proceed.

THROTTLE Decisions

A THROTTLE decision indicates that allocation is allowed but temporarily delayed.

This commonly occurs when:

  • Large recoveries are already running
  • Cluster rebalancing is active
  • Recovery limits have been reached

Example:

"can_allocate": "throttled"

In many cases, waiting for existing recoveries to finish resolves the issue automatically.

Disk-Related Restrictions

The allocation explanation may indicate that a node exceeds configured disk watermarks.

Example message (paraphrased; the actual text also reports the configured threshold and the node's free space):

node exceeds high disk watermark

When this occurs, OpenSearch blocks new shard allocations until sufficient storage is freed.

If disk pressure is widespread across the cluster, additional storage or node expansion may be required.

Replica Allocation Conflicts

One of the most common explanations in Wazuh environments is the following (paraphrased; the exact wording varies by version):

cannot allocate replica shard to same node as primary shard

This typically occurs when:

  • A single-node deployment uses replicas
  • Too few nodes exist for the configured replica count
  • One or more cluster nodes are offline

For example:

Cluster Nodes: 1
Replicas: 1

Because the replica cannot be stored on the same node as its primary shard, OpenSearch leaves the replica unassigned and reports yellow cluster health.

Once you’ve identified the allocation reason, you can begin applying targeted fixes rather than relying on trial-and-error troubleshooting.


Fix 1: Reduce Replica Count in Single-Node Deployments

For single-node Wazuh installations, unassigned replica shards are by far the most common reason for a yellow cluster status.

In many cases, nothing is actually broken. OpenSearch is simply unable to place replica shards because no additional nodes exist within the cluster.

If your deployment contains only one Indexer node, reducing the replica count to zero is often the correct solution.

Why Replica Shards Cause Yellow Status

OpenSearch is designed to maintain redundancy by storing replica shards on different nodes than their corresponding primary shards.

When a cluster contains only a single node, this requirement cannot be satisfied.

As a result:

  • Primary shards remain assigned
  • Replica shards remain unassigned
  • Cluster health becomes yellow

Replica Allocation Requirements

Consider the following example:

Node-1
 ├── Primary Shard 0
 └── Replica Shard 0 ❌

OpenSearch intentionally prevents the replica from being placed on the same node.

The allocation engine treats this as an invalid configuration because both copies would be lost if the server failed.

According to OpenSearch shard allocation guidelines, replica shards must always reside on separate nodes to provide meaningful fault tolerance.

Single-Node Limitations

A single-node cluster can safely operate without replicas.

However, administrators should understand the trade-off:

Configuration            Fault Tolerance
1 Primary + 1 Replica    High (requires at least two nodes)
1 Primary + 0 Replicas   None

For labs, test environments, proof-of-concept deployments, and small standalone servers, setting replicas to zero is generally acceptable.

Production environments should typically use multiple Indexer nodes instead.

Check Current Replica Configuration

Before making changes, verify the current replica settings.

Run:

curl -k -u admin:password \
"https://localhost:9200/_all/_settings?pretty"

Look for:

"number_of_replicas": "1"

or

"number_of_replicas": "2"

If replicas are configured but only one node exists in the cluster, yellow status is expected.
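The rule behind this can be written as simple arithmetic: one primary plus N replicas needs at least N + 1 nodes. The sketch below uses a hypothetical helper name, replicas_assignable, and takes the node count and replica count as plain arguments rather than reading them from the cluster.

```shell
# replicas_assignable: hypothetical helper. Given the number of data nodes and
# the configured number_of_replicas, report whether every shard copy can be
# placed on its own node (1 primary + N replicas needs N + 1 nodes).
replicas_assignable() {
  nodes=$1
  replicas=$2
  if [ $((replicas + 1)) -le "$nodes" ]; then
    echo "ok: $replicas replica(s) fit on $nodes node(s)"
  else
    echo "yellow expected: at most $((nodes - 1)) replica(s) fit on $nodes node(s)"
  fi
}

replicas_assignable 1 1   # single node with one replica
replicas_assignable 3 2   # three nodes with two replicas
```

The first call describes the single-node case discussed here; the second shows a three-node cluster that can hold two replicas of every shard.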

Set Replicas to Zero

Update all indices:

curl -k -u admin:password \
-X PUT "https://localhost:9200/*/_settings" \
-H 'Content-Type: application/json' \
-d '{
  "index": {
    "number_of_replicas": 0
  }
}'

Expected response:

{
  "acknowledged": true
}

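This instructs OpenSearch to stop expecting replica shards for the indices that already exist. Indices created afterward still take their replica count from the index template, so the template may also need updating to keep new indices from reintroducing the problem.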

The previously unassigned replicas will disappear from cluster health calculations.

Verify Cluster Returns to Green

After applying the change, verify cluster health:

curl -k -u admin:password \
"https://localhost:9200/_cluster/health?pretty"

Expected output:

{
  "status": "green"
}

If the cluster remains yellow, another allocation issue is likely present and additional troubleshooting is required.


Fix 2: Restore Offline Indexer Nodes

In multi-node deployments, yellow cluster health frequently occurs because one or more Indexer nodes have gone offline.

When a node leaves the cluster unexpectedly, OpenSearch may be unable to allocate replica shards until the missing node returns or replacement capacity becomes available.

Check Cluster Nodes

First, determine which nodes are currently participating in the cluster.

Run:

curl -k -u admin:password \
"https://localhost:9200/_cat/nodes?v"

Example output:

ip            heap.percent ram.percent cpu load_1m node.role master name
10.0.0.10     45           70          12  0.32    dimr      *      node-1
10.0.0.11     52           68          18  0.40    dimr      -      node-2

Compare the results against your expected cluster design.

For example:

Expected Nodes   Detected Nodes
3                2

If a node is missing, that is likely the cause of the unassigned shards.
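The comparison can be scripted so that the missing node is named explicitly. This is a minimal sketch, assuming the node names are listed one per line (for example from _cat/nodes?h=name); the helper name missing_nodes and the expected list are hypothetical.

```shell
# missing_nodes: hypothetical helper. $1 is a newline-separated list of the
# node names you expect; stdin is a list of the names currently in the cluster.
# Prints every expected node that is absent.
missing_nodes() {
  actual=$(awk '{ print $1 }')   # keep the first column, dropping any padding
  printf '%s\n' "$1" | while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf '%s\n' "$actual" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Assumed cluster design for illustration.
expected='node-1
node-2
node-3'

# Sample input standing in for the API response.
printf 'node-1\nnode-2\n' | missing_nodes "$expected"
```

Here node-3 is reported, which narrows the investigation to a single host before any cluster settings are touched.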

Verify Node Connectivity

If nodes are missing, investigate communication issues before modifying cluster settings.

Network Communication Checks

Confirm basic connectivity between Indexer servers:

ping node-2

or

telnet node-2 9300

Connectivity failures may indicate routing or network problems.

Firewall Verification

Ensure firewall rules allow OpenSearch transport traffic between cluster members.

Common issues include:

  • Newly applied firewall policies
  • Cloud security group changes
  • VLAN segmentation changes
  • Internal ACL restrictions

OpenSearch transport traffic must be permitted between all cluster nodes.

TLS Configuration Validation

TLS certificate problems can prevent nodes from joining the cluster.

Review:

  • Node certificates
  • Certificate expiration dates
  • Trusted CA configuration
  • Transport layer TLS settings

Certificate-related cluster communication failures are discussed in detail in How to Fix Wazuh Certificate Errors.

Restart Failed Indexer Services

If connectivity appears normal but a node remains offline, restart the Indexer service.

Linux Systems

On Linux systems:

systemctl restart wazuh-indexer

Monitor logs during startup:

journalctl -u wazuh-indexer -f

Watch for:

  • Cluster join failures
  • TLS errors
  • Memory allocation issues
  • Corrupted index warnings

Verify Service Status

After restarting, confirm the service is running correctly:

systemctl status wazuh-indexer

Example healthy output:

Active: active (running)

Once the node rejoins the cluster, OpenSearch should begin allocating previously unassigned replica shards automatically.

If memory-related crashes caused the outage, review How to Tune OpenSearch Heap Size to Stop Wazuh High Memory Crashes.


Fix 3: Resolve Disk Watermark Issues

Disk usage is one of the most common causes of unassigned shards in production Wazuh clusters.

To prevent storage exhaustion and potential data corruption, OpenSearch enforces disk watermark thresholds that restrict shard allocation when storage becomes critically low.

When these thresholds are exceeded, replica shards often become unassigned and cluster health changes to yellow.

Check Disk Usage Across Nodes

Begin by reviewing allocation and storage usage:

curl -k -u admin:password \
"https://localhost:9200/_cat/allocation?v"

Example output:

shards disk.indices disk.used disk.avail disk.total disk.percent host
120    450gb        780gb     20gb       800gb      97          node-1

Pay particular attention to:

  • disk.percent
  • disk.avail
  • disk.used

Nodes approaching capacity are common candidates for allocation failures.

Understand Watermark Thresholds

OpenSearch uses three primary disk watermarks.

Low Watermark

Default: 85% disk utilization.

Once exceeded, OpenSearch begins avoiding new shard allocations on the affected node.

High Watermark

Default: 90% disk utilization.

At this threshold, OpenSearch actively relocates shards away from the node whenever possible.

Flood Stage Watermark

Default: 95% disk utilization.

At this stage:

  • Indices may become read-only
  • Indexing operations may fail
  • Allocation restrictions become much more aggressive
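The three thresholds can be applied to each node's disk usage to see which stage it has reached. The sketch below assumes the default values of 85%, 90%, and 95% described above and "node disk.percent" pairs as input; the helper name watermark_level and the sample figures are illustrative, and clusters with customized thresholds would need different numbers.

```shell
# watermark_level: hypothetical helper that maps a disk usage percentage to
# the default watermark it has exceeded (85% low, 90% high, 95% flood stage).
watermark_level() {
  if   [ "$1" -gt 95 ]; then echo "flood-stage"
  elif [ "$1" -gt 90 ]; then echo "high"
  elif [ "$1" -gt 85 ]; then echo "low"
  else                       echo "ok"
  fi
}

# Classify each node from "node disk.percent" pairs, the column order of
# _cat/allocation?h=node,disk.percent. Sample data stands in for the API.
while read -r node pct; do
  # Skip rows without a numeric percentage (such as an UNASSIGNED summary row).
  case $pct in ''|*[!0-9]*) continue ;; esac
  printf '%s %s%% %s\n' "$node" "$pct" "$(watermark_level "$pct")"
done <<'EOF'
node-1 97
node-2 88
node-3 61
EOF
```

In this sample, node-1 is past flood stage, node-2 has crossed only the low watermark, and node-3 is unaffected.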

OpenSearch engineers strongly recommend maintaining sufficient free space to avoid reaching flood-stage conditions.

Free Disk Space

If disk watermarks are preventing allocation, the safest solution is to reduce storage consumption.

Delete Old Indices

Many Wazuh environments retain historical alert data longer than necessary.

Review older indices:

curl -k -u admin:password \
"https://localhost:9200/_cat/indices?v"

Delete unused historical data after confirming retention requirements.

Before removing data, review your retention strategy in How to Configure Wazuh Log Retention.

Expand Storage

If data growth is expected to continue, increasing storage capacity is often preferable to deleting data.

Options include:

  • Expanding virtual disks
  • Adding larger SSDs
  • Migrating to larger storage volumes
  • Adding additional Indexer nodes

Archive Historical Data

Older security data may be archived to external storage instead of remaining in the active cluster.

Common destinations include:

  • Object storage
  • Backup repositories
  • Long-term compliance archives

This approach preserves historical visibility while reducing Indexer storage pressure.

Temporarily Adjust Watermark Settings

If immediate allocation is required while storage remediation is underway, you can temporarily increase watermark thresholds.

curl -k -u admin:password \
-X PUT "https://localhost:9200/_cluster/settings" \
-H 'Content-Type: application/json' \
-d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'

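This may allow shard allocation to proceed temporarily. Note that 95% matches the default flood-stage threshold, so this high watermark leaves no margin before indices become read-only.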

However, this should be considered a short-term workaround rather than a permanent solution.

Running clusters near full capacity increases the risk of:

  • Performance degradation
  • Allocation instability
  • Index corruption
  • Flood-stage lockouts

After freeing space or expanding storage, restore watermark values to their recommended settings and verify that all shards have been successfully assigned.
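One way to restore the defaults is to clear the temporary overrides: assigning null to a cluster setting removes it, so the value falls back to any persistent setting or the built-in default. The sketch below only prints the request body; the helper name reset_watermarks_body is hypothetical, and the commented curl call assumes the same host and credentials used throughout this guide.

```shell
# reset_watermarks_body: hypothetical helper that prints a _cluster/settings
# request body. A null value removes the transient override for that setting.
reset_watermarks_body() {
  cat <<'EOF'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null
  }
}
EOF
}

reset_watermarks_body

# Assumed usage against a live cluster:
#   curl -k -u admin:password -X PUT "https://localhost:9200/_cluster/settings" \
#     -H 'Content-Type: application/json' -d "$(reset_watermarks_body)"
```

Clearing the override, rather than writing 85% and 90% back explicitly, avoids leaving stale values behind if the defaults are ever changed elsewhere.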


Fix 4: Re-Enable Shard Allocation

Administrators often disable shard allocation temporarily during maintenance activities, cluster upgrades, node replacements, or migration operations.

If allocation is not re-enabled afterward, replica shards remain unassigned and the cluster continues reporting a yellow status.

Fortunately, this is one of the easiest allocation problems to diagnose and resolve.

Check Current Allocation Settings

Begin by reviewing the cluster’s allocation configuration:

curl -k -u admin:password \
"https://localhost:9200/_cluster/settings?pretty"

Look for settings similar to:

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "none"
        }
      }
    }
  }
}

or

"cluster.routing.allocation.enable": "primaries"

Common values include:

Setting         Meaning
all             Allocate all shard types
primaries       Allocate only primary shards
new_primaries   Allocate newly created primaries only
none            Disable all shard allocation

If the setting is anything other than all, it may explain why replica shards remain unassigned.
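The check can be reduced to a single answer by scanning the settings response for explicit overrides. This is a minimal sketch, assuming the response was requested with flat_settings=true and pretty-printed so that each setting appears on its own line; the helper name allocation_overrides is hypothetical.

```shell
# allocation_overrides: hypothetical helper. Reads the JSON returned by
# _cluster/settings?flat_settings=true&pretty on stdin and prints every
# explicit cluster.routing.allocation.enable value (persistent or transient).
# Prints "all" when no override exists, since that is the default.
allocation_overrides() {
  found=$(sed -n 's/.*"cluster\.routing\.allocation\.enable"[[:space:]]*:[[:space:]]*"\([a-z_]*\)".*/\1/p')
  if [ -n "$found" ]; then
    printf '%s\n' "$found"
  else
    echo "all"
  fi
}

# Sample response standing in for real API output.
allocation_overrides <<'EOF'
{
  "persistent" : {
    "cluster.routing.allocation.enable" : "none"
  },
  "transient" : { }
}
EOF
```

Any output other than all means allocation is restricted and is worth correcting before looking for other causes.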

Enable Allocation

Restore normal shard allocation using:

curl -k -u admin:password \
-X PUT "https://localhost:9200/_cluster/settings" \
-H 'Content-Type: application/json' \
-d '{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}'

Expected response:

{
  "acknowledged": true
}

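This instructs OpenSearch to resume normal allocation and rebalancing operations across the cluster. If allocation was disabled through a transient setting, apply the same change under "transient" as well, because transient settings take precedence over persistent ones.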

Verify Allocation Resumes

After enabling allocation, monitor cluster recovery progress:

curl -k -u admin:password \
"https://localhost:9200/_cluster/health?pretty"

You should see:

  • Decreasing unassigned shard counts
  • Increasing active shard counts
  • Ongoing recovery activity

You can also monitor shard movement:

curl -k -u admin:password \
"https://localhost:9200/_cat/recovery?v"

Example:

index                  shard time type stage
wazuh-alerts-4.x       0     10s  peer done

If allocation resumes successfully, cluster health should eventually transition from yellow to green.

After maintenance windows, always verify that allocation has been re-enabled before returning the cluster to production service.


Fix 5: Address Memory and JVM Pressure

Insufficient memory is another common reason shards remain unassigned.

Even when disk space and node availability are healthy, shard recoveries can stall or fail if nodes are experiencing excessive JVM heap pressure.

High heap utilization can slow cluster state updates, delay shard recoveries, and cause node instability.

Check JVM Heap Utilization

Start by reviewing JVM statistics:

curl -k -u admin:password \
"https://localhost:9200/_nodes/stats/jvm?pretty"

Example output (abridged; in the full response these fields appear under nodes.<node_id>.jvm.mem for each node):

{
  "heap_used_percent": 87,
  "heap_max_in_bytes": 8589934592
}

Pay particular attention to:

  • heap_used_percent
  • heap_max_in_bytes
  • garbage collection statistics
  • memory pool utilization

Clusters consistently operating near heap limits frequently experience allocation problems.
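For a quicker per-node overview, the _cat/nodes API can return just the node name and heap percentage. The sketch below filters that output against a threshold; the function name nodes_over_heap and the default of 85 are assumptions chosen to match the warning level discussed in this section.

```shell
# Hypothetical helper: reads "<node name> <heap percent>" lines on stdin and
# prints the nodes whose heap usage is above the given threshold.
nodes_over_heap() {
  threshold="${1:-85}"
  awk -v limit="$threshold" 'NF >= 2 && $NF + 0 > limit + 0 { print $1 " " $NF "%" }'
}

# Example call against a live cluster (credentials are placeholders):
# curl -sk -u admin:password \
#   "https://localhost:9200/_cat/nodes?h=name,heap.percent" | nodes_over_heap 85
```

A single reading only shows the current moment, so run it several times before concluding that a node is under sustained pressure.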

Identify Memory Bottlenecks

Several warning signs indicate memory-related allocation issues.

High Heap Utilization

A healthy OpenSearch cluster generally maintains enough free heap space to handle indexing, searches, and shard recovery operations.

Warning signs include:

Heap Usage > 85%

Sustained utilization above this threshold often correlates with performance degradation.

Frequent Garbage Collection

Excessive garbage collection activity can indicate memory pressure.

Symptoms include:

  • Slow queries
  • Delayed indexing
  • Cluster state update delays
  • Recovery timeouts

Example log entries:

[gc][young] duration [3.4s]

or

[gc][old] duration [12.7s]

Long GC pauses can temporarily prevent allocation activities from completing.

Out-of-Memory Conditions

The most severe scenario involves JVM crashes.

Look for errors such as:

java.lang.OutOfMemoryError: Java heap space

Nodes experiencing OOM events may leave the cluster entirely, creating additional unassigned shards.

Recommended Heap Configuration

Proper heap sizing is one of the most important aspects of Indexer stability.

Sizing Guidelines

General OpenSearch recommendations include:

Server RAM   Recommended Heap
8 GB         4 GB
16 GB        8 GB
32 GB        16 GB
64 GB        31 GB

The general OpenSearch recommendation is to allocate approximately 50% of available system memory to the JVM heap, without exceeding roughly 31 GB, while leaving the remaining RAM for the operating system and filesystem cache.
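The sizing rule can be expressed as a small calculation. This is a minimal sketch that assumes RAM is given in whole gigabytes; the function name suggest_heap_flags is hypothetical. On package-based installs the resulting lines typically belong in /etc/wazuh-indexer/jvm.options, but verify the path on your own system.

```shell
# Hypothetical helper: prints suggested -Xms/-Xmx lines for a given amount of
# system RAM in whole GB, using the 50% rule capped at 31 GB.
suggest_heap_flags() {
  ram_gb="$1"
  heap_gb=$(( ram_gb / 2 ))
  if [ "$heap_gb" -gt 31 ]; then
    heap_gb=31
  fi
  if [ "$heap_gb" -lt 1 ]; then
    heap_gb=1
  fi
  # Both values are identical so the heap never resizes at runtime.
  printf '%s\n' "-Xms${heap_gb}g" "-Xmx${heap_gb}g"
}

# Example: suggest_heap_flags 16
```

The Indexer must be restarted for heap changes to take effect.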

Heap Tuning Best Practices

For Wazuh Indexer environments:

  • Keep Xms and Xmx identical
  • Avoid heap sizes larger than 31–32 GB
  • Monitor garbage collection behavior regularly
  • Review shard counts periodically
  • Scale nodes before memory becomes constrained

For a deeper discussion of heap sizing and memory optimization, see How to Tune OpenSearch Heap Size to Stop Wazuh High Memory Crashes.

If excessive CPU usage accompanies memory pressure, you may also find Why Is Wazuh Using High CPU? Troubleshooting Guide useful.

After addressing memory issues, OpenSearch may automatically resume shard allocation once sufficient resources become available.


Fix 6: Force a Shard Reroute

In some situations, the original allocation problem has already been resolved, but OpenSearch has not yet retried assigning affected shards.

When this occurs, manually triggering a reroute can accelerate recovery.

When Manual Rerouting Is Appropriate

Rerouting should generally be considered after the root cause has been fixed.

Examples include:

  • A failed node has rejoined the cluster
  • Disk space has been reclaimed
  • Allocation settings have been restored
  • Memory pressure has subsided
  • Network connectivity has been repaired

A reroute should not be used as a substitute for correcting the underlying problem.

Recovery Scenarios

Common situations where rerouting helps include:

  • Cluster recovery after maintenance
  • Node replacement projects
  • Storage expansion events
  • Recovery from temporary outages
  • Post-upgrade shard recovery

For example:

Problem Fixed
      ↓
Shards Still Unassigned
      ↓
Trigger Reroute
      ↓
Allocation Resumes

Post-Maintenance Allocation Issues

Administrators often encounter lingering unassigned shards after:

  • Cluster restarts
  • Rolling upgrades
  • Node migrations
  • Configuration changes

Even after the environment becomes healthy again, some allocation decisions may need to be retried.

Retry Failed Allocations

Trigger a reroute using:

curl -k -u admin:password \
-X POST "https://localhost:9200/_cluster/reroute?retry_failed=true"

Expected response (abridged; the full response may also include a state object):

{
  "acknowledged": true
}

The retry_failed parameter instructs OpenSearch to retry shards whose allocation was abandoned after too many consecutive failures. The limit is set by index.allocation.max_retries, which defaults to 5.

This often resolves allocation delays after transient failures.
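Before retrying, it can help to confirm that some shards actually failed allocation, since retry_failed has nothing to act on otherwise. The sketch below counts unassigned shards whose reason is ALLOCATION_FAILED, using the unassigned.reason column of the Cat Shards API. The function name count_allocation_failed is an assumption for illustration.

```shell
# Hypothetical helper: reads "index shard prirep state reason" lines on stdin
# and prints how many unassigned shards carry the ALLOCATION_FAILED reason.
count_allocation_failed() {
  awk '$4 == "UNASSIGNED" && $5 == "ALLOCATION_FAILED" { n++ } END { print n + 0 }'
}

# Example call against a live cluster (credentials are placeholders):
# curl -sk -u admin:password \
#   "https://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
#   | count_allocation_failed
```

A result of 0 suggests the shards are unassigned for another reason, such as NODE_LEFT, and the relevant fix earlier in this guide is a better starting point than a reroute.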

Verify Recovery Progress

After initiating the reroute, monitor cluster health:

curl -k -u admin:password \
"https://localhost:9200/_cluster/health?pretty"

Once recovery completes, the output should include:

{
  "status": "green",
  "unassigned_shards": 0
}

You can also monitor ongoing recovery operations:

curl -k -u admin:password \
"https://localhost:9200/_cat/recovery?v"

Additionally, verify shard placement:

curl -k -u admin:password \
"https://localhost:9200/_cat/shards?v"

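Successful recovery of an unassigned shard typically follows this sequence:

UNASSIGNED
      ↓
INITIALIZING
      ↓
STARTED

A RELOCATING state appears only when an already started shard is being moved to another node, for example during rebalancing.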

Once all shards reach the STARTED state and no unassigned shards remain, cluster health should return to green and full redundancy will be restored.


Fix 7: Recover Corrupted Indices

Although less common than disk, memory, or allocation configuration issues, index corruption can also leave shards permanently unassigned.

When OpenSearch cannot read shard metadata or underlying segment files, allocation attempts may repeatedly fail even though cluster resources are otherwise healthy.

In these situations, administrators must identify the corrupted index and either restore it from backup or rebuild it.

Detect Corrupted Indexes

The first step is determining whether corruption is actually responsible for the allocation failure.

The Cluster Allocation Explain API often reveals corruption-related errors, but Indexer logs usually provide the most detailed information.

Reviewing Indexer Logs

Review Indexer logs on affected nodes:

journalctl -u wazuh-indexer -f

or

tail -f /var/log/wazuh-indexer/wazuh-cluster.log

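Look for repeated shard recovery failures involving the same index. The log file is named after the cluster, so adjust the path if your cluster name is not wazuh-cluster.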

Common Corruption Indicators

Common warning messages include:

corrupt index
failed shard recovery
corrupted segment file
checksum mismatch
translog corruption detected

Potential causes include:

  • Unexpected server shutdowns
  • Storage controller failures
  • Filesystem corruption
  • Faulty disks
  • Incomplete writes during crashes

Storage integrity problems should be investigated immediately because corruption may indicate underlying hardware failures rather than isolated software issues.

Restore from Snapshot

If a recent snapshot exists, restoration is typically the safest recovery method.

Snapshots preserve:

  • Index mappings
  • Documents
  • Settings
  • Shard metadata

Restoring from a known-good backup often resolves corruption without requiring manual reconstruction.

Snapshot Recovery Process

A typical recovery workflow looks like:

Identify Corrupted Index
          ↓
Delete Damaged Index
          ↓
Restore Snapshot
          ↓
Verify Recovery

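Example restore command, where repository, snapshot_name, and index_name are placeholders:

curl -k -u admin:password \
-X POST "https://localhost:9200/_snapshot/repository/snapshot_name/_restore" \
-H 'Content-Type: application/json' \
-d '{"indices": "index_name"}'

The exact syntax depends on your snapshot repository configuration. Limiting the restore to the affected index avoids conflicts with open indices that share a name with other indices in the snapshot.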

Validation Steps

After restoration:

  1. Verify index health.
  2. Check shard assignment.
  3. Confirm document counts.
  4. Validate dashboard searches.
  5. Review cluster health.

Run:

curl -k -u admin:password \
"https://localhost:9200/_cluster/health?pretty"

and verify that unassigned shard counts decrease.

Rebuild Problematic Indices

If no usable snapshot exists, rebuilding the index may be necessary.

This is generally considered a last-resort option.

Last-Resort Recovery Option

The recovery process usually involves:

Delete Corrupted Index
         ↓
Create New Index
         ↓
Reingest Data
         ↓
Validate Searches

Depending on the affected index, data may be regenerated from:

  • Wazuh agents
  • Log sources
  • External SIEM feeds
  • Historical archives

Potential Data-Loss Considerations

Before deleting a corrupted index, understand the consequences.

Possible outcomes include:

  • Loss of historical alerts
  • Missing vulnerability records
  • Incomplete compliance data
  • Reduced forensic visibility

For this reason, maintaining regular backups is critical.

If retention and recovery planning are part of your environment, review How to Configure Wazuh Log Retention for additional guidance.


Monitoring Cluster Recovery

After applying a fix, you should continuously monitor the cluster until all shards have been successfully assigned.

Even when the root cause has been resolved, large clusters may require time to relocate shards, perform recoveries, and rebalance workloads.

Monitoring progress helps confirm that corrective actions are actually working.

Track Cluster Health Changes

The simplest approach is to repeatedly query cluster health.

On Linux systems:

watch -n 10 'curl -sk -u admin:password "https://localhost:9200/_cluster/health?pretty"'

This refreshes cluster health every ten seconds.

Pay particular attention to:

  • status
  • active_shards
  • relocating_shards
  • initializing_shards
  • unassigned_shards

A healthy recovery generally shows:

Unassigned Shards ↓
Initializing Shards ↑ (temporarily, then back to 0)
Active Shards ↑
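For unattended monitoring, the same check can be wrapped in a polling loop. The sketch below is one possible approach and not an official tool: health_field and wait_for_green are hypothetical names, and the URL and credentials are the placeholders used throughout this guide.

```shell
# Hypothetical helper: extracts one top-level field from pretty-printed
# _cluster/health JSON read on stdin.
health_field() {
  awk -F':' -v key="$1" '{
    k = $1; gsub(/[ "\t]/, "", k)
    if (k == key) { v = $2; gsub(/[ ",\t]/, "", v); print v }
  }'
}

# Polls every 10 seconds until the cluster is green or attempts run out.
wait_for_green() {
  attempts="${1:-30}"
  while [ "$attempts" -gt 0 ]; do
    health=$(curl -sk -u admin:password "https://localhost:9200/_cluster/health?pretty")
    status=$(printf '%s\n' "$health" | health_field status)
    unassigned=$(printf '%s\n' "$health" | health_field unassigned_shards)
    echo "status=$status unassigned=$unassigned"
    if [ "$status" = "green" ]; then
      return 0
    fi
    attempts=$(( attempts - 1 ))
    sleep 10
  done
  return 1
}

# Example: wait_for_green 60 && echo "recovery complete"
```

The _cluster/health endpoint also accepts wait_for_status and timeout query parameters, which may be simpler when a single blocking call is enough.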

Monitor Shard Allocation Progress

To track individual shard recoveries:

curl -k -u admin:password \
"https://localhost:9200/_cat/recovery?v"

Example output:

index               shard stage
wazuh-alerts        0     init
wazuh-alerts        1     done

Recovery stages commonly include:

Stage          Meaning
init           Recovery started
index          Segment transfer in progress
verify_index   Integrity validation
translog       Transaction log replay
done           Recovery complete

Large clusters containing terabytes of data may require hours to fully recover.
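On a busy cluster the recovery list can be long, so it may be useful to show only the recoveries that are still in flight. This sketch assumes the Cat Recovery output was limited to the index, shard, and stage columns; the function name unfinished_recoveries is illustrative.

```shell
# Hypothetical helper: reads "index shard stage" lines on stdin and prints
# the recoveries that have not yet reached the done stage.
unfinished_recoveries() {
  awk 'NF >= 3 && $3 != "done" { print $1 " shard " $2 " is in stage " $3 }'
}

# Example call against a live cluster (credentials are placeholders):
# curl -sk -u admin:password \
#   "https://localhost:9200/_cat/recovery?h=index,shard,stage" | unfinished_recoveries
```

Empty output means every listed recovery has finished.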

Confirm Green Cluster Status

A cluster should not be considered fully healthy until it returns to green.

Run:

curl -k -u admin:password \
"https://localhost:9200/_cluster/health?pretty"

Expected output:

{
  "status": "green"
}

All Shards Assigned

Verify that no unassigned shards remain:

"unassigned_shards": 0

This confirms successful allocation.

Replica Allocation Completed

Confirm replica shards are active:

curl -k -u admin:password \
"https://localhost:9200/_cat/shards?v"

Replica entries should display:

STARTED

rather than:

UNASSIGNED

No Allocation Warnings

Finally, ensure that:

  • Allocation Explain API reports no issues
  • Cluster logs are free of allocation errors
  • Dashboard monitoring shows healthy status
  • Searches return expected results

Once these checks are complete, the recovery process can be considered successful.


Best Practices to Prevent Future Unassigned Shards

Most shard allocation issues are preventable through proactive cluster management.

The following best practices help maintain healthy Wazuh Indexer environments and reduce the likelihood of future yellow cluster states.

Size Clusters Appropriately

Many allocation issues originate from undersized infrastructure.

When planning a cluster, consider:

  • Daily log volume
  • Agent count
  • Retention requirements
  • Search workload
  • Future growth projections

Avoid running production deployments with minimal hardware resources.

Monitor Disk Usage Before Watermarks Trigger

Waiting until disks exceed watermark thresholds often results in emergency remediation efforts.

Instead:

  • Monitor storage utilization continuously
  • Configure alerts at 70–80% utilization
  • Expand capacity before critical thresholds are reached

Proactive storage management significantly reduces allocation-related incidents.

Regularly Review Cluster Health

Cluster health checks should become part of routine operational procedures.

Recommended monitoring includes:

_cluster/health
_cat/shards
_cat/allocation

Early detection allows administrators to resolve minor issues before they escalate.

Use Appropriate Replica Counts

Replica settings should match cluster architecture.

Examples:

Cluster Size   Recommended Replicas
1 Node         0
2 Nodes        1
3+ Nodes       1 or more depending on redundancy goals

Overly aggressive replica counts can create unnecessary allocation pressure.
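Because a replica cannot share a node with its primary or with another copy of the same shard, the number of nodes sets a hard upper bound on how many replicas can ever be assigned. The sketch below computes that bound; the function name max_assignable_replicas is an assumption, and the result is a ceiling rather than a recommendation.

```shell
# Hypothetical helper: prints the highest replica count that can be fully
# assigned on a cluster with the given number of data nodes.
max_assignable_replicas() {
  nodes="$1"
  if [ "$nodes" -le 1 ]; then
    echo 0
  else
    echo $(( nodes - 1 ))
  fi
}

# Example: apply a replica count to existing indices. The index pattern and
# credentials are placeholders.
# curl -sk -u admin:password \
#   -X PUT "https://localhost:9200/wazuh-alerts-*/_settings" \
#   -H 'Content-Type: application/json' \
#   -d '{"index": {"number_of_replicas": 0}}'
```

Any replica count above this bound guarantees a yellow status, no matter how healthy the nodes are.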

Maintain Consistent Node Versions

Mixed-version clusters frequently create allocation complications.

Best practices include:

  • Upgrade all nodes promptly
  • Follow supported upgrade paths
  • Avoid prolonged mixed-version deployments
  • Validate compatibility before upgrades

This reduces the risk of shard recovery failures after maintenance.

Configure Snapshot Backups

Snapshots are one of the most effective protections against index corruption and data loss.

A comprehensive backup strategy should include:

  • Scheduled snapshots
  • Offsite storage
  • Recovery testing
  • Retention policies

OpenSearch experts consistently recommend snapshots as the primary recovery mechanism for catastrophic index failures.

Monitor JVM Heap and Garbage Collection

Heap pressure is often an early warning sign of future allocation problems.

Monitor:

  • Heap utilization
  • GC frequency
  • GC duration
  • Node memory consumption

If memory usage trends upward over time, address the issue before allocation failures occur.

Plan Capacity Growth Ahead of Time

Successful Wazuh deployments rarely remain static.

Agent counts increase, log volumes grow, and retention requirements expand.

Capacity planning should include:

  • Storage growth forecasting
  • Heap sizing reviews
  • Node expansion planning
  • Performance trend analysis

Organizations that regularly evaluate future resource requirements tend to experience fewer allocation-related outages than those that react only after problems appear.

By combining proactive monitoring, proper sizing, regular backups, and disciplined maintenance procedures, administrators can dramatically reduce the likelihood of encountering yellow cluster health and unassigned shard issues in the future.


Frequently Asked Questions (FAQ)

Question: What causes yellow cluster status in Wazuh Indexer?

A yellow cluster status occurs when all primary shards are available but one or more replica shards remain unassigned.

The most common causes include:

  • Single-node deployments with replica shards enabled
  • Offline Indexer nodes
  • Disk watermark threshold violations
  • Disabled shard allocation
  • JVM memory pressure
  • Network communication issues
  • Corrupted indices
  • Version mismatches between cluster nodes

Whatever the trigger, every yellow status traces back to at least one replica shard that could not be allocated, so the goal of troubleshooting is to find out which of these conditions is blocking it.

Question: Is a yellow cluster status dangerous?

A yellow cluster status is not immediately critical because all primary shards remain available and searchable.

However, it should not be ignored.

A yellow cluster indicates reduced fault tolerance. If a node containing primary shards fails before replica shards are assigned, data availability may be affected.

While less severe than a red cluster, yellow status should still be investigated and resolved as soon as practical.

Question: Can Wazuh function normally with a yellow cluster?

In many cases, yes.

The Wazuh Manager, Indexer, and Dashboard generally continue operating because primary shards remain available.

However, administrators may experience:

  • Reduced redundancy
  • Slower search performance
  • Longer recovery times during failures
  • Increased risk of data unavailability if another node fails

The cluster may appear healthy to end users while still carrying significant operational risk.

Question: Why are replica shards unassigned in a single-node deployment?

OpenSearch does not allow a replica shard to be stored on the same node as its primary shard.

For example:

Node-1
 ├── Primary Shard
 └── Replica Shard ❌

Because there is no second node available, replica allocation becomes impossible.

As a result:

  • Primary shards remain active
  • Replica shards remain unassigned
  • Cluster health becomes yellow

For standalone Wazuh deployments, setting the replica count to zero is typically the correct solution.

Question: How do I identify which shards are unassigned?

The easiest method is using the Cat Shards API:

curl -k -u admin:password \
"https://localhost:9200/_cat/shards?v"

You can also display only relevant fields:

curl -k -u admin:password \
"https://localhost:9200/_cat/shards?h=index,shard,prirep,state,node"

Look for entries with:

UNASSIGNED

The output will identify:

  • Affected indices
  • Shard numbers
  • Primary or replica designation
  • Current allocation status
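When many shards are affected, a per-index summary is often easier to read than the raw list. The sketch below assumes the five-column output from the command above (index, shard, prirep, state, node); the function name unassigned_per_index is hypothetical.

```shell
# Hypothetical helper: reads "index shard prirep state node" lines on stdin
# and prints "<count> <index>" for each index with unassigned shards,
# highest count first.
unassigned_per_index() {
  awk '$4 == "UNASSIGNED" { count[$1]++ } END { for (i in count) print count[i], i }' \
    | sort -rn
}

# Example call against a live cluster (credentials are placeholders):
# curl -sk -u admin:password \
#   "https://localhost:9200/_cat/shards?h=index,shard,prirep,state,node" \
#   | unassigned_per_index
```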

Question: What is the fastest way to return a cluster to green?

The answer depends entirely on the root cause.

Examples include:

Cause                  Fix
Single-node replicas   Set replicas to zero
Offline node           Restore node connectivity
Disk watermarks        Free storage space
Disabled allocation    Re-enable shard allocation
JVM pressure           Resolve memory bottlenecks

The fastest way to identify the correct fix is usually the Cluster Allocation Explain API:

curl -k -u admin:password \
-X GET "https://localhost:9200/_cluster/allocation/explain?pretty"

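This API typically reveals the exact reason allocation is failing. Called without a request body, it explains the first unassigned shard it finds, and it returns an error if no shards are unassigned.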
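The explain response can be lengthy because it lists a decision for every node. As a rough sketch, the helper below tallies the decider names found in the pretty-printed output so the dominant blocker stands out. The function name blocking_deciders is an assumption, and the approach relies on each "decider" field appearing on its own line.

```shell
# Hypothetical helper: reads pretty-printed allocation explain JSON on stdin
# and prints "<count> <decider>" for each decider named in the response,
# most frequent first.
blocking_deciders() {
  awk -F'"' '$2 == "decider" { count[$4]++ } END { for (d in count) print count[d], d }' \
    | sort -rn
}

# Example call against a live cluster (credentials are placeholders):
# curl -sk -u admin:password \
#   "https://localhost:9200/_cluster/allocation/explain?pretty" | blocking_deciders
```

For instance, a tally dominated by same_shard points toward a replica count that is too high for the number of nodes, while disk_threshold points toward the watermark fixes.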

Question: Can low disk space cause unassigned shards?

Yes.

Low disk space is one of the most common causes of unassigned shards in production environments.

When configured disk watermarks are exceeded, OpenSearch may:

  • Block new allocations
  • Relocate existing shards
  • Mark indices read-only
  • Prevent replica recovery

Administrators should monitor storage utilization long before disk watermarks are reached.

Question: Should I manually reroute unassigned shards?

Manual rerouting can be useful, but only after the root cause has been fixed.

For example:

  • A node has returned to service
  • Disk space has been freed
  • Allocation settings have been corrected

In these cases:

curl -k -u admin:password \
-X POST "https://localhost:9200/_cluster/reroute?retry_failed=true"

may accelerate recovery.

However, rerouting should not be used as a substitute for addressing the underlying allocation problem.

Question: What is the difference between yellow and red cluster status?

The difference is based on which shard types are unavailable.

Status   Meaning
Green    All shards assigned
Yellow   Replica shards unassigned
Red      Primary shards unassigned

A yellow cluster still has access to all primary data.

A red cluster indicates that some primary shards are unavailable, which may result in missing or inaccessible data.

Red cluster status generally requires more urgent intervention.

Question: Will restarting Wazuh Indexer fix unassigned shards?

Sometimes, but not always.

Restarting the Indexer may help if the issue involves:

  • Temporary node failures
  • Hung allocation processes
  • Short-lived communication problems

However, restarting will not fix:

  • Replica allocation conflicts
  • Disk watermark violations
  • Disabled allocation settings
  • Corrupted indices
  • Improper replica configurations

Before restarting services, administrators should determine the actual allocation failure reason using the Cluster Allocation Explain API.


Conclusion

A yellow cluster status in Wazuh Indexer indicates that replica shards cannot be assigned, leaving the cluster operational but without full redundancy.

While the system typically continues processing alerts and serving dashboard queries, unresolved unassigned shards increase operational risk and reduce the cluster’s ability to withstand future failures.

The most common causes include:

  • Single-node deployments with replica shards enabled
  • Offline or unreachable Indexer nodes
  • Disk watermark threshold violations
  • Disabled shard allocation settings
  • JVM memory pressure
  • Corrupted indices
  • Cluster version inconsistencies

The key to resolving yellow cluster health efficiently is identifying the exact reason OpenSearch is refusing shard allocation rather than relying on trial-and-error troubleshooting.

A recommended troubleshooting workflow is:

Check Cluster Health
         ↓
Identify Unassigned Shards
         ↓
Run Allocation Explain API
         ↓
Determine Root Cause
         ↓
Apply Targeted Fix
         ↓
Monitor Recovery
         ↓
Verify Green Status

By using tools such as:

  • _cluster/health
  • _cat/shards
  • _cat/allocation
  • _cluster/allocation/explain

administrators can quickly diagnose and resolve most shard allocation problems.

Long-term stability depends on proactive cluster management.

Proper sizing, adequate storage capacity, regular health monitoring, consistent node versions, JVM tuning, and reliable snapshot backups all play critical roles in preventing future allocation failures.

If you’re building or maintaining a production Wazuh deployment, consider reviewing How to Build a Wazuh Indexer Cluster, How to Tune OpenSearch Heap Size to Stop Wazuh High Memory Crashes, and How to Configure Wazuh Log Retention to further improve Indexer reliability and resilience.

With the right monitoring practices and troubleshooting methodology, you can keep your Wazuh Indexer cluster healthy, maintain full shard redundancy, and prevent yellow cluster status issues from disrupting security operations.
