
Common issues of the cluster log component

1. Issue Phenomena

On the ELK Log page of the cluster Application Center in the UK8S console, after enabling the cluster log plugin and running it for a while, you may encounter the following issues:

  1. The ELK log search page shows no recent logs

  2. The ELK log component status page shows 0 logs collected in the past 10 minutes

2. Issue Troubleshooting Reference

The ELK log is deployed in the cluster default namespace by default. If deployed in a custom namespace, replace default with the custom namespace when executing commands.

step 1. View the logstash component logs. Log into a cluster master node and execute the command:

kubectl logs -f uk8s-elk-release-logstash-0 -n default

You can see the following information being printed continuously:

[2021-11-16T09:55:31,753][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"index [uk8s-vidxqjoo-kube-system-2021.11.16] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2021-11-16T09:55:31,753][INFO ][logstash.outputs.elasticsearch] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>1}

step 2. Check the storage volume usage of the ES component. Log into a cluster master node and execute the command:

for pod in multi-master-0 multi-master-1 multi-master-2; do
  kubectl exec -t -i $pod -n default -- sh -c 'df -h | grep /usr/share/elasticsearch/data'
done

You can see the disk usage is as high as 96%:

/dev/vdb  20G  19G  933M  96%  /usr/share/elasticsearch/data
/dev/vdb  20G  19G  939M  96%  /usr/share/elasticsearch/data
/dev/vdc  20G  19G  933M  96%  /usr/share/elasticsearch/data

step 3. Query the index status via the ES API. Log into a cluster master node and execute the commands:

ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl http://${ES_CLUSTER_IP}:9200/_all/_settings?pretty

You can see the returned information includes "read_only_allow_delete": "true". From here, you can determine the cause of the failure: although the disk is not completely full, its usage has triggered the ES protection mechanism:

  • ES cluster.routing.allocation.disk.watermark.low controls the low watermark for disk usage. The default value is 85%. If it is exceeded, ES no longer allocates shards to that node;
  • ES cluster.routing.allocation.disk.watermark.high controls the high watermark. The default value is 90%. If it is exceeded, ES attempts to relocate shards to other nodes;
  • ES cluster.routing.allocation.disk.watermark.flood_stage controls the flood stage watermark. The default value is 95%. If it is exceeded, the ES cluster forcibly marks all indexes as read-only, new logs fail to be collected, and the latest logs cannot be queried. To recover, you must manually set index.blocks.read_only_allow_delete back to false. The thresholds currently in effect can be checked as shown below.
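
A quick way to confirm which thresholds are currently in effect is to query the cluster settings with defaults included (a minimal sketch, reusing the multi-master service and the ES_CLUSTER_IP lookup from step 3):

ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl "http://${ES_CLUSTER_IP}:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep watermark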

3. Solutions

3.1 ES PVC Expansion

The ELK log is deployed in the cluster default namespace by default. If deployed in a custom namespace, please replace default with the custom namespace when executing commands.

Step 1. Log into a master node and execute kubectl get pvc -n default to view the PVCs. The following PVCs are used by ES:

multi-master-multi-master-0
multi-master-multi-master-1
multi-master-multi-master-2

Execute kubectl edit pvc {pvc-name} -n default, increase the value of spec.resources.requests.storage, then save and exit. Within about a minute, the PV, the PVC, and the file system inside the container will complete the online expansion; a non-interactive alternative is sketched below. For more detailed operations, refer to UDisk Dynamic Expansion.
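
If you prefer to script the change instead of editing interactively, the same expansion can be done with kubectl patch (a sketch; the 40Gi target below is only an example, and the command would be repeated for each of the three PVCs):

kubectl patch pvc multi-master-multi-master-0 -n default --type merge -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'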

After expansion, confirm the status of PV/PVC: kubectl get pv | grep multi-master && kubectl get pvc | grep multi-master

Step 2. Release the ES index read-only mode

ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl -H "Content-Type: application/json" -XPUT http://${ES_CLUSTER_IP}:9200/_all/_settings -d '{ "index.blocks.read_only_allow_delete": false }'

Step 3. Confirm the ES cluster status

ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl http://${ES_CLUSTER_IP}:9200/_cat/allocation?pretty
curl http://${ES_CLUSTER_IP}:9200/_cat/health
curl http://${ES_CLUSTER_IP}:9200/_all/_settings | jq

3.2 Adjust ES Configurations

If the current ES PVC capacity is very large, then under ES's default configuration even 90% usage still leaves a lot of free space. In that case you can raise ES's watermark thresholds and release the indexes from read-only mode, recovering the ES cluster to its normal state.

ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl -H "Content-Type: application/json" -XPUT http://${ES_CLUSTER_IP}:9200/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
    "cluster.info.update.interval": "1m"
  }
}'
curl -H "Content-Type: application/json" -XPUT http://${ES_CLUSTER_IP}:9200/_all/_settings -d '{ "index.blocks.read_only_allow_delete": false }'
  • cluster.routing.allocation.disk.watermark.low, controls the low watermark of disk usage. It defaults to 85%, which means Elasticsearch will not allocate shards to nodes where the storage space usage exceeds 85%. It can also be set to an absolute byte value (like 500MB) to prevent Elasticsearch from allocating shards when the available space is less than the specified amount. This setting does not affect the primary shards of newly created indices and in particular, any shards that have never been allocated before.
  • cluster.routing.allocation.disk.watermark.high, controls the high watermark. It defaults to 90%, which means Elasticsearch will attempt to relocate shards from nodes where the storage usage is more than 90%. It can also be set to an absolute byte value (like the low watermark) to relocate shards away from a node where the available space is less than a specified amount. This setting affects the allocation of all shards, whether they were previously allocated or not.
  • cluster.routing.allocation.disk.watermark.flood_stage, controls the flood stage watermark. It defaults to 95%. Once an ES node’s storage usage exceeds the flood stage, Elasticsearch applies a read-only block to every index (index.blocks.read_only_allow_delete: true). This is the last resort to prevent the node from running out of storage space. Once there is sufficient space for indexing operations to continue, you must manually set index.blocks.read_only_allow_delete back to false to remove the index’s read-only attribute (a verification query for the persistent settings is sketched below).
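
After applying the command above, the persistent settings can be double-checked with a plain cluster settings query (reusing the ES_CLUSTER_IP variable defined earlier):

curl http://${ES_CLUSTER_IP}:9200/_cluster/settings?pretty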

Reference: Elasticsearch Official Documentation

4. Memory Adjustment Guide

The ELK log is deployed in the default namespace by default. If it is deployed in a custom namespace, replace default with that namespace in the commands below.

Preconditions

This guide applies to scenarios where the log service is self-built based on UK8S. Ensure the following conditions are met:

  • The service has been enabled in UK8S Console > Details > Application Center > Log ELK with the installation type set to New.
  • Verify the service status with the command:
kubectl get sts multi-master -n default

If the command output shows no errors, the service is running normally; the ES pods can be further checked as shown below.
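
A quick readiness check of the ES pods themselves (a sketch; multi-master is the StatefulSet name used above):

kubectl get pods -n default | grep multi-master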

Modify Memory Configuration

kubectl patch sts multi-master -n default --patch '
spec:
  template:
    spec:
      containers:
      - name: elasticsearch
        env:
        - name: ES_JAVA_OPTS
          value: "-Xmx4g -Xms4g"
        resources:
          limits:
            memory: "8Gi"
'
  • -Xmx4g -Xms4g: Set both the maximum and initial heap memory to 4GB. Replace with values like 2g or 3g as needed. Ensure the maximum and initial values are identical to allow the JVM to allocate a fixed heap size at startup, avoiding GC jitter caused by dynamic resizing.
  • limits.memory: "8Gi": After adjustment, the JVM will reserve heap memory equal to the configured size regardless of load. Thus, the container’s maximum available memory should be set to 2-3 times the heap size to ensure sufficient space for the OS, thread stacks, caches, and off-heap memory, preventing OOM (Out of Memory) issues (a smaller-heap variant following this rule is sketched after this list).
  • No need to modify requests unless you also want to adjust resource reservations for better scheduling alignment.
  • Note: The heap memory size in ES_JAVA_OPTS must be smaller than limits.memory, and limits.memory must be smaller than the node (host) total memory; otherwise, OOM or scheduling failures may occur.
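
For reference, a smaller-footprint patch that follows the same sizing rule (the 2g heap and 4Gi limit here are illustrative values, not defaults):

kubectl patch sts multi-master -n default --patch '
spec:
  template:
    spec:
      containers:
      - name: elasticsearch
        env:
        - name: ES_JAVA_OPTS
          value: "-Xmx2g -Xms2g"
        resources:
          limits:
            memory: "4Gi"
'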

Restart Elasticsearch

kubectl rollout restart sts multi-master -n default
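
To wait for the rolling restart to finish before validating, you can watch the rollout status:

kubectl rollout status sts multi-master -n default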

Validation Test

kubectl exec -it multi-master-0 -n default -- curl -s http://multi-master:9200/_nodes/stats/jvm?pretty | grep -E '"heap_used_in_bytes"|"heap_max_in_bytes"|"heap_used_percent"'

Sample Output

"heap_used_in_bytes" : 1731749680, "heap_used_percent" : 40, "heap_max_in_bytes" : 4294967296, "heap_used_in_bytes" : 1786740736, "heap_used_percent" : 41, "heap_max_in_bytes" : 4294967296, "heap_used_in_bytes" : 2153311232, "heap_used_percent" : 50, "heap_max_in_bytes" : 4294967296,
  • heap_used_in_bytes: Current heap memory usage (in bytes).
  • heap_max_in_bytes: Maximum heap memory (in bytes). 4294967296 → 4GB (-Xmx4g).
  • heap_used_percent: Heap memory usage percentage.
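
If jq is available on the master node, the same per-node heap figures can also be pulled directly from the ES API (a sketch, reusing the ES_CLUSTER_IP lookup from section 2):

ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl -s http://${ES_CLUSTER_IP}:9200/_nodes/stats/jvm | jq '.nodes[].jvm.mem | {heap_used_in_bytes, heap_max_in_bytes, heap_used_percent}'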

This guide has detailed the steps to adjust the heap memory of an Elasticsearch cluster, from modifying configurations to validating effectiveness. Proper memory configuration significantly improves ES performance and stability, ensuring efficient operation with large datasets. Remember to follow memory size rules and consider overall cluster resource allocation when adjusting configurations.