Notes on a Bug in Redis Sentinel's client-reconfig-script Handling

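For a while I have been setting up Redis master-replica environments in our self-hosted data center. Along the way I learned that after a high-availability failover, Sentinel invokes the script configured through the client-reconfig-script parameter.
However, I ran into a bug that has existed since version 2.8. I submitted a PR upstream and it has been merged: https://github.com/antirez/redis/pull/7113
This post records what happened.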

Bug scenario

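I killed the master instance manually to simulate a crash. In some cases Sentinel did perform the failover (the replication state and the logs both confirmed it), yet the configured script was never invoked. It did not reproduce every time, but once a machine was affected, the Sentinel on that machine stopped invoking the script altogether.
Restarting the Sentinel node on that machine restored normal behavior. (Restarting fixes everything.)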

Environment

I wrote my own script, based on the ones I found online. The logic is roughly:

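If the new master IP equals the local IP, perform the domain-name switch, metadata updates and so on, then exit 0;
If the new master IP is not the local IP, do nothing and exit 1;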

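In theory this matches the logic of 99% of the scripts posted online, so it should have been fine.
Indeed, for the first few failovers, or when several failovers were triggered within a short time, the script ran normally.
But in the following situations it was not invoked.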

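After triggering 2 failovers, I slacked off for half an hour and then triggered another one. This time the script did not run.
I triggered about 10 failovers in a row and all of them worked. After roughly 5 minutes in total, the script stopped running.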

Troubleshooting

A problem in the script?

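First, since this was my first time having Sentinel invoke a script, I suspected that my script logic was wrong. So I added log output before every operation in the script, even on the very first line. Nothing was logged: the script was not being invoked at all!
Moreover, when it was a Sentinel on another machine that had to invoke the script, the call could still succeed.
So the script itself was ruled out.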

A script permission problem?

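After searching Google and Stack Overflow and asking in a Redis chat group, I learned that a script permission problem can make the invocation fail.
So I checked the script's state after each invocation and found that nothing had changed. The machines were also new, and I was the only one working on them. A permission problem therefore seemed unlikely.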

Finding the common factor

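After many failover tests, every Sentinel ended up in the same state: it completed the failover but no longer invoked the script.
At that point I inspected the state of all Sentinel nodes and found one thing in common: sentinel_running_scripts was 16 on every one of them.
This field reports the number of scripts currently running.
Further checks showed: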

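When the value is below 16, the script is invoked normally.
The value increases periodically.
After a single failover, the value climbs to 9 and then stops increasing.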

Source code analysis

There is no problem that reading the source code cannot solve.

Analyzing the failover code path in Sentinel shows that when it invokes the client-reconfig-script script, it treats the script's exit code as follows.

0: the script succeeded. No retry.
1: the script failed. Sentinel retries it, up to 10 runs in total.
Greater than 1: the script failed. No retry.
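The retry rule above can be written as a small predicate. This is only a sketch that mirrors the condition in sentinelCollectTerminatedScripts shown later in this post; the names model_should_retry and MODEL_SCRIPT_MAX_RETRY are made up for illustration and are not part of Redis.

```c
#include <assert.h>
#include <stdbool.h>

/* Stands in for SENTINEL_SCRIPT_MAX_RETRY in sentinel.c (10). */
#define MODEL_SCRIPT_MAX_RETRY 10

/* Hypothetical helper: returns true when Sentinel would keep the job in the
 * queue for another run, false when it would drop the job.
 * retry_num is the number of runs so far, including the one that just ended. */
bool model_should_retry(int exitcode, int bysignal, int retry_num) {
    assert(retry_num >= 1 && retry_num <= MODEL_SCRIPT_MAX_RETRY);
    return (bysignal || exitcode == 1) &&
           retry_num != MODEL_SCRIPT_MAX_RETRY;
}
```

For example, model_should_retry(1, 0, 3) is true, while an exit code of 2 or a 10th run with exit code 1 both yield false.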

Where the bug occurs:

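When running_scripts >= SENTINEL_SCRIPT_MAX_RUNNING (16), Sentinel no longer enters the logic that invokes scripts.
Each time a script is invoked, running_scripts++ is executed.
Each retry of a script also executes running_scripts++.
Only when the script reaches the maximum number of retries (10), or returns a value other than 1, is running_scripts-- executed, and only once.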

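As you can see, on a machine that is not the new master the script always exits with 1, so it runs 10 times. That is 10 increments and a single decrement, so running_scripts ends up at +10-1=9.
The second time this happens, the counter starts at 9 and hits the limit of 16 after 7 more runs. From then on the script-invocation logic is never executed again.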
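The arithmetic can be checked with a toy model of the counter. This is a simplified sketch, not Sentinel code: it assumes a single job whose script always exits with the same code, and it ignores retry delays and signals. The names model_run_job_buggy, MODEL_MAX_RUNNING and MODEL_MAX_RETRY are invented for this illustration.

```c
#include <assert.h>

#define MODEL_MAX_RUNNING 16 /* stands in for SENTINEL_SCRIPT_MAX_RUNNING */
#define MODEL_MAX_RETRY   10 /* stands in for SENTINEL_SCRIPT_MAX_RETRY */

/* Simulates one queued script job that always exits with `exitcode`,
 * starting from the counter value `running`. Returns the counter value once
 * the job has been removed from the queue or can no longer be started. */
int model_run_job_buggy(int running, int exitcode) {
    assert(running >= 0);
    int retry_num = 0;
    for (;;) {
        /* At the limit, the job is never forked again. */
        if (running >= MODEL_MAX_RUNNING) return running;
        running++;   /* parent branch after fork() */
        retry_num++;
        if (exitcode == 1 && retry_num != MODEL_MAX_RETRY) {
            continue; /* rescheduled, and the counter is NOT decremented */
        }
        running--;   /* job removed from the queue */
        return running;
    }
}
```

Starting from 0, a job that always exits with 1 leaves the counter at 9. Feeding that 9 into a second such job returns 16, after which even a script that would exit with 0 is never started.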

/* Run pending scripts if we are not already at max number of running
 * scripts. */
void sentinelRunPendingScripts(void) {
    listNode *ln;
    listIter li;
    mstime_t now = mstime();

    /* Find jobs that are not running and run them, from the top to the
     * tail of the queue, so we run older jobs first. */
    // li is a forward iterator over scripts_queue
    listRewind(sentinel.scripts_queue,&li);
    // Walk scripts_queue as long as running_scripts is below the limit
    while (sentinel.running_scripts < SENTINEL_SCRIPT_MAX_RUNNING &&
           (ln = listNext(&li)) != NULL)
    {
        sentinelScriptJob *sj = ln->value;
        pid_t pid;

        /* Skip if already running. */
        // Skip jobs that are currently running
        if (sj->flags & SENTINEL_SCRIPT_RUNNING) continue;

        /* Skip if it's a retry, but not enough time has elapsed. */
        // Not yet time to run this job; skip it for now
        if (sj->start_time && sj->start_time > now) continue;

        sj->flags |= SENTINEL_SCRIPT_RUNNING;
        sj->start_time = mstime();
        sj->retry_num++;
        // Fork a child process
        pid = fork();

        // fork() failed
        if (pid == -1) {
            /* Parent (fork error).
             * We report fork errors as signal 99, in order to unify the
             * reporting with other kind of errors. */
            sentinelEvent(LL_WARNING,"-script-error",NULL,
                          "%s %d %d", sj->argv[0], 99, 0);
            sj->flags &= ~SENTINEL_SCRIPT_RUNNING;
            sj->pid = 0;
        } else if (pid == 0) {
            /* Child */
            execve(sj->argv[0],sj->argv,environ);
            /* If we are here an error occurred. */
            _exit(2); /* Don't retry execution. */
        } else {
            sentinel.running_scripts++;
            sj->pid = pid;
            sentinelEvent(LL_DEBUG,"+script-child",NULL,"%ld",(long)pid);
        }
    }
}


/* Check for scripts that terminated, and remove them from the queue if the
 * script terminated successfully. If instead the script was terminated by
 * a signal, or returned exit code "1", it is scheduled to run again if
 * the max number of retries did not already elapsed. */
void sentinelCollectTerminatedScripts(void) {
    int statloc;
    pid_t pid;

    while ((pid = wait3(&statloc,WNOHANG,NULL)) > 0) {
        int exitcode = WEXITSTATUS(statloc);
        int bysignal = 0;
        listNode *ln;
        sentinelScriptJob *sj;

        if (WIFSIGNALED(statloc)) bysignal = WTERMSIG(statloc);
        sentinelEvent(LL_DEBUG,"-script-child",NULL,"%ld %d %d",
            (long)pid, exitcode, bysignal);

        ln = sentinelGetScriptListNodeByPid(pid);
        if (ln == NULL) {
            serverLog(LL_WARNING,"wait3() returned a pid (%ld) we can't find in our scripts execution queue!", (long)pid);
            continue;
        }
        sj = ln->value;

        /* If the script was terminated by a signal or returns an
         * exit code of "1" (that means: please retry), we reschedule it
         * if the max number of retries is not already reached. */
        // If the script was killed by a signal or exited with 1, keep it in the queue and push back its start time
        if ((bysignal || exitcode == 1) &&
            sj->retry_num != SENTINEL_SCRIPT_MAX_RETRY)
        {
            sj->flags &= ~SENTINEL_SCRIPT_RUNNING;
            sj->pid = 0;
            sj->start_time = mstime() +
                             sentinelScriptRetryDelay(sj->retry_num);
        } else {
            /* Otherwise let's remove the script, but log the event if the
             * execution did not terminated in the best of the ways. */
            // Killed by a signal or non-zero exit: either the retry limit was reached or the exit code is greater than 1, so log the error
            if (bysignal || exitcode != 0) {
                sentinelEvent(LL_WARNING,"-script-error",NULL,
                              "%s %d %d", sj->argv[0], bysignal, exitcode);
            }
            // Reached only on success, on an exit code greater than 1, or after the 10th run.
            listDelNode(sentinel.scripts_queue,ln);
            sentinelReleaseScriptJob(sj);
            sentinel.running_scripts--;
        }
    }
}

Solutions

Temporary workaround

Do not let the script exit with 1. Use exit 2 to signal failure, which tells Sentinel not to retry the script.
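As a rough illustration, the script's decision under this workaround could look like the function below. It is a sketch only: reconfig_exit_code is a hypothetical name, the two IP strings are assumed to come from the script's arguments and the local machine, and the real domain switch and metadata update are left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: picks the exit code for the reconfig script.
 * Returns 0 when this machine is the new master, and 2 otherwise so that
 * Sentinel does not retry the script. */
int reconfig_exit_code(const char *new_master_ip, const char *local_ip) {
    assert(new_master_ip != NULL && local_ip != NULL);
    if (strcmp(new_master_ip, local_ip) == 0) {
        /* The domain switch and metadata update would happen here. */
        return 0;
    }
    return 2; /* not the new master: fail without asking for a retry */
}
```

One side effect is visible in the source above: any non-zero exit code is still reported as a -script-error warning event, so the logs will show an error on the machines that are not the new master.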

Source fix

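Move the sentinel.running_scripts--; statement in the source above out of the else branch. The counter must be decremented even when the script exits with 1.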
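The effect of the fix can be sketched with a toy model of the counter in which the decrement happens every time a child is reaped. This is not the actual patch, only a simplified simulation of one job with a fixed exit code; model_run_job_fixed and the two constants are names invented for this illustration.

```c
#include <assert.h>

#define MODEL_MAX_RUNNING 16 /* stands in for SENTINEL_SCRIPT_MAX_RUNNING */
#define MODEL_MAX_RETRY   10 /* stands in for SENTINEL_SCRIPT_MAX_RETRY */

/* Simulates one queued script job that always exits with `exitcode`,
 * with the decrement applied whenever a child terminates. Returns the
 * counter value once the job has left the queue or cannot be started. */
int model_run_job_fixed(int running, int exitcode) {
    assert(running >= 0);
    int retry_num = 0;
    for (;;) {
        if (running >= MODEL_MAX_RUNNING) return running;
        running++;   /* child forked */
        retry_num++;
        running--;   /* child reaped: decrement regardless of exit code */
        if (exitcode == 1 && retry_num != MODEL_MAX_RETRY) {
            continue; /* rescheduled for another run */
        }
        return running;
    }
}
```

In this model a script that always exits with 1 still runs 10 times, but the counter returns to its starting value afterwards, so repeated failovers no longer push it toward 16.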