Experience with Quartz Scheduler

1. Introduction: Why was distributed scheduling necessary?

In Vizend, each service previously handled its own scheduling. This structure works fine in simple environments, but as services scale out, several problems arise: the same task runs in duplicate across multiple instances, schedules are lost unrecoverably during failures, and operations become complex because scheduling logic is scattered across services.

To address these issues, we developed a scheduling microservice responsible for executing time-based tasks. When other services send requests such as "publish an event at a specific time" or "call an API at a specific time," it performs the action at the designated time, either by publishing an event to a Kafka topic or by calling an external HTTP API. It supports both recurring schedules (seconds/minutes/hours/days/weeks/months/years) and one-time schedules. For example, it can "publish a billing notification event every day at 9 AM" or "call the settlement API once at midnight on December 31, 2024." It also aims to support dynamic schedule registration/modification/deletion through a REST API, to guarantee a single execution without duplication in distributed environments, and to recover automatically from failures.

2. Technology Choice: Why Quartz JDBC Clustering?

I first considered simple scheduling with Spring's @Scheduled annotation, but it had limitations. I reviewed alternatives for three reasons: duplicate execution, the inability to recover from failures, and the inability to manage schedules dynamically. The candidates I found for implementing distributed scheduling were ShedLock, db-scheduler, Spring Batch, and Quartz Scheduler. ShedLock is based on simple locking, which makes complex schedule expressions difficult. db-scheduler is lightweight but has limited trigger expressions, and Spring Batch is batch-oriented, which makes it unsuitable for real-time scheduling.

The reasons I ultimately chose the JDBC Clustering mode of Quartz Scheduler are as follows.

  1. JDBC-based cluster coordination: It is possible to coordinate between cluster nodes using only the RDBMS already in use, without a separate ZooKeeper or Redis.

  2. Dynamic Schedule Management: You can register, modify, and delete jobs at runtime, allowing for schedule management via REST API.

  3. Rich trigger types: Supports various repetition patterns natively, such as CronTrigger, SimpleTrigger, and CalendarIntervalTrigger.

  4. Misfire Handling: You can finely control the policy (immediate execution, ignore, reschedule, etc.) for missed tasks due to network delays or node failures.

  5. Spring Boot Integration: Auto-configuration is provided through spring-boot-starter-quartz, which makes the initial setup relatively simple.

We used Quartz JDBC Clustering, which enables safe scheduling in a distributed environment without additional infrastructure.

3. Design: Separation of roles between Application and Quartz

The core of the design is a clear separation of responsibilities between the Application and Quartz. When a client submits schedule information, the Application stores it in the DB and publishes a domain event. A handler then receives the domain event, creates the Job and Trigger from the schedule information, and registers them with the Scheduler. From that point on, Quartz fires the registered schedules through its own internal mechanisms, while the Application manages the execution context and logging.

In summary, the Application only manages schedule definitions and logs, while Quartz is responsible for executing the schedules.

4. Implementation Core

4.1 Cluster Configuration

The core of Quartz clustering (distributed scheduling) is the configuration in application.yml. The scheduling service instances running in multiple pods share the same DB and coordinate task execution through DB locks.

spring:
  quartz:
    job-store-type: jdbc
    jdbc:
      initialize-schema: never    # when the DB schema is managed separately (e.g., Flyway)
    properties:
      org.quartz:
        scheduler:
          instanceName: SchedulerName
          instanceId: AUTO          # auto-generate a unique ID per node
        jobStore:
          class: org.springframework.scheduling.quartz.LocalDataSourceJobStore
          isClustered: true          # enable cluster mode
          tablePrefix: qrtz_         # table prefix (watch for vendor-specific case sensitivity)
          clusterCheckinInterval: 10000  # heartbeat every 10 seconds
          misfireThreshold: 60000    # tolerate delays of up to 60 seconds
        threadPool:
          class: org.quartz.simpl.SimpleThreadPool
          threadCount: 10            # up to 10 concurrent executions per node
          threadPriority: 5

If we look at the meaning of each configuration item:

  • isClustered: Activates or deactivates cluster mode. When this setting is enabled, Quartz acquires distributed locks through the DB's QRTZ_LOCKS table and periodically records a heartbeat in the QRTZ_SCHEDULER_STATE table.

  • instanceId: AUTO: A unique instance ID is automatically generated for each Pod when it starts. In a K8s environment, a different ID is assigned to each Pod, allowing each node to be identified within the cluster.

  • clusterCheckinInterval: 10000: Heartbeats are recorded in the DB every 10 seconds. Other nodes check this heartbeat to determine whether a specific node is still alive. If a node fails to check in for noticeably longer than this interval, it is considered dead, and the work held by that node is transferred to another node.

  • misfireThreshold: 60000: A delay of up to 60 seconds from the scheduled execution time is considered normal. A trigger that is later than this is treated as misfired, and its misfire policy is applied.

  • threadCount: 10: Up to 10 jobs can be run simultaneously on a single node. The maximum concurrent throughput for the entire cluster is determined by (number of nodes × 10).
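The misfireThreshold item above can be illustrated with a minimal sketch. This is not Quartz's internal code; the class and method names are hypothetical, and it only shows the comparison that the setting implies: a fire time counts as missed once it is later than the threshold.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: mirrors the idea behind misfireThreshold, not Quartz internals.
class MisfireCheck {

    // Returns true when `now` is later than the scheduled fire time
    // by more than the tolerated threshold.
    static boolean isMisfired(Instant scheduledFireTime, Instant now, Duration threshold) {
        return Duration.between(scheduledFireTime, now).compareTo(threshold) > 0;
    }
}
```

With the 60000 ms threshold from the configuration, a job that starts 30 seconds late still counts as a normal execution, while one that is 90 seconds late is handed to the trigger's misfire policy.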

For clustering to work, the Quartz-specific tables must exist in the DB.

(Reference - ddl) https://github.com/quartznet/quartznet/tree/main/database/tables

Overall, about 11 Quartz tables are needed, and I have configured them to be automatically created during service deployment with Flyway migration scripts.

4.2 Trigger Strategy

When registering a job (schedule) in Quartz, SimpleTrigger alone cannot cover repetition settings that range from seconds to years. Cron expressions have limits as well: "every 3 days" can only be approximated with fixed dates (the 1st, 4th, 7th, ... of each month), so the interval breaks at month boundaries, and patterns such as "every 2 weeks" or "every 3 months" counted from an arbitrary start date are also hard to express accurately. Additionally, during DST (Daylight Saving Time) transitions, CronTrigger interprets the wall-clock time literally, which can lead to unexpected behavior. Therefore, CalendarIntervalTrigger was used alongside CronTrigger for accurate calendar-based calculation of "N days later," "N weeks later," and "N months later."

Quartz triggers also allow a misfire policy to be set for fire times that were missed. In most cases, the policy used was to execute the missed run once immediately and then continue with the regular schedule.
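The difference between a cron-style "every 3 days" and a true calendar interval can be seen in a small sketch. It uses only java.time and no Quartz classes; the class and method names are hypothetical, and cronStyle merely imitates a day-of-month step that restarts on the 1st of each month.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: compares a day-of-month step (1st, 4th, 7th, ...)
// with a fixed 3-day calendar interval.
class EveryThreeDays {

    // Dates selected by a day-of-month step that starts on day 1.
    // The pattern restarts on the 1st of every month, as a cron field does.
    static List<LocalDate> cronStyle(LocalDate from, int step, int count) {
        List<LocalDate> result = new ArrayList<>();
        LocalDate day = from;
        while (result.size() < count) {
            if ((day.getDayOfMonth() - 1) % step == 0) {
                result.add(day);
            }
            day = day.plusDays(1);
        }
        return result;
    }

    // Dates selected by a calendar interval of `step` days, counted from `from`.
    static List<LocalDate> intervalStyle(LocalDate from, int step, int count) {
        List<LocalDate> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            result.add(from.plusDays((long) i * step));
        }
        return result;
    }
}
```

Starting from 2024-01-25, the cron-style sequence is Jan 25, Jan 28, Jan 31, Feb 1, so the last gap shrinks to one day, whereas the interval-style sequence continues with Feb 3. This is the kind of drift that motivates an interval-based trigger.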

4.3 Smart Rescheduling

Keeping schedules synchronized between the App and the Quartz Scheduler is important. When schedule data managed by the App changes, the change must also be reflected in Quartz's Scheduler. When a schedule entity is modified, a domain event is published, and the handler must decide from this event whether the trigger in Quartz actually needs to be re-registered or can be left alone.

If every schedule modification triggers rescheduling, the system falls into an infinite loop of completion → status change → rescheduling → execution → status change → ... To avoid this, changes must be classified by type: rescheduling should happen only when a field related to schedule execution has been modified, not when an unrelated field (such as a status) has changed.
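One way to classify changes is to compare only the fields that influence firing times. The following is a minimal sketch under assumptions: the ScheduleSnapshot type and all of its field names are hypothetical and are not taken from the actual service.

```java
import java.time.Instant;
import java.util.Objects;

class RescheduleDecision {

    // Hypothetical snapshot of a schedule definition; field names are assumptions.
    static final class ScheduleSnapshot {
        final String expression;   // cron expression or interval definition
        final Instant startAt;     // first allowed fire time
        final Instant endAt;       // last allowed fire time, may be null
        final String status;       // e.g. "ACTIVE" or "COMPLETED"; does not affect firing
        final String description;  // free text; does not affect firing

        ScheduleSnapshot(String expression, Instant startAt, Instant endAt,
                         String status, String description) {
            this.expression = expression;
            this.startAt = startAt;
            this.endAt = endAt;
            this.status = status;
            this.description = description;
        }
    }

    // True only when a field that changes the firing times was modified.
    // Status or description changes alone never cause a reschedule.
    static boolean needsReschedule(ScheduleSnapshot before, ScheduleSnapshot after) {
        return !Objects.equals(before.expression, after.expression)
                || !Objects.equals(before.startAt, after.startAt)
                || !Objects.equals(before.endAt, after.endAt);
    }
}
```

Because a status change written by the job itself returns false here, the completion → status change → rescheduling chain is cut at the second step.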

5. Problems Encountered and Solutions

5.1 Preventing Duplicate Execution: Double Defense

While testing in a cluster environment, the same event was sometimes published twice, and the same schedule was registered twice because of retransmission by the event broker. To prevent the same job from running simultaneously on multiple nodes, a two-step defense was implemented. The first step is applying the @DisallowConcurrentExecution annotation to the job class.

The cluster DB lock prevents multiple nodes from grabbing the same trigger at the same time, ensuring that only one node executes it. @DisallowConcurrentExecution additionally prevents the next fire of a recurring schedule from starting while the previous execution has not finished, so the same job never runs more than once at a time. The second step is a duplicate check in the logic that registers schedules with Quartz.
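The second step, the duplicate check at registration time, can be sketched as an idempotent registration. This is a stand-in, not the actual implementation: an in-memory map replaces the scheduler's job store, and the class, method, and schedule ID names are hypothetical. With Quartz itself, a comparable check could be made with Scheduler.checkExists(JobKey) before scheduling the job.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: an in-memory stand-in for the scheduler's job store.
class IdempotentRegistrar {

    // Registered schedules, keyed by a hypothetical schedule ID.
    private final ConcurrentMap<String, String> registered = new ConcurrentHashMap<>();

    // Registers the schedule only if its ID is not known yet.
    // Returns true when newly registered, false for a redelivered duplicate.
    boolean registerOnce(String scheduleId, String definition) {
        return registered.putIfAbsent(scheduleId, definition) == null;
    }

    // Number of distinct schedules currently registered.
    int size() {
        return registered.size();
    }
}
```

Keying the check on the schedule ID means that a redelivered event becomes a no-op instead of a second job.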

5.2 Determining the "Last Execution" of a Recurring Schedule

There was an issue where a recurring schedule continued to execute even after reaching its termination condition, and schedules with a limit on the number of runs were sometimes executed too many times. Repetition can end in various ways, such as after 5 executions or after a specified time, so many cases had to be considered. The key is to accurately compare the schedule information managed by the app with the schedule information managed automatically by Quartz, which means the termination decision is verified at both the Quartz level and the Application level. For instance, whether a next run exists can be determined from the values Quartz calculates automatically (such as the Next Fire Time). After that, the end time and the number of repetitions set in the schedule definition are checked to decide whether the repetition has ended or continues.
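The dual verification described above could look roughly like the following sketch. The class name, the parameter names, and the convention that maxRuns <= 0 means "unlimited" are assumptions. The nextFireTime parameter stands for the value Quartz computes (Trigger.getNextFireTime() returns null when a trigger will not fire again), converted here to an Instant.

```java
import java.time.Instant;

class TerminationCheck {

    // nextFireTime:  next fire time computed by the scheduler, null if none is planned
    // endAt:         end time from the schedule definition, null if there is none
    // maxRuns:       run limit from the schedule definition, <= 0 means unlimited
    // completedRuns: number of executions the application has recorded so far
    static boolean isFinished(Instant nextFireTime, Instant endAt,
                              int maxRuns, int completedRuns) {
        // Scheduler-level check: nothing left to fire.
        if (nextFireTime == null) {
            return true;
        }
        // Application-level check: run limit reached.
        if (maxRuns > 0 && completedRuns >= maxRuns) {
            return true;
        }
        // Application-level check: the next fire would fall after the end time.
        return endAt != null && nextFireTime.isAfter(endAt);
    }
}
```

Checking both levels guards against the case where the scheduler still reports a next fire time although the application-side limit has already been reached.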

5.3 TimeZone Unification

When dealing with time, TimeZone settings must always be considered. If time zones are handled inconsistently from one place to another, both users and developers get confused. Therefore, time information is processed in UTC everywhere, including business logic, DB storage, event payloads, and API requests/responses. Only when displaying on the front end is dayjs used to represent the time in the browser's timezone. This gives a single, unified way of managing time.
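On the backend side, normalizing to UTC can be sketched with java.time as follows. The class and method names are hypothetical; the sketch only shows converting a wall-clock time in a known zone to a UTC instant and rendering it as an ISO-8601 string for payloads.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

class UtcTimes {

    // Converts a wall-clock time in the given zone to a UTC instant
    // suitable for DB storage and scheduling logic.
    static Instant toUtc(LocalDateTime local, ZoneId zone) {
        return local.atZone(zone).toInstant();
    }

    // Formats an instant as an ISO-8601 UTC string for event payloads
    // and API responses.
    static String toPayload(Instant instant) {
        return DateTimeFormatter.ISO_INSTANT.format(instant);
    }
}
```

For example, midnight on December 31, 2024 in Asia/Seoul becomes 2024-12-30T15:00:00Z, and conversion back to a local time is left entirely to the front end.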

6. Conclusion

The most significant lesson from applying Quartz Scheduler's JDBC clustering was the practicality of DB-based coordination. Without a separate distributed coordinator, reliable cluster scheduling could be implemented solely with the row-level locks of an RDBMS. The biggest advantage was that infrastructure complexity did not increase, since we used a database that was already in operation. The second lesson was how well it fits an event-driven architecture. By linking the schedule lifecycle (registration/modification/deletion) to Quartz through domain events, the separation of concerns came naturally. However, it was crucial to anticipate and defend against side effects of event chaining, such as infinite loops. The last lesson was the need to handle many different cases, such as branching the trigger type by usage pattern and adjusting the schedule according to the various termination conditions.

The service described in this document is not yet in active use. It is expected to keep evolving as features are added and as more problems are solved while it is used in various contexts.

Quartz is an old project developed since 2001, but the core mechanism of JDBC clustering remains robust and practical. Particularly in a microservices environment that is already using RDBMS, I believe it is a sufficiently practical option when distributed scheduling needs to be implemented without additional infrastructure.

Brown