Experience Optimizing an MSA Backend for Large-scale Clinical Data Processing and Legacy Integration (AMC PROMs Platform)
1. Introduction and Technical Challenges
The core data behind 'Patient-Centered Care', a key paradigm of modern medicine, is PROMs (Patient-Reported Outcome Measures). I was responsible for backend development of the PROMs platform at Asan Medical Center (AMC), with the mission of reliably collecting and analyzing vast clinical data and integrating it with the hospital's massive core system (AMIS). In this article I share how I resolved, from an architectural perspective, the four technical challenges I faced along the way: authentication constraints, bottlenecks in real-time data visualization, OOM during large-scale retrospective calculations, and the risk of cascading failure.
2. A Non-standard, Payload-based Authentication Adapter That Overcomes HTTP Header Constraints
[Problem Situation: HTTP Header Limitations in the Legacy Environment] A modern MSA environment built on Spring Security adheres to the standard protocol of exchanging JWTs through the HTTP Authorization header. Due to structural constraints, however, the hospital core system (AMIS) could not send tokens in the standard HTTP header; instead it had to transmit encrypted authentication information in a specific field (encToken) inside the request body (AmcData).
[Solution: ContentCachingRequestWrapper and a Custom Filter] To accommodate the non-standard communication while maintaining internal security standards, a custom Security Filter was placed at the front of the filter chain. Because a servlet request's InputStream can only be consumed once, ContentCachingRequestWrapper was applied to cache the request body. The filter extracts the encToken from the cached body, validates it against the Keycloak server, and injects the verified identity into the ThreadLocal-based SecurityContextHolder. The result is an authentication adapter that accommodates the external non-standard integration while following the standard authorization flow inside the server.
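As a hedged illustration, a minimal version of such a filter might look like the sketch below. Every name here (CachedBodyRequestWrapper, EncTokenValidator, EncTokenAuthenticationFilter, the top-level "encToken" JSON field) is an assumption, not the project's actual code. One caveat: ContentCachingRequestWrapper only exposes its cached bytes after the stream has been consumed, so this sketch buffers the body up front with a small custom wrapper to keep it re-readable downstream, which is the role the cached request wrapper plays in the design described above.

```java
// Illustrative sketch only; all class and field names are assumptions.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ReadListener;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletInputStream;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletRequestWrapper;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.security.core.Authentication;
import org.springframework.security.core.context.SecurityContextHolder;
import org.springframework.web.filter.OncePerRequestFilter;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Buffers the body up front so it can be read both in the filter and by the controller. */
class CachedBodyRequestWrapper extends HttpServletRequestWrapper {
    private final byte[] body;

    CachedBodyRequestWrapper(HttpServletRequest request) throws IOException {
        super(request);
        this.body = request.getInputStream().readAllBytes();
    }

    String bodyAsString() {
        return new String(body, StandardCharsets.UTF_8);
    }

    @Override
    public ServletInputStream getInputStream() {
        ByteArrayInputStream buffer = new ByteArrayInputStream(body);
        return new ServletInputStream() {
            @Override public boolean isFinished() { return buffer.available() == 0; }
            @Override public boolean isReady() { return true; }
            @Override public void setReadListener(ReadListener listener) { /* synchronous reads only */ }
            @Override public int read() { return buffer.read(); }
        };
    }
}

/** Validates the encrypted token, e.g. by delegating to a Keycloak introspection endpoint. */
@FunctionalInterface
interface EncTokenValidator {
    Authentication validate(String encToken);
}

public class EncTokenAuthenticationFilter extends OncePerRequestFilter {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private final EncTokenValidator validator;

    public EncTokenAuthenticationFilter(EncTokenValidator validator) {
        this.validator = validator;
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        CachedBodyRequestWrapper wrapped = new CachedBodyRequestWrapper(request);

        // Pull the non-standard encToken field out of the cached AmcData payload.
        String encToken = extractEncToken(wrapped.bodyAsString());
        if (encToken != null) {
            // Populate the ThreadLocal SecurityContext so downstream authorization
            // behaves exactly as it would for a standard Bearer token.
            SecurityContextHolder.getContext().setAuthentication(validator.validate(encToken));
        }

        // Hand the wrapped request on so the controller can still read the body.
        chain.doFilter(wrapped, response);
    }

    private String extractEncToken(String json) throws IOException {
        if (json == null || json.isBlank()) return null;
        JsonNode root = MAPPER.readTree(json);
        return root.hasNonNull("encToken") ? root.get("encToken").asText() : null;
    }
}
```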
3. Introducing Kafka & CQRS Patterns for Real-Time Clinical Data Visualization
[Problem Situation: Real-Time DB Bottleneck from Complex Medical Scoring Formulas] By the time a patient submits the medical questionnaire and enters the consultation room, the pain index and quality-of-life (QoL) trends should already be visualized on the physician's monitor. If every dashboard view required joining vast historical survey-response tables and evaluating medical scoring rules on the fly, it would cause serious query latency and a poor experience for the medical staff.
[Solution: Build an Event-Driven Pipeline and a Separate Query Model (QM)]
To guarantee real-time responsiveness, we combined the CQRS (Command Query Responsibility Segregation) architecture with Kafka event streaming, as sketched after the list below.
Lightweight Command: when a patient submits a survey, only the raw data is inserted and the response returned immediately, while an event is published asynchronously to a Kafka topic.
Asynchronous Pre-calculation: a consumer subscribes to the event and performs the heavy computation (conditional item processing, score aggregation) in the background, driven by a dynamic SpEL-based rule engine.
Query Model Optimization: the calculated results are loaded into a 'statistics table' that is aggressively flattened for reading. As a result, healthcare professionals read pre-computed summary rows directly (effectively O(1) per lookup), and queries stay responsive even as data accumulates.
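The following is a minimal sketch of both sides under assumed names: the domain stand-ins (Survey, SurveySubmission, SurveyRepository, StatisticsRepository), the topic proms.survey.submitted, and the example scoring expression are all illustrative, not the project's actual code.

```java
// Illustrative sketch; domain types, topic and table names are assumptions.
import org.springframework.expression.ExpressionParser;
import org.springframework.expression.spel.standard.SpelExpressionParser;
import org.springframework.expression.spel.support.StandardEvaluationContext;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.Optional;

// Minimal stand-ins for the project's domain model (assumptions for this sketch).
record SurveySubmission(Long patientId) {}
class Survey {
    private Long id;
    private String scoringRule; // a SpEL expression, e.g. "(painScore + fatigueScore) * 5"
    static Survey from(SurveySubmission s) { return new Survey(); }
    Long getId() { return id; }
    String getScoringRule() { return scoringRule; }
}
interface SurveyRepository {
    Survey save(Survey survey);
    Optional<Survey> findById(Long id);
}
interface StatisticsRepository {
    void upsert(Long surveyId, int score);
}

record SurveySubmittedEvent(Long surveyId) {}

// Command side: persist the raw answers and return immediately.
@Service
class SurveyCommandService {
    private final SurveyRepository repository;
    private final KafkaTemplate<String, SurveySubmittedEvent> kafka;

    SurveyCommandService(SurveyRepository repository,
                         KafkaTemplate<String, SurveySubmittedEvent> kafka) {
        this.repository = repository;
        this.kafka = kafka;
    }

    @Transactional
    public Long submit(SurveySubmission submission) {
        Survey saved = repository.save(Survey.from(submission)); // lightweight insert only
        kafka.send("proms.survey.submitted", new SurveySubmittedEvent(saved.getId()));
        return saved.getId();                                    // respond without computing scores
    }
}

// Query side: heavy scoring happens off the request path.
@Component
class SurveyScoreConsumer {
    private static final ExpressionParser PARSER = new SpelExpressionParser();
    private final SurveyRepository surveys;
    private final StatisticsRepository statistics;

    SurveyScoreConsumer(SurveyRepository surveys, StatisticsRepository statistics) {
        this.surveys = surveys;
        this.statistics = statistics;
    }

    @KafkaListener(topics = "proms.survey.submitted", groupId = "proms-score")
    public void onSubmitted(SurveySubmittedEvent event) {
        Survey survey = surveys.findById(event.surveyId()).orElseThrow();

        // Dynamic rule engine: the scoring formula is stored as a SpEL expression
        // and evaluated against the survey as the root object.
        StandardEvaluationContext ctx = new StandardEvaluationContext(survey);
        int score = PARSER.parseExpression(survey.getScoringRule()).getValue(ctx, Integer.class);

        // Upsert into the flattened statistics table that the dashboard reads from.
        statistics.upsert(event.surveyId(), score);
    }
}
```

Separating the write path from the scoring work is what keeps the patient-facing submit call fast regardless of how complex the rules become.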
4. Preventing JPA OOM During Large-scale Retrospective Calculations
[Problem Situation: Persistence Context (L1 Cache) Saturation When Loading Hundreds of Thousands of Entities] A critical issue surfaced whenever the statistical criteria changed and several years of past survey data had to be recalculated (retroactively applied) in a single pass: more than 100,000 entities were loaded into the persistence context (first-level cache), and because the context holds strong references to every managed entity, the garbage collector (GC) could not reclaim them, resulting in OOM (Out of Memory).
[Solution: Chunk Processing and Explicit Memory Release] To avoid compromising system availability, we applied a chunk-based memory-control technique, sketched below. The data set was sliced into chunks of 1,000 records using Offset/Limit queries, and the computed results were written back via Bulk Insert. The most crucial tuning was explicitly calling entityManager.clear() after each 1,000-record chunk to empty the persistence context. With this in place, even if the target grows to millions of records, the server's heap usage stays flat at roughly the footprint of a single 1,000-record chunk.
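A minimal sketch of the chunked loop follows, assuming an illustrative SurveyResponse entity and a flattened survey_statistics target table; only the 1,000-record chunk size is taken from the text above.

```java
// Illustrative sketch; entity, table and method names are assumptions.
import jakarta.persistence.Entity;
import jakarta.persistence.EntityManager;
import jakarta.persistence.Id;
import jakarta.persistence.PersistenceContext;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;

// Minimal stand-in for the project's entity (assumption for this sketch).
@Entity
class SurveyResponse {
    @Id private Long id;
    Long getId() { return id; }
    int computeScore() { return 0; } // placeholder for the rule-engine evaluation
}

@Service
public class RetroactiveRecalculationService {

    private static final int CHUNK_SIZE = 1_000;

    @PersistenceContext
    private EntityManager em;
    private final JdbcTemplate jdbcTemplate;

    public RetroactiveRecalculationService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Transactional
    public void recalculateAll() {
        int offset = 0;
        List<SurveyResponse> chunk;
        do {
            // Offset/Limit slice: at most 1,000 entities are managed at any time.
            chunk = em.createQuery(
                        "select r from SurveyResponse r order by r.id", SurveyResponse.class)
                      .setFirstResult(offset)
                      .setMaxResults(CHUNK_SIZE)
                      .getResultList();

            // Recalculate and write back in a single JDBC batch (bulk insert).
            List<Object[]> rows = chunk.stream()
                    .map(r -> new Object[] { r.getId(), r.computeScore() })
                    .toList();
            jdbcTemplate.batchUpdate(
                "insert into survey_statistics (response_id, score) values (?, ?)", rows);

            // The crucial step: detach everything so the first-level cache
            // (and therefore heap usage) stays flat across chunks.
            em.flush();
            em.clear();

            offset += CHUNK_SIZE;
        } while (chunk.size() == CHUNK_SIZE);
    }
}
```

Keyset (id-based) pagination would additionally avoid the growing offset-scan cost on very large tables; the Offset/Limit form shown mirrors the approach described above.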
5. Design to Prevent Cascading Failure during Core System Integration
[Problem Situation: Internal Thread Pool Exhaustion Due to External API Delays] If a temporary load spike or delay occurred in the AMIS system, the Tomcat worker threads of the PROMs server waiting on those responses could block indefinitely, posing a risk of cascading failure that paralyzes the entire platform.
[Solution: Intelligent RestClient and Fault Tolerance Logic]
We introduced Spring 6's RestClient and defensively redesigned the communication client to contain this source of risk (a sketch follows the list below).
Strict Timeout Isolation: by enforcing a 5-second connect timeout and a 30-second read timeout, we guaranteed that a hung connection is terminated and its resources reclaimed within the bounded timeout window even during a core-system failure, preserving the platform's independence.
Error Bypass Technique: we implemented logic that accepts specific tolerable error codes (ignorableMessageIds) as variable arguments and bypasses them instead of always throwing an exception on every failed response. This kept core-system instability from degrading the patient's inquiry experience.
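A minimal sketch of such a client follows; the endpoint path, the AmcResponse envelope carrying a messageId, and the exception type are assumptions, while the 5 s / 30 s timeouts and the ignorableMessageIds varargs mirror the design described above.

```java
// Illustrative sketch; endpoint, response shape and exception are assumptions.
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestClient;

import java.util.Set;

record AmcResponse(String messageId, String payload) {}

class AmisIntegrationException extends RuntimeException {
    AmisIntegrationException(String messageId) {
        super("AMIS call failed with messageId=" + messageId);
    }
}

public class AmisClient {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private final RestClient restClient;

    public AmisClient(String baseUrl) {
        // Strict timeout isolation: fail fast instead of blocking Tomcat threads.
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(5_000);   // 5 s to establish the connection
        factory.setReadTimeout(30_000);     // 30 s ceiling on waiting for AMIS

        this.restClient = RestClient.builder()
                .baseUrl(baseUrl)
                .requestFactory(factory)
                .build();
    }

    /** Calls AMIS; error responses whose messageId is listed are bypassed, not thrown. */
    public AmcResponse call(Object payload, String... ignorableMessageIds) {
        Set<String> ignorable = Set.of(ignorableMessageIds);
        return restClient.post()
                .uri("/amis/integration")   // hypothetical endpoint
                .body(payload)
                .exchange((request, response) -> {
                    AmcResponse amc = MAPPER.readValue(response.getBody(), AmcResponse.class);
                    if (response.getStatusCode().isError()
                            && !ignorable.contains(amc.messageId())) {
                        throw new AmisIntegrationException(amc.messageId());
                    }
                    return amc;             // tolerated errors flow back as normal responses
                });
    }
}
```

Using exchange() is deliberate here: it hands over the raw response without applying the default 4xx/5xx status handlers, which is what makes the selective bypass possible.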
6. Conclusion
This project was a valuable experience in rigorously thinking through how a backend system should withstand traffic surges and ever-growing data in an enterprise environment. Flexibly accommodating a non-standard authentication method, guaranteeing real-time data availability through CQRS, controlling memory at scale through first-level cache management, and preventing failure propagation with defensive programming have become solid technical assets that give a system 'scalability and robustness' beyond mere functionality. Over the past six months the entire team worked day and night to complete the project, overcoming numerous challenges. The client's heartfelt smile and the letter of gratitude we received at the end of that intense process are memories I will never forget. As an engineer, I am deeply happy and fulfilled to have helped firmly establish 'NEXTREE' as a technical partner that clients can trust and rely on. I will build on this experience to design robust architectures that maximize our clients' business value.