Authored by: H. Tsai (IBM Research), L. Zhuang (IBM Research), H. Krishnareddy (IBM Research), D. Dunn (IBM Research), W. Clarke (IBM Research), J. Beaulieu (IBM Research), E. Seabolt (IBM Research), G. Nam (IBM Research), V. Mukherjee (IBM Research), D. Ziger (Synopsys), S. Rhie (Synopsys), Y. Wang (Synopsys), C. King (Synopsys), R. Anna (Synopsys)
In 2024, IBM and Synopsys achieved another major milestone by scaling an Extreme Ultraviolet (EUV) Optical Proximity Correction (OPC) workload beyond 20,000 cores across multiple IBM Cloud clusters globally. This builds on our strong partnership since 2021, when we demonstrated a linear reduction in run time with 11,400 cores.[1]
What Is OPC and Why It Matters
In a typical mask synthesis flow, the design data go through several stages of transformation to reach the final patterns needed on a set of lithography masks for the precise printing of desired shapes on wafers. Among those stages, OPC is the most compute-intensive. As technology advances, the number of mask shapes and the complexity of OPC recipes increase, driving up compute capacity requirements. Because compute demand fluctuates from one mask level to another, OPC workloads are ideal for the Cloud, where resources can be dynamically provisioned to meet turnaround-time demands.
IBM's Cloud bursting infrastructure for OPC includes IBM Storage Scale (formerly GPFS) for storage, compute cores from Virtual Server Instances (VSIs), and IBM Spectrum LSF for job management. The infrastructure is optimized for high-performance OPC workloads, using pre-built images, Terraform, and Ansible playbooks for automation to create a consistent and scalable environment. LSF connections were established between the on-premises cluster and multiple IBM Cloud clusters in various locations, making the execution of OPC jobs on remote clusters as simple as submitting jobs to a different LSF queue. Typically, a single mask is divided into smaller chiplets, with each chiplet's OPC executed as a separate job, so OPC for a single mask may comprise multiple OPC jobs. These jobs are automatically submitted to the various clusters by an automation script that leverages LSF's advance reservation feature.
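The production automation itself is internal, but a minimal sketch of the fan-out pattern, written in Python around standard `bsub` options, might look like the following. The queue names, chiplet list, core counts, and the `run_opc.sh` wrapper are hypothetical placeholders, not the actual IBM/Synopsys scripts.

```python
# Minimal sketch: fan chiplet-level OPC jobs out to remote IBM Cloud LSF queues.
# Queue names, chiplet list, core count, and run_opc.sh are illustrative placeholders.
import subprocess

REMOTE_QUEUES = ["cloud_dal10", "cloud_fra02", "cloud_tok04"]  # one queue per remote cluster (assumed names)
CHIPLETS = [f"chiplet_{i:03d}" for i in range(12)]             # a mask split into 12 chiplets (illustrative)

def submit_chiplet(chiplet: str, queue: str, cores: int = 1024) -> None:
    """Submit one chiplet's OPC run as a separate LSF job using standard bsub flags."""
    cmd = [
        "bsub",
        "-q", queue,                      # target the remote cluster's LSF queue
        "-J", f"opc_{chiplet}",           # job name
        "-n", str(cores),                 # requested cores for this chiplet
        "-o", f"logs/{chiplet}.%J.out",   # LSF output file (%J expands to the job ID)
        "./run_opc.sh", chiplet,          # hypothetical wrapper around the Proteus recipe
    ]
    subprocess.run(cmd, check=True)

# Round-robin the chiplets across the remote clusters.
for i, chiplet in enumerate(CHIPLETS):
    submit_chiplet(chiplet, REMOTE_QUEUES[i % len(REMOTE_QUEUES)])
```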
A distinctive aspect of OPC jobs is their two-step process: an initial primary job runs on a single host, followed by a secondary worker job that distributes the workload across multiple hosts. To guarantee resource availability within a cluster for both the primary and worker jobs, we poll the cluster to determine the necessary resources and then establish advance reservations through LSF before submitting the primary job to the LSF queue. This LSF setup enabled us to demonstrate large-scale OPC production with Synopsys Proteus across multiple LSF queues, utilizing up to 19,800 physical cores for a single job and up to 26,600 cores simultaneously across multiple jobs.
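As an illustration of that flow, the sketch below polls `bhosts` for free slots, creates a named advance reservation with `brsvadd`, and submits the single-host primary job against it with `bsub -U`. The host group, user, queue, time window, and `run_opc_primary.sh` script are assumptions for the example, not the production automation.

```python
# Sketch of the polling + advance-reservation flow for the two-step OPC job.
# Host group, queue, user, time window, and run_opc_primary.sh are illustrative assumptions.
import subprocess
import time

def free_slots() -> int:
    """Poll LSF via `bhosts` and sum free slots (MAX - NJOBS) on hosts in 'ok' state.
    Assumes the default bhosts column layout; adjust the parsing for your site."""
    out = subprocess.run(["bhosts"], capture_output=True, text=True, check=True).stdout
    total = 0
    for line in out.splitlines()[1:]:          # skip header row
        f = line.split()
        if len(f) >= 5 and f[1] == "ok":
            total += int(f[3]) - int(f[4])     # MAX - NJOBS
    return total

NEEDED = 19_800                                # worker cores for one large OPC job
while free_slots() < NEEDED:
    time.sleep(60)                             # poll once a minute until capacity is available

# Reserve the slots up front so both the primary and worker jobs are guaranteed resources,
# then submit the single-host primary job against the named reservation.
subprocess.run(
    ["brsvadd", "-N", "opc_rsv_01",            # named reservation, referenced below via bsub -U
     "-n", str(NEEDED),
     "-m", "cloud_hostgroup",                  # hypothetical host group spanning the cloud VSIs
     "-u", "opcuser",
     "-b", "10:00", "-e", "22:00"],            # one-time reservation window (hour:minute)
    check=True,
)
subprocess.run(
    ["bsub", "-q", "cloud_dal10", "-U", "opc_rsv_01", "-n", "1", "./run_opc_primary.sh"],
    check=True,
)
```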
Using LSF and Synopsys Proteus on IBM Cloud, IBM demonstrated an optimized Cloud infrastructure for OPC and Lithography Rule Check (LRC) workloads, achieving near-linear performance scaling on the main compute portion of the OPC workload with up to 19,800 physical cores per job on a 2nm technology node EUV active area layout of 81.8 mm² in area. The remote cluster was built across three Availability Zones (AZs). We do observe deviation from linear scaling above 15,000 cores, likely due to network and file system latency, especially for the cross-AZ compute nodes. While OPC workloads iteratively optimize mask shapes to produce the best imaging quality on wafer, LRC workloads verify that the final wafer imaging results pass a set of pre-defined rules to qualify the mask for manufacturing. LRC using Synopsys Proteus (PLRC) exhibits different scaling behavior from OPC and was optimized to minimize library loading and file removal overhead in the main compute segment of the workload. Further optimization is underway to achieve linear scaling for both OPC and PLRC with 20,000 cores and beyond.
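For readers who want to quantify "near-linear," parallel efficiency is the measured speedup relative to a baseline core count divided by the ideal (linear) speedup. The runtimes in the snippet below are placeholder values chosen only to illustrate the calculation; they are not measured results from these runs.

```python
# Parallel efficiency: 1.0 means perfectly linear scaling; values below 1.0 reflect the
# kind of deviation attributed to network and file-system latency across AZs.
# The runtimes used here are made-up illustrative numbers, not measured data.
def parallel_efficiency(base_cores: int, base_time: float, cores: int, runtime: float) -> float:
    """Speedup relative to the baseline run, divided by the ideal (linear) speedup."""
    speedup = base_time / runtime
    ideal = cores / base_cores
    return speedup / ideal

print(parallel_efficiency(5_000, 10.0, 15_000, 3.5))   # ~0.95
print(parallel_efficiency(5_000, 10.0, 19_800, 2.9))   # ~0.87
```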
IBM and Synopsys share a vision for a Cloud-native future using Kubernetes, allowing integration with modern AI platforms like Red Hat OpenShift AI. Kubernetes offers advantages over traditional HPC schedulers in managing large-scale infrastructure of potentially hundreds of thousands of nodes. Although still in the early stages of research, Kubernetes-native HPC is expected to gain strong support in the long term.[2]
Using Kubernetes on IBM Cloud, OPC and PLRC workloads were scaled to 21,000 physical cores per job for the same 2nm technology node EUV active area layout. While the scalability does not yet match that of the LSF cluster, we are confident that further tuning and optimization of the Kubernetes overlay network will yield a more linear scaling curve as the number of cores increases.
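Deployment details of the Kubernetes runs are beyond the scope of this post, but a worker fleet of this kind can be expressed as a Kubernetes Indexed Job. The sketch below uses the official Python client; the image, namespace, entrypoint, and pod sizing (150 pods x 140 cores ≈ 21,000 cores) are illustrative assumptions, not the actual IBM/Synopsys deployment.

```python
# Minimal sketch: an OPC worker fleet as a Kubernetes Indexed Job via the official Python client.
# Image, namespace, entrypoint, and pod sizing are hypothetical; requires a cluster new enough
# to support Indexed Jobs (Kubernetes 1.21+).
from kubernetes import client, config

config.load_kube_config()                       # or load_incluster_config() when run inside the cluster

WORKERS = 150                                   # 150 pods x 140 cores ≈ 21,000 cores (illustrative split)
worker = client.V1Container(
    name="opc-worker",
    image="registry.example.com/opc-worker:latest",            # hypothetical image
    command=["/opt/opc/run_worker.sh"],                         # hypothetical worker entrypoint
    resources=client.V1ResourceRequirements(requests={"cpu": "140", "memory": "512Gi"}),
)
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="opc-workers"),
    spec=client.V1JobSpec(
        parallelism=WORKERS,
        completions=WORKERS,
        completion_mode="Indexed",              # each pod gets a stable completion index
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[worker]),
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="opc", body=job)
```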
The collaboration between IBM and Synopsys is a testament to the power of Cloud computing in handling highly parallelizable and compute-intensive tasks like OPC. By pushing the limits of scalability, the two companies are paving the way for future innovations in semiconductor manufacturing, leveraging the dynamic and scalable capabilities of Cloud infrastructure.
References
[1] https://www.ibm.com/blog/ibm-and-synopsys-demonstrate-euv-opc-workload-runs-11000-cores-on-the-hybrid-cloud/
[2] https://thenewstack.io/kubernetes-evolution-from-microservices-to-batch-processing-powerhouse/