What you need to know about TLS session ID caching in a sysplex and why it is important.
Every secure, high-volume web server that uses TLS 1.2 or earlier would likely collapse under the burden of encryption processing were it not for TLS session ID caching. So what is it, and how does it apply to CICS regions in a clustered z/OS sysplex environment?
The TLS session ID cache
I’m not going to explain the intricacies of the full TLS 1.2 session handshake protocol, but suffice it to say that it requires multiple network exchanges between the client and server whenever a new HTTP connection is made. As part of the handshake, expensive asymmetric cryptographic operations must be performed to generate the encryption keys for the session.
CICS Sockets Domain (SO) handles TCP/IP communications and, when configured to do so, ensures that TLS protection is applied to the CICS transactions that flow across TCP/IP. CICS offers three different approaches to TLS protection:
- SO can invoke a z/OS component called System SSL to manage the TLS protocol processing (as depicted in the first diagram below). In this case, CICS invokes System SSL on its dedicated S8 TCBs.
- When acting in the server role, SO can delegate TLS processing to the TCP/IP stack using a z/OS Communications Server feature called Application Transparent TLS (AT-TLS). In this case, AT-TLS invokes System SSL on behalf of CICS (as depicted in the second diagram below).
- Liberty uses the TLS capabilities of the JVM provided by JSSE. This scenario is NOT discussed here.
In the two illustrated cases, System SSL performs the low-level TLS protocol processing. This article focuses primarily on the first of these approaches; at the end, we will quickly explore the configuration differences when using AT-TLS.
Session ID caching is an optimisation of the TLS handshake protocol whereby a cache is used to store session IDs for TLS sessions between clients and CICS®. System SSL supports session ID caching. When a TLS client reuses one of these session IDs, System SSL can perform a partial handshake with a client that it has previously authenticated, which significantly reduces encryption processing compared to the full handshake.
The session ID cache can be local to a CICS region or shared between CICS regions on a sysplex. This is configured by the system initialization parameter SSLCACHE.
Of course, with HTTP v1.1 the default is that HTTP connections between clients and CICS can persist. That is, once established on the first request, the underlying TCP/IP socket connection remains open between the specific client and CICS for use by subsequent requests. This is the first level of efficiency: requests that reuse an open connection need no session caching at all, because the TLS handshake is only required for new network connections.
So persistent connections are great, unless of course they persist too long. Persisting forever, for example, inhibits how you might expect the system to distribute connections across cloned systems in a sysplex listening on a shared port or IP address. So CICS allows you to limit the number of persistent connections it maintains concurrently, and to limit the time for which the connections persist.
Ideally in a steady state there will be a low rate at which connections are closed and re-established, thus balancing reuse with adaptation to workload.
Note: Connection reuse is per client system, so if you have a consistent, fairly static set of clients making the requests, then this will optimize the overheads of making new connections. A more dynamic set of clients will be establishing brand new connections more frequently.
Persistent connections in a sysplex
If you’re using multiple CICS regions to serve the workload, then the port-sharing and Sysplex Distributor capabilities of z/OS Communications Server will choose where to route each connection request. The client sees one hostname and port; the sysplex makes the cloned CICS regions appear as a single network endpoint.
This is one of the cases where periodic closure of connections is a good thing as it provides the opportunity for the IP stack to continue to be involved in decisions about where connections are routed on a continual basis and not just when the workload starts.
How steady is your steady-state?
This great idea of persistent connections means there’s a pool of connections maintained between each client and CICS. Since the client is initiating the request, it can check its pool of currently free connections when making a request. If there’s one in the pool then it’s used - otherwise the client can decide to ask for a brand new connection and expand the pool. Of course, a client might apply a policy that limits the pool size… or it might not.
Handling a workload spike
If the pool self-tunes according to workload, then we need to understand how it will behave if workload increases at a fast rate, i.e. a workload spike. Depending on the client implementation, an increase in the request rate might mean that the client has many more requests than free persistent connections in its pool, and it will simply send requests for new connections to CICS to deal with them.
Without any client flow control, the rate of connection requests can go from practically zero to the arrival rate of new requests in the client. Zero to one hundred miles an hour in a few seconds using an automotive analogy. Now you need to worry about the expense of each new connection request on the server, just at the time when there’s apparently a lot of demand for service!
Remind me how TLS session caching works
As mentioned above, the cache is what allows a client to ‘resume’ a session with a server with minimal overhead.
The cache can’t avoid the cost of making an initial TLS connection request: the client and the server exchange their public key cryptographic information, validate each other, and agree how to encrypt their messages using relatively cheap symmetric encryption.
The value of the TLS session ID cache is that the handshake for that first connection can provide a session ID which allows other connection requests from the same client to the same server to use a partial handshake. Partial handshakes are much cheaper to perform than full handshakes. For the time period that the session ID is valid, the client can use it to open connections to the server - re-opening or opening a new connection counts as ‘resuming’ the TLS session.
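From the client side, resuming a session can be sketched with Python's standard `ssl` module. This is only an illustration of the idea, not something the article's clients necessarily do: the function name `connect_twice` and its parameters are hypothetical, and depending on what the server offers, the underlying TLS library may resume through a session ticket rather than a cached session ID.

```python
import socket
import ssl


def connect_twice(host, port=443):
    """Open two TLS 1.2 connections, offering the first session on the second.

    Returns (first_reused, second_reused). The first value is expected to be
    False (full handshake); the second is True only if the server agreed to
    resume the session.
    """
    context = ssl.create_default_context()
    # Cap the protocol at TLS 1.2, the version this article discusses.
    context.maximum_version = ssl.TLSVersion.TLSv1_2

    # First connection: nothing to offer, so a full handshake takes place.
    with socket.create_connection((host, port)) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            saved_session = tls.session
            first_reused = tls.session_reused

    # Second connection: offer the saved session so the server can resume it.
    with socket.create_connection((host, port)) as raw:
        with context.wrap_socket(
            raw, server_hostname=host, session=saved_session
        ) as tls:
            second_reused = tls.session_reused

    return first_reused, second_reused
```

Calling `connect_twice("your.host.example", 443)` against a server that caches sessions would be expected to report the second connection as resumed; the host name here is a placeholder.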
The resumption protocol treats a session ID as, in effect, an agreement: if the client supplies it on the TLS ClientHello, the server can look in its cache, which is keyed by the session ID, and find enough information to avoid the full handshake.
Session IDs have a limited lifetime for which they are valid. Thus, part of the protocol is the possibility that either the client or the server might decide that a session ID is no longer valid, in which case a full handshake is performed and a new session ID is generated.
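The server side of this can be sketched as a small keyed cache with an expiry time. This is a toy model written for illustration, not System SSL's implementation: the class `SessionCache`, the `handshake` helper and the injectable `clock` are all hypothetical names, and the "secret" is just random bytes standing in for the negotiated key material.

```python
import secrets
import time


class SessionCache:
    """Toy TLS session ID cache: session ID -> (secret, expiry time)."""

    def __init__(self, timeout_seconds=86400, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self._entries = {}

    def full_handshake(self):
        """Stand-in for the expensive handshake; issues a new session ID."""
        session_id = secrets.token_bytes(32)
        secret = secrets.token_bytes(48)
        expires_at = self.clock() + self.timeout_seconds
        self._entries[session_id] = (secret, expires_at)
        return session_id, secret

    def resume(self, session_id):
        """Return the cached secret if the ID is known and unexpired."""
        entry = self._entries.get(session_id)
        if entry is None:
            return None
        secret, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[session_id]  # expired: forget the session
            return None
        return secret


def handshake(cache, offered_session_id=None):
    """Return ('partial' | 'full', session_id) for one connection request."""
    if offered_session_id is not None:
        if cache.resume(offered_session_id) is not None:
            return "partial", offered_session_id
    session_id, _ = cache.full_handshake()
    return "full", session_id
```

A first call to `handshake(cache)` is always a full handshake; offering the returned ID on a later call yields a partial handshake until the timeout passes, after which a new ID is issued.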
Ensuring the session ID works across the sysplex
Now we come to the point of this article: why you should use SSLCACHE=SYSPLEX under most circumstances, and not the default, which is SSLCACHE=CICS.
Remember, in a sysplex, multiple CICS regions can look like one hostname/port. A client sees multiple CICS regions that listen on one hostname and port as one server, and thus expects any of them to be able to honor its cached TLS session IDs. Each connection request will be routed by the IP stack to one of those regions, and the next request will in all likelihood go to a different one.
As you can imagine, SSLCACHE=CICS means each CICS region has its own private cache (managed by System SSL). Now recall that a session ID is a key into a cache. The client will expect to use a session ID generated by a first full handshake on a second connection request, but because the sysplex spreads connection requests across different CICS regions, that session ID is not going to be valid when the client reuses it, unless all the CICS regions share a single session ID cache.
With SSLCACHE=CICS, reusing a session ID fails whenever the connection is routed to a region other than the one that issued it, which with several regions is most of the time. Not only does this mean that every such connection request reverts to a full handshake, the client also has to assume that the session ID it tried to use is now invalid for the server cluster. The session ID remains in the cache of the CICS region that generated it, but the client will discard it since the server has said it’s not valid. So in short, session IDs offer no practical value when multiple CICS regions listen on the same hostname and port and use SSLCACHE=CICS.
This non-functioning cache might not be noticed when the connection rate is low in your steady-state. But if your workload spikes and many new connection requests are distributed across those multiple CICS regions you have active for just such eventualities - that’s when you will notice they are all performing full handshakes. Not a good time to notice!
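To get a feel for how badly private caches behave, here is a small simulation. It is a sketch under simplifying assumptions that are mine, not the article's: a single client, connections routed to regions uniformly at random, and a client that drops its session ID as soon as a server rejects it. The function name `count_full_handshakes` is hypothetical.

```python
import itertools
import random


def count_full_handshakes(num_regions, num_connections, shared_cache, seed=1):
    """Count full handshakes for one client whose connections are spread
    at random over num_regions regions."""
    rng = random.Random(seed)
    # Either every region refers to the same cache, or each has its own.
    shared = set()
    caches = [shared if shared_cache else set() for _ in range(num_regions)]
    next_id = itertools.count()
    client_session_id = None
    full = 0
    for _ in range(num_connections):
        region = rng.randrange(num_regions)
        if client_session_id is not None and client_session_id in caches[region]:
            continue  # partial handshake: the ID was found in this cache
        # Full handshake: this region issues a new ID and the client
        # replaces whatever ID it held before.
        full += 1
        client_session_id = next(next_id)
        caches[region].add(client_session_id)
    return full
```

With a shared cache only the very first connection needs a full handshake. With private caches and N regions, each later connection has roughly an (N-1)/N chance of landing on a region that has never seen the client's current ID, so with four regions about three quarters of all connections fall back to full handshakes.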
… and it might get worse!
So now the workload is spiking, but you have plenty of CICS S8 TCBs and crypto hardware, so even full handshakes shouldn’t be too much of an overhead; the client will only need to request a few more connections.
But what if the client keeps requesting new connections because the response time on the existing connections is too high? How likely is this? Well, the arrival rate already spiked, which exhausted the pool; now full handshakes are being executed, and the increased processing demands of those handshakes are slowing down requests even on the established connections. So connections are busy for longer, the client finds no free connection in the pool more often and demands new connections to expand the pool… which slows down requests… you see where we might end up?
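The loop can be put into rough numbers with Little's law (average requests in flight = arrival rate × response time), a standard queueing result that the article itself does not invoke. The figures below are invented purely for illustration, and `connections_in_use` is a hypothetical helper.

```python
def connections_in_use(arrival_rate_per_sec, response_time_ms):
    """Little's law: average requests in flight = arrival rate x response time."""
    return arrival_rate_per_sec * response_time_ms / 1000


# Illustrative numbers only (assumptions, not measurements).
steady = connections_in_use(200, 50)       # steady state
spike = connections_in_use(1000, 50)       # arrival rate spikes fivefold
degraded = connections_in_use(1000, 200)   # handshake load slows responses

print(steady, spike, degraded)  # 10.0 50.0 200.0
```

In this made-up scenario the spike alone takes the client from 10 busy connections to 50; if the extra full handshakes then stretch response time from 50 ms to 200 ms, the same arrival rate needs 200 connections, each of which costs yet another handshake.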
The answer… a sysplex shared cache
Luckily, the answer is at hand. SSLCACHE=SYSPLEX is a CICS SIT option which simply asks System SSL to behave such that any TLS session ID generated from any CICS region in the sysplex is valid throughout the sysplex. So instead of these IDs hardly ever being valid when connections are spread around, they are valid and provide all the expected benefits. This enables clients to re-use a session ID across any system in the sysplex until the defined expiry time (or until cache utilization causes it to be discarded).
Works great in a single LPAR or a sysplex
Although this option is associated with a sysplex, it also works just as well if using a cluster of regions on a single LPAR sharing a common port. If SSLCACHE=CICS is used in this scenario, session IDs from other regions even in the same LPAR will be seen as invalid and the cache will provide no benefit to TLS handshaking.
SSLCACHE=SYSPLEX requires activating the z/OS System SSL started task GSKSRVR, which provides:
- An LPAR-level cache directly validating any session ID generated on that LPAR
- A federation of caches enabling session IDs to be efficiently validated no matter which LPAR in a sysplex the connection request is received on
The federation requires no Coupling Facility structures to be defined. Each GSKSRVR can directly contact a partner GSKSRVR on the LPAR which generated the session ID using XCF communications. So as well as no CF structures, there is no sysplex-wide searching or broadcast required - System SSL and GSKSRVR are clever enough to know exactly where to go!
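One way to picture a lookup that needs neither a shared structure nor a broadcast is sketched below. The article does not say how System SSL and GSKSRVR locate the owning LPAR, so the central idea here, that a session ID carries the name of the system that issued it, is purely my assumption for illustration; the class `LparCache` and all of its members are hypothetical.

```python
class LparCache:
    """Hypothetical model of the cache held by one GSKSRVR instance."""

    def __init__(self, name, sysplex):
        self.name = name
        self.sysplex = sysplex      # dict shared by all members: name -> cache
        self.entries = {}
        self.remote_lookups = 0     # counts direct calls to a partner
        sysplex[name] = self

    def store(self, token, session_data):
        # Assumption: the ID identifies the system that generated it.
        session_id = (self.name, token)
        self.entries[session_id] = session_data
        return session_id

    def lookup(self, session_id):
        origin, _ = session_id
        if origin == self.name:
            return self.entries.get(session_id)   # local hit or miss
        # Go straight to the one partner that owns the ID; no broadcast.
        self.remote_lookups += 1
        owner = self.sysplex.get(origin)
        return owner.entries.get(session_id) if owner else None


sysplex = {}
lpar_a = LparCache("LPARA", sysplex)
lpar_b = LparCache("LPARB", sysplex)
sid = lpar_a.store("abc123", {"cipher": "example"})
found_remotely = lpar_b.lookup(sid)   # resolved with one call to LPARA
```

In this model a region on LPARB validates an ID issued on LPARA with exactly one partner call, and IDs issued locally never leave the LPAR.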
You do need to ensure that all the regions which need to benefit from the sysplex cache:
- either use the same region user ID or use different user IDs which have READ access to each other's profile (GSK.SIDCACHE.user profile in the FACILITY class),
- or use the same external security manager.
See Sysplex session cache support in the z/OS Knowledge Center.
What about single CICS regions with their own hostnames?
Worried about using a sysplex capability when you really don’t need it? With this implementation, there’s only a very small cross-memory cost to check the local GSKSRVR when employing SSLCACHE=SYSPLEX even when you don’t strictly need it.
If you’re confident that TLS clients will only ever connect to particular single regions (small workloads or development/test regions), that will work without incurring any sysplex-wide overheads - the session ID will be found in the local GSKSRVR’s cache.
What’s not to like?
So how does this work with AT-TLS?
Everything I just described also applies when CICS is using AT-TLS. The only difference is in the way you configure it.
With AT-TLS, all TLS-related operational parameters are specified in AT-TLS policy statements. All the parameters that affect session ID caching are specified on the TTLSGskAdvancedParms statement (if you use the z/OSMF Network Configuration Assistant to configure your AT-TLS policy, most of these parameters are accessed through the AT-TLS->Security Level->Advanced Settings dialog under the Tuning tab). So when using AT-TLS to protect inbound connections to CICS servers, the CICS SSLCACHE parameter has no bearing on the session ID caching for those connections.
The AT-TLS default behavior relative to session ID caching is very similar to that of CICS. By default, AT-TLS enables session ID caching for each local System SSL environment (analogous to CICS SSLCACHE=CICS).
With AT-TLS, these environments are defined by the
A given TCP/IP stack can support multiple TTLSEnvironments, and the TTLSGskAdvancedParms statement is an extension of the
AT-TLS supports configuration parameters to control:
- GSK_V3_SIDCACHE_SIZE - the number of entries in the session ID cache (default 512)
- GSK_V3_SESSION_TIMEOUT - the lifetime of the cache entries (default 86,400 seconds)
Selecting sysplex scope for the AT-TLS session ID cache
By default, an AT-TLS session ID cache is NOT scoped across a sysplex. Specifying GSK_SYSPLEX_CACHE ON on the relevant TTLSGskAdvancedParms statement turns on sysplex caching (analogous to CICS SSLCACHE=SYSPLEX).
If you use the Network Configuration Assistant, you can enable sysplex session ID caching by selecting “Use Sysplex session identifier caching” on the AT-TLS->TCP/IP Stack->Connectivity Rule->Advanced dialog under the Tuning tab.
As described earlier, the GSKSRVR started task must be running on each LPAR that will share the cache.
As of z/OS V2R4, AT-TLS and System SSL also support the new TLSv1.3 protocol. (Note that CICS does not support TLSv1.3 when invoking System SSL directly.)
While TLSv1.3 continues to support TLS session resumption (partial handshakes), it uses a different approach: the server issues session tickets to clients, and these replace session IDs and the server-side cache. TLSv1.3 session tickets are opaque objects that contain all the information that the server needs to perform a partial handshake at a later time.
In z/OS V2R4, AT-TLS supports TLSv1.3 session tickets by default, but only at the TTLSEnvironment scope (that is, within a scope limited to a given TCP/IP stack).
Currently System SSL does not support sharing TLSv1.3 session tickets across the sysplex. The TTLSGskAdvancedParms policy statement provides a few different parameters to control session ticket behavior when AT-TLS is acting as a server. All of these parameters begin with the prefix
Summing up… just do it!
Optimization is almost always a good idea! Unfortunately, you might not realise that an important network security optimization is actually ineffective in your system. The impact of finding that out when workloads are at their most demanding can be rather unfortunate, whereas the operational and resource costs of fixing this before it happens are negligible.
Thanks to Phil Wakelin, Alyson Comer, Jonathan Cottrell and Chris Meyer for helping to write and review this article.