IBM Security Verify


Running untrusted code in SaaS

By Vivek Shankar posted Wed July 20, 2022 08:53 PM

  

Consider this:

  • You have a SaaS service that needs to offer several customization options that require some type of scripting
  • The "code" runs on your service
  • The "code" can make API calls out of the boundary of your system
  • You offer a free trial that allows anyone in the world to use and potentially exploit this capability
  • You use shared compute resources to drive down cost


This is a performance and security nightmare for a multi-tenanted environment. I grappled with this a few years back with IBM Security Verify SaaS (then IBM Cloud Identity). As with any other problem, the first thing I did was search online, and I came across this great article on the challenges of running user-supplied (i.e. untrusted) JavaScript. It does a great job of listing the challenges, so I won't repeat them here.

Can we use JavaScript?

JavaScript, typically run on Node.js, is probably the most widely used language today, but it is extremely difficult to sandbox and to stop untrusted code from:

  1. Accessing the process space and messing with it
  2. Peeking into other processes and hijacking compute resources
  3. Executing tight loops and I/O exploits
  4. Printing out system credentials

And so on...


But couldn't we isolate using containers? Yes, this is possible and is the approach used with popular compute engines like AWS Lambda, IBM Cloud Functions and others. The diagram below illustrates one way to implement this.



This requires the following:

  • A massive compute farm with pre-warmed containers, to avoid paying the cost of initializing the container process (usually Node.js) on every request
  • A method to assign a container to a tenant based on the request, to isolate execution of scripts
  • A method to remove the assignment if the container has been idle for a certain period, to recover resources
  • Containers that run rootless, with nothing in the environment variables that might resemble a credential or anything sensitive
  • A locked-down list of the HTTP and other targets that scripts can call out to
  • A process for keeping up with fixes for the latest vulnerabilities, particularly the ones that allow unprivileged processes to break out of the container boundary
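
To make the assignment and reclamation requirements above concrete, here is a minimal Go sketch of a pool that binds one pre-warmed container to each tenant and releases the binding after an idle period. It is an illustration only, not how any particular compute engine works: the names (Pool, Acquire, ReclaimIdle) are hypothetical, containers are represented by plain string IDs, and the caller is assumed to destroy reclaimed containers rather than reuse them.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrExhausted is returned when every pre-warmed container is already assigned.
var ErrExhausted = errors.New("no pre-warmed containers available")

// lease records which container a tenant holds and when it was last used.
type lease struct {
	containerID string
	lastUsed    time.Time
}

// Pool assigns pre-warmed containers to tenants (hypothetical sketch).
// Container IDs stand in for handles to real sandboxes.
type Pool struct {
	mu       sync.Mutex
	free     []string          // pre-warmed, unassigned containers
	assigned map[string]*lease // tenant ID -> lease
	idleTTL  time.Duration
}

func NewPool(warm []string, idleTTL time.Duration) *Pool {
	return &Pool{
		free:     append([]string(nil), warm...),
		assigned: make(map[string]*lease),
		idleTTL:  idleTTL,
	}
}

// Acquire returns the container bound to the tenant, binding a free one on
// first use. A container is never shared between two tenants.
func (p *Pool) Acquire(tenant string, now time.Time) (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if l, ok := p.assigned[tenant]; ok {
		l.lastUsed = now
		return l.containerID, nil
	}
	if len(p.free) == 0 {
		return "", ErrExhausted
	}
	id := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	p.assigned[tenant] = &lease{containerID: id, lastUsed: now}
	return id, nil
}

// ReclaimIdle drops leases idle for longer than idleTTL and returns the
// container IDs the caller should tear down and replace with fresh ones.
func (p *Pool) ReclaimIdle(now time.Time) []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	var reclaimed []string
	for tenant, l := range p.assigned {
		if now.Sub(l.lastUsed) > p.idleTTL {
			delete(p.assigned, tenant)
			reclaimed = append(reclaimed, l.containerID)
		}
	}
	return reclaimed
}

// runDemo exercises the pool with two containers and three tenants.
func runDemo() string {
	start := time.Date(2022, 7, 20, 0, 0, 0, 0, time.UTC)
	p := NewPool([]string{"c1", "c2"}, 5*time.Minute)
	a, _ := p.Acquire("tenant-a", start)
	b, _ := p.Acquire("tenant-b", start)
	again, _ := p.Acquire("tenant-a", start.Add(4*time.Minute))
	_, err := p.Acquire("tenant-c", start.Add(4*time.Minute))
	reclaimed := p.ReclaimIdle(start.Add(6 * time.Minute))
	return fmt.Sprintf("%v %v %v %v", a != b, a == again, err == ErrExhausted, reclaimed)
}

func main() {
	fmt.Println(runDemo()) // prints "true true true [c1]"
}
```

Even this toy version shows where the cost comes from: the pool has to be sized for the peak number of active tenants, and every reclaimed container has to be replaced with a freshly warmed one.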

How did I solve this problem? Introducing CEL

Google's Common Expression Language (CEL) supports single-line scripting: it provides syntax checking and evaluation for expressions. It was originally designed to represent policy rules in Firebase and other Google products, but we were able to reuse this extensible expression language to execute user-supplied code in a variety of use cases. Here are a few:

  • Computing attribute claim values for an OAuth authorization grant
  • Building an account payload that is provisioned into an external system
  • Authoring access control policies using scripted rules
  • Transforming request and response payloads for web hook API calls

This language worked very well from a performance and security perspective. As the introduction to the language says:

"The Common Expression Language (CEL) is a non-Turing complete language designed for simplicity, speed, safety, and portability."


However, while we did add several utilities, data types and functions to supplement the default syntax, this became cumbersome very quickly for more complex expressions. Here is a relatively simple example:


user.getManager() != null ? (user.getManager().firstName + " " + user.getManager().lastName) : ""


Introducing multi-line expressions - CEL++

Rather than invent a completely new language, we chose to represent multi-line statements in a YAML document. Here is an example:


statements:
  - context: "manager := user.getManager()"
  - if:
      match: context.manager != null
      block:
        - return: context.manager.firstName + " " + context.manager.lastName
  - return: ""


Each YAML property value is a CEL expression. This extension offers several capabilities:

  • Variable assignment
  • Conditional checks and nested conditionals
  • Statement blocks and local variables
  • Return statements to exit the evaluation


The full syntax document can be accessed here.


Advantages to this approach

  • Compute resources are shared and do not require any complicated tenant isolation process
  • CEL provides support for most operators and functions that you would expect in an expression language
  • The expression is guaranteed to finish executing
  • Using CEL in a YAML-based language simplifies development while retaining CEL's advantages

The architecture looks a lot simpler, as you can see here.


Things to consider with CEL


While this article is focused on how Verify used CEL, the following can be applied to any expression language.

  • CEL is extensible, so you might be tempted to add a number of utilities and functions. Be wary of what you add: you might end up introducing Turing-complete characteristics, at which point an expression is no longer guaranteed to complete, or functions that do complete but with no regard for resource consumption.
  • Limit the number of instructions allowed and/or ensure that the expression can be terminated when it exceeds a maximum computation time. We use Go in IBM Security Verify for this engine and leverage the Go context timeout to terminate the program.
  • If you implement utilities like HTTP clients, limit the memory consumption and make liberal use of timeouts.
  • Build layers and controls - rate limits, quarantine measures for badly behaving scripts, size of the expression, etc.
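
The instruction limit and the context timeout mentioned above can be combined in one evaluation loop. The following Go sketch assumes a hypothetical evaluator that works in discrete steps (Step, Evaluate and EvaluateWithLimits are illustrative names, not part of CEL or Verify); only the context.WithTimeout call is standard Go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrBudget is returned when the evaluation needs more steps than allowed.
var ErrBudget = errors.New("instruction budget exceeded")

// Step is one unit of work of a hypothetical evaluator.
type Step func()

// Evaluate runs steps in order and stops early when the step budget is
// spent or the context has been cancelled (for example by a timeout).
func Evaluate(ctx context.Context, steps []Step, maxSteps int) error {
	for i, step := range steps {
		if i >= maxSteps {
			return ErrBudget
		}
		if err := ctx.Err(); err != nil {
			return err
		}
		step()
	}
	return nil
}

// EvaluateWithLimits applies both controls: a step budget and a deadline.
func EvaluateWithLimits(steps []Step, maxSteps int, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return Evaluate(ctx, steps, maxSteps)
}

// describe maps an evaluation error to a short label.
func describe(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.Is(err, ErrBudget):
		return "budget"
	default:
		return "error"
	}
}

// runDemo shows a normal run, a budget overrun and a timeout.
func runDemo() string {
	fast := func() {}
	slow := func() { time.Sleep(50 * time.Millisecond) }
	a := describe(EvaluateWithLimits([]Step{fast, fast}, 10, time.Second))
	b := describe(EvaluateWithLimits([]Step{fast, fast, fast}, 2, time.Second))
	c := describe(EvaluateWithLimits([]Step{slow, slow}, 10, 10*time.Millisecond))
	return a + " " + b + " " + c
}

func main() {
	fmt.Println(runDemo()) // prints "ok budget timeout"
}
```

Note that the check is cooperative: the deadline is only noticed between steps, so any long-running utility called from a step should accept the same context and honor it.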

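For the HTTP client point in particular, one way to bound both time and memory in Go is to set an overall client timeout and cap how much of the response body is read. This is a sketch under assumptions: the function names, the 1 KiB limit and the test server are illustrative, and a production client would also restrict which hosts can be called.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// ErrTooLarge is returned when a response body exceeds the allowed size.
var ErrTooLarge = errors.New("response body exceeds limit")

// newScriptHTTPClient returns a client whose requests, including reading
// the body, must finish within the given timeout.
func newScriptHTTPClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

// fetchLimited GETs url and returns the body only if it fits in maxBytes.
func fetchLimited(c *http.Client, url string, maxBytes int64) ([]byte, error) {
	resp, err := c.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// Read one byte past the limit so an oversized body can be detected
	// without ever buffering more than maxBytes+1 bytes.
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBytes+1))
	if err != nil {
		return nil, err
	}
	if int64(len(body)) > maxBytes {
		return nil, ErrTooLarge
	}
	return body, nil
}

// runDemo calls a local test server once with a small and once with a
// large response.
func runDemo() string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/big" {
			io.WriteString(w, strings.Repeat("x", 4096))
			return
		}
		io.WriteString(w, "hello")
	}))
	defer srv.Close()

	c := newScriptHTTPClient(2 * time.Second)
	small, err1 := fetchLimited(c, srv.URL+"/small", 1024)
	_, err2 := fetchLimited(c, srv.URL+"/big", 1024)
	return fmt.Sprintf("%s %v %v", small, err1, errors.Is(err2, ErrTooLarge))
}

func main() {
	fmt.Println(runDemo()) // prints "hello <nil> true"
}
```
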
Conclusion

Running untrusted code in a multi-tenanted SaaS environment is hard. While a compute farm can be built to execute common languages such as JavaScript, Python, Groovy and others, it comes with significant challenges around runtime compute resource utilization, tenant allocation, and managing the blast radius in the event of exploits, to name a few. IBM Security Verify extends the Google Common Expression Language to solve this problem, along with several layers of controls on top of the computation environment.
