Effective IoT connectivity at scale 

Wed November 24, 2021 10:23 AM

IBM Maximo Application Suite[1] (MAS) provides internet of things (IoT) connectivity to your enterprise through the IBM Maximo Monitor[2] application. IoT provides the ability to define and manage an estate of connected devices in your wider Maximo solution, opening opportunities to better understand and manage your asset estate through real-time communications.

Implementing a transformational IoT solution that works at scale – connecting thousands to millions of your assets – has potential pitfalls. This article describes best practices and patterns to enable you to understand and avoid these pitfalls, helping you plan a successful IoT solution from the outset.

The importance of connectivity

Fundamental to IoT is the ability of assets to connect to MAS. The primary protocol to support IoT is MQTT[3], a protocol that IBM has helped develop and standardise. The MQTT protocol offers lightweight resource usage and the ability – in MAS – to scale to vast numbers of assets; MAS adds further enhancements to simplify and secure these assets. Full details of MAS’s support of MQTT are beyond the scope of this article, but at a high level are summarised by messages being sent to/from an asset, wrapped inside an MQTT connection.

 

Figure 1: Overview of MQTT connectivity provided through IBM Maximo Monitor. Assets connect over MQTT and engage in publish-subscribe messaging to send events and receive commands.

Three tenets underpinning a successful IoT philosophy

Connections are critical to IoT. Three fundamental tenets will underpin our philosophy and patterns.

Tenet T1: “Connections do not always succeed”. Just because an asset attempts an IoT connection to MAS does not mean it will succeed. There could be short- or long-lived issues preventing connectivity, and these issues could be localised (to the asset in question) or systemic across your entire IoT solution and estate.

Figure 2: Example causes of connection issues between assets and MAS, covering different durations and ranges.

Our philosophy and patterns will acknowledge the potential for these different classes of connection failures. Of course, it is possible to reduce the probability of these failures occurring (e.g., by ensuring an appropriately HA and scaled MAS, which we shall discuss in a later pattern), but these risks can never be fully eliminated.

Tenet T2: “Connections can, and will, be lost”. Despite improvements in connectivity across the globe, you cannot assume that a connection will run indefinitely. Let us classify three types of connection loss, because later we shall acknowledge and prepare for these in our patterns.

C1: Local

  • IoT asset’s local Wi-Fi connection is unstable.
  • IoT asset hard resets.

C2: Wide

  • Networking issue (e.g., network provider in a geographic region).
  • Worker node failure.
  • MAS pod reschedule or update.

C3: Global

  • Full loss of networking in the MAS deployment.
  • Significant failure in MAS (e.g., total loss of a backing database).

The likelihood of these connection losses can again be mitigated, but not eliminated, through careful choice of infrastructure.

Tenet T3: “Connections are expensive to establish”. Or rather, “connections are expensive to establish at scale”. Individually MQTT connections are cheap – it’s deliberately a lightweight protocol – but when scaled up to hundreds of thousands of concurrent connection attempts, these costs add up. In particular, TLS negotiation is CPU intensive. Our design philosophy will recognise the potential cost of new connections when considering patterns.

IoT philosophy of scale: do you have an IoT society or an IoT army?

These tenets are fundamental in underpinning an overall strategy. Through T2 we acknowledge that mass disconnect events (C2/C3) are possible; through T1, that subsequent reconnects may fail; and through T3, that mass reconnects can introduce significant load. Let's understand and discuss the consequences.

It is simple to code an individual MQTT client to connect and send IoT data but translating this to a successful solution on the macro scale across potentially millions of assets requires care. Why is this? Although IoT assets connect individually, common algorithms in connectivity and messaging can lead to assets responding identically – and in concert – to external stimuli. This can lead to potential feedback cycles, ultimately forming an inadvertent denial of service attack: your estate has become an army. Perhaps the most illuminating analogy is to consider the Millennium Bridge in London, UK, where pedestrians’ walking behaviour – in effect, humans’ connectivity/messaging algorithm – led to amplifying bridge oscillations, driving and synchronising the pedestrian behaviour further in a spectacular feedback loop[4].

Rather than an IoT army, we want to build a peaceful IoT society, with assets acting as IoT citizens in a peaceful estate. As in all societies, consideration of others and adopting selfless behaviour can often lead to more productivity and success overall: this fundamental principle is taken forward in the patterns we describe below.

Patterns for successful IoT solutions

We now consider nine patterns that are hallmarks of successful scalable IoT solutions. These patterns readily compose with each other, mutually amplifying benefits: the sum is greater than the parts.

Pattern P1: Detect connection loss in a timely manner

When an MQTT connection is lost, it is not guaranteed that both ends will recognise the loss in a timely manner: this is a fundamental limitation of the underlying TCP protocol. To accommodate this, the MQTT protocol has a keepalive capability: the asset and MAS can negotiate a mutual frequency at which data will be sent over the connection (either true messages, or heartbeats). If either side does not meet this contract, the other end assumes the connection is dead and terminates it. Without adopting keepalive, critical connections can otherwise appear running but are in fact lost.

Anti-pattern

An anti-pattern is not to use keepalives. In this case, dropped connections are potentially not detected in a timely manner, leading to the client waiting indefinitely for traffic that will never arrive.


Figure 3: MQTT client experiencing a dropped connection caused by network disruption.

Pattern

In the recommended pattern, a dropped connection is detected in a timely manner, allowing the client to recognise and reconnect, maintaining the solution’s integrity. Keepalive times are a balance between minimising net downtime and minimising resource usage.

Figure 4: MQTT client experiencing a dropped connection caused by network disruption, and using MQTT’s keepalive capability to recognise the connection loss. The client can then initiate a reconnect to restore connectivity.
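As a concrete sketch of the keepalive contract, the following illustrative Python class (not part of any MQTT library) tracks when traffic was last seen and presumes the connection dead after 1.5 times the keepalive interval, mirroring the grace period the MQTT specification allows the server:

```python
import time

class KeepaliveMonitor:
    """Tracks the last time any packet (message, PINGRESP, etc.) was
    seen and flags the connection as lost when nothing arrives within
    the grace window. The 1.5x factor mirrors the MQTT specification's
    server-side keepalive rule."""

    def __init__(self, keepalive_seconds, now=time.monotonic):
        self.keepalive = keepalive_seconds
        self.now = now  # injectable clock, useful for testing
        self.last_traffic = now()

    def record_traffic(self):
        """Call whenever any packet is sent or received."""
        self.last_traffic = self.now()

    def connection_presumed_dead(self):
        """True when the keepalive contract has been broken."""
        return self.now() - self.last_traffic > 1.5 * self.keepalive
```

In practice, production clients such as Eclipse Paho implement this for you when you pass a keepalive value to `connect()`, sending PINGREQ heartbeats on your behalf; the sketch simply makes the underlying contract explicit.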

Pattern P2: Identify and handle connection failures and losses appropriately    

Handling a connection loss or failure is more subtle than just blindly attempting a reconnect. Restoration of connectivity is the goal, but this needs to be done safely: blind reconnects as fast as possible are not necessarily the best approach as connection losses can indicate a systemic issue, and insensitive reconnects can compound situations because of the expense of establishing connections at scale. We propose an un-selfish citizen-based pattern to mitigate these concerns.

Anti-pattern

Assets individually try to reconnect as fast as possible – the “selfish” approach. Superficially this may seem to get your estate reconnected quickly, but in practice provides constant extra load on the system which may perpetuate, or cause further, connection issues.

Figure 5: Example behaviour of an asset repeatedly and rapidly attempting to establish a connection to MAS, and a representation of how such a process can – multiplied across a large IoT estate – impose unsustainable load on MAS’s IoT endpoints.

Pattern

A better pattern is to ensure IoT clients “back-off” connection attempts if chronic connection issues are detected. This can reduce load on the server, helping mitigate load-based causes of connectivity issues and – overall – leading to faster reconnections.

Figure 6: Example behaviour of an asset that recognises connection attempts are chronically failing, and so ‘backs off’ reconnect attempts to help reduce high load on MAS as being one of the causes.

Pattern

In this enhanced pattern, IoT clients also introduce a randomising element into the back-off algorithm to help reduce the possibility of “in-phase” connection attempts from large IoT estates.

Figure 7: Example behaviour of an asset that additionally randomises reconnect times to help prevent in-phase connection surges in MAS.
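A minimal sketch of exponential back-off with “full jitter” (the function name, base and cap values are illustrative choices, not MAS requirements):

```python
import random

def reconnect_delay(attempt, base=1.0, cap=300.0):
    """Exponential back-off with full jitter: each failed attempt
    doubles the ceiling (up to `cap` seconds), and the actual delay is
    drawn uniformly below that ceiling so that a large estate does not
    reconnect in phase after a mass disconnect."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

With these values, the first few attempts retry within seconds, while chronically failing clients settle into randomised delays of up to five minutes, spreading the reconnection load on MAS over time.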

 

Pattern P3: Acknowledge and handle “slow” connections

Suppose we have an IoT device or application that connects with an MQTT timeout of 5 seconds; i.e., the client ‘gives up’ and retries the connection after 5 seconds if a CONNACK is not returned. Under normal operation, this is comfortable: MQTT connections are usually quick to establish.

Figure 8: Example of an asset connecting to MAS successfully with a defined 5 second MQTT timeout

However, we have already stated that connections can be expensive to initiate – particularly at scale – and this can manifest itself as latency. Furthermore, MAS’s IoT technology has built-in denial-of-service protection to guard against compromised/malicious devices exhausting resources, and one of the consequences of this is that connections from devices can be deliberately delayed. If your asset “gives up” on a connection attempt too early and just attempts another connect, you may perpetuate the latency, making it hard for the asset to escape the connectivity issues. Let’s consider a scenario where latency is 7 seconds for a connection:

 

Anti-pattern

With a 5s timeout, the army of devices gives up on each connection attempt, failing to wait for the successful CONNACK arriving 2 seconds later. The assets treat this as a failed connection and simply attempt a reconnect. As well as perpetuating load on the server, MAS’s IoT technology will detect this behaviour as devices repeatedly stealing their own connections, potentially triggering denial-of-service protection that further throttles responses and increases latency. The device is caught in a vicious cycle, unable to escape from the failed connections.

Figure 9: Example behaviour of an asset that does not wait sufficient time for an MQTT CONNACK response from MAS. Note a connection is successfully established, but this is not recognised by the asset.

Pattern

In the pattern, devices adopt timeouts of at least 30 seconds for both TCP and MQTT connections. This tolerates a far wider range of (perhaps short-lived) spikes in load, helping prevent a connection storm from brewing.

Note that a longer timeout does not mean devices take longer to connect!


Figure 10: Appropriate MQTT connect timeout options helping assets tolerate latency in device authentication and connection.
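The vicious cycle above can be sketched with a toy simulation (all numbers are illustrative, not MAS behaviour): each abandoned attempt still costs the server a handshake and incurs a throttling penalty, so a 5-second timeout never escapes a 7-second latency, while a 30-second timeout connects on the first attempt:

```python
def simulate(timeout_s, initial_latency_s, max_attempts=10):
    """Toy model of the connection vicious cycle. The CONNACK arrives
    after the current latency; the client gives up after `timeout_s`.
    Every abandoned ('stolen') connection adds an illustrative 1 s of
    throttling latency. Returns the attempt number that succeeded, or
    None if the client never connects."""
    latency = initial_latency_s
    for attempt in range(1, max_attempts + 1):
        if latency <= timeout_s:
            return attempt  # CONNACK arrives before the client gives up
        latency += 1.0      # throttling penalty for the abandoned attempt
    return None
```

The asymmetry is the point of Pattern P3: the longer timeout costs nothing when the server is responsive, but breaks the feedback loop when it is not.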

 

Pattern P4: Decouple connectivity from messaging

An important pattern to adopt is to decouple connectivity from messaging as much as possible. What do we mean by this?

Well, consider a flow where after connecting, an IoT asset sends a message and waits for a response. If a response is not received, the connection is dropped and re-attempted. Once a response is received, the asset switches over to its business logic for sending events normally. An example here might be a request-response registration flow.

 
Figure 11: Schematic of firmware logic in which the asset’s connection success is dependent on a successful request-response flow, after which the asset ‘switches’ to running behaviour – e.g., sending sensor readings

This is considered a potential anti-pattern because any performance issues or bottlenecks in the downstream application or middleware processing the initial message have a direct effect on connectivity.

Anti-pattern

Suppose we have a mass C2/C3 event, and a large number of devices initiate a connection and send their initial message. Now suppose that latency in the downstream system (e.g., perhaps because of the sheer load of registration messages) means a response is not sent in a timely manner to all devices. Affected devices initiate a disconnect and reconnect, adding further registration messages to the backlog in the backend system, perpetuating latency. Note that latency can also worsen if further established connections fail (e.g., through natural C1 losses, or multiple C2/C3 events), adding to the army of reconnecting devices.

Figure 12: Example behaviour of a device that does not receive a response in a “timely” manner, leading it to disconnect and reconnect, further compounding the issue. Note how server load typically increases in such a scenario as other disconnect events lead to further assets entering connectivity limbo, adding fuel to the fire.

Pattern

In this pattern, connectivity is decoupled from any messaging. This helps add a firebreak between connectivity issues and performance issues in the downstream solution, helping avoid positive feedback cycles in which messaging and reconnection load increases.

Figure 13: Decoupling connectivity ensures that any latency in request-response for the overall IoT solution does not drive further disconnects in MAS.
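A sketch of what this decoupling can look like in client logic (an illustrative class, not a real MQTT library API): connection success depends only on the CONNACK, and a registration timeout re-queues the request rather than tearing down the expensive connection:

```python
import queue

class DecoupledClient:
    """Sketch of Pattern P4: connection state and business messaging
    are handled independently. A registration timeout re-queues the
    request for retry instead of dropping the MQTT connection."""

    def __init__(self):
        self.connected = False
        self.registered = False
        self.pending = queue.Queue()  # outbound messages awaiting (re)send

    def on_connect(self):
        # Connection success depends only on the CONNACK, never on
        # downstream application responses.
        self.connected = True
        self.pending.put("register-request")

    def on_registration_timeout(self):
        # The anti-pattern here would be to disconnect and reconnect.
        # Instead, keep the connection and retry the message later
        # (ideally with back-off, as in Pattern P2).
        self.pending.put("register-request")

    def on_registration_response(self):
        self.registered = True
```

The firebreak is visible in `on_registration_timeout`: downstream latency now costs a retried message, not a reconnection storm.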

 

Pattern P5: Build and test an appropriately resilient solution

We have already alluded to the causes and risks of connection loss. We now explore how the risks of loss, and their consequences, can be mitigated. Primarily, you should consider high availability of MAS and its dependencies, and prepare for and understand the impact of C2/C3 events on your solution.

Figure 14: Sample impact of a wide failure in a MAS deployment in a 3-zone multizone cluster. Note that the remaining zones must tolerate increased connection handshakes as well as handling rescheduling of the failed pods.

Pattern

  • Expose highly available MAS and IoT endpoints, preferably across multiple datacenters (e.g., using IBM Cloud’s multizone clusters), to provide resilience to hardware failures
  • Ensure surge capacity in the event of a failure: e.g., partial loss of an Availability Zone will trigger a C2 event, and so load will be expected to increase on the remaining infrastructure beyond simple pod rebalancing.
  • Robust testing at scale to understand how your assets and solution work under failure scenarios – detection and elimination of any unwanted/unexpected IoT “army” behaviour
  • Ensure your load balancers can sustain connections under failure scenarios. At the time of writing, we have observed that OpenShift’s native ingress can fail at 100,000+ concurrent connections, requiring adoption of provider-specific load balancers (e.g., NLB in IBM Cloud).

 

The remaining patterns talk in more detail about MQTT usage. These are most relevant if you are directly subscribing to device events rather than using the built-in forwarding technology.

Pattern P6: Use MQTT’s QoS appropriately

MQTT offers multiple qualities of service to help assure delivery of messages when required. Importantly, the higher the quality of service, the higher the resource cost. We encourage you to review your message flows and only use higher qualities of service when appropriate – i.e., when the loss of individual events has a demonstrable impact on your solution. For example, consider an asset sending sensor readings every 60 seconds: can those events be sent at QoS 0? If an individual event is lost, your solution knows a new event should arrive in 60 seconds’ time. Adopting this pattern can reduce the cost of establishing and maintaining a connection, helping reduce the impact and cost of C2/C3 events.
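As a sketch, such a policy can be as simple as a per-message-kind lookup (the kinds and mapping below are illustrative, not a MAS requirement); MQTT sets QoS per publish, so a single client can mix levels freely:

```python
def choose_qos(message_kind):
    """Illustrative QoS policy: periodic telemetry tolerates loss (the
    next reading is only 60 s away) so uses QoS 0; one-off alarms and
    commands, whose loss has a demonstrable impact, use QoS 1."""
    periodic_kinds = {"telemetry", "heartbeat"}
    return 0 if message_kind in periodic_kinds else 1
```

The value of writing the policy down explicitly is that each flow’s QoS becomes a reviewed design decision rather than a client-library default.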

Pattern P7: Use MQTT's cleanSession/sessionExpiry appropriately

MQTT offers durable subscriptions through its cleanSession=false (MQTT v3) and sessionExpiry (MQTTv5) settings, enabling messages to buffer up inside MAS for temporarily disconnected assets and applications. This can be useful for helping assure message delivery, but in some cases is counter-productive: there is finite resource for buffering messages, and accumulation of old messages can prevent the subscriber processing new messages quickly on reconnection. Furthermore, durable subscriptions require messages to be persisted to MAS storage, leading to lower performance/efficiency per asset than compared to the equivalent non-durable subscriptions (cleanSession=true in MQTTv3, sessionExpiry=0 in MQTTv5). Durable subscriptions should therefore only be used when necessary. Pattern 9c considers some techniques to help understand and mitigate MAS’s buffer limits.

Pattern P8: When using gateways, use wildcarded RLAC subscriptions judiciously

Your assets do not have to connect directly to MAS. Instead, you can elect to use gateways, allowing assets associated with the gateway to connect indirectly. Gateway devices can subscribe to asset-bound data on a wildcarded topic such as iot-2/type/+/id/+/cmd/+/fmt/+ and iotdm-1/type/+/id/+/#. Normally, subscriptions for these topic strings would receive messages for all devices but with the resource level access control (RLAC) feature of MAS, messages are filtered so that only messages for devices in the RLAC group of the gateway are received. This RLAC-based filtering is much less efficient than the highly optimised filtering of topic strings so gateways with only small numbers of devices in their group should make subscriptions for each individual device rather than relying on RLAC group filtering.
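A sketch of this decision in Python (the threshold value is an illustrative assumption; the topic shapes are those given above):

```python
def gateway_subscriptions(devices, wildcard_threshold=50):
    """Sketch of Pattern P8: with few devices in the gateway's RLAC
    group, subscribe per device so the broker's highly optimised
    topic-string matching does the filtering; above an (illustrative)
    threshold, fall back to the wildcarded topics and let RLAC filter.
    `devices` is a list of (deviceType, deviceId) pairs."""
    if len(devices) > wildcard_threshold:
        return ["iot-2/type/+/id/+/cmd/+/fmt/+", "iotdm-1/type/+/id/+/#"]
    topics = []
    for device_type, device_id in devices:
        topics.append(f"iot-2/type/{device_type}/id/{device_id}/cmd/+/fmt/+")
        topics.append(f"iotdm-1/type/{device_type}/id/{device_id}/#")
    return topics
```

The appropriate threshold depends on your broker's relative cost of many subscriptions versus RLAC filtering, so it is worth establishing through load testing rather than taking the value above at face value.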

Pattern P9: Backend MQTT applications need to cope with surges and disconnections

Your IoT solution may include custom applications consuming MQTT events directly from MAS, for example through performing bidirectional communication for control and configuration of fleets of assets as suggested in Figure 12 and Figure 13.

Figure 15: Role of a custom MQTT application in processing bidirectional communication with an asset.

These backend applications can become a bottleneck in IoT architecture – e.g., through Pattern P4. In particular, applications expected to process large numbers of messages from a significant IoT estate, either at steady state or under a C2/C3 incident, may require scaling to multiple instances to handle the load. MAS supports such a model, referring to the applications as “scalable applications”, identified with MQTT clientIds prefixed by “A:”. Here we describe specific aspects to consider for such scalable applications.

P9a QoS, Redelivery & CleanSession

First, let us describe best practices to support the different qualities of service MQTT offers to your application. We highlight potential anti-patterns around adopting QoS 2 “exactly once delivery” semantics: although superficially offering the “best” quality, in practice a lower quality of service allows much simpler, more efficient and robust applications. This builds on the recommendations already suggested in Pattern P6.

Refer to your MAS and MQTT client documentation for more information on using the recommended clientIds, cleansession settings and administrative subscriptions.


Messages consumed by app

Recommended patterns

QoS 0: “at most once delivery”

Under QoS 0, you have the most efficient type of application and you should therefore use this QoS where possible.

If messages need to be buffered when all instances of the application are temporarily disconnected – e.g., under a C3 event – then creating an administrative subscription of type=nonpersistentAdmin and using scalable applications with a 3-part clientId is recommended.

QoS 1: “at least once delivery”

Under QoS 1, consider whether messages need to be persisted across restarts or upgrades of MAS. If persistence is not required, then best practice is the same as for QoS 0 messages above. If persistence is required then use an administrative subscription of type=admin, scalable applications with a 4-part clientId and cleansession=true (MQTTv3) or sessionexpiry=0 (MQTTv5).

QoS 2: “exactly once delivery”

Under QoS 2, your set of scalable applications need to connect and reconnect with the same clientId. To avoid duplicate processing, in-flight messages destined to one instance of an application (determined by the clientId) cannot be delivered to any other instance. If an instance disconnects, any in-flight messages for that instance will be trapped pending delivery until the instance returns, being destroyed after 45 days if the instance does not return. To prevent message loss when removing instances permanently (e.g., scaling down an application), the instance must unsubscribe and process any in-flight messages.

This delicate shared state across server and client instances is a good reason to use lower qualities of service unless necessary.

 

Pattern P9b: Message surges

A common challenge faced by backend MQTT applications is handling surges in message numbers (e.g., after a C2/C3 disconnection). You should consider and test surges, recognising that MAS will not buffer messages infinitely or indefinitely. The actual buffer limit depends on the policy of the administrative subscription used, and so you should plan for appropriate scaling and response times to minimise the risk of buffers filling up.

Pattern P9c: Monitoring

MAS provides a REST API to provide details on applications’ subscriptions, including the number of messages buffered. For important applications it is best practice to use this API regularly to ensure that message buffering limits are not being reached, helping you understand when to scale the number of instances of your applications.
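A sketch of such a check (the JSON shape and field names below are hypothetical examples; consult the MAS REST API documentation for the actual endpoint and schema):

```python
import json

# Hypothetical response payload, illustrating the kind of data the
# subscription-monitoring API returns. Field names are assumptions.
SAMPLE = '''{"subscriptions": [
  {"clientId": "A:myorg:myapp", "bufferedMsgs": 450000, "maxMsgs": 500000},
  {"clientId": "A:myorg:other", "bufferedMsgs": 1200,   "maxMsgs": 500000}
]}'''

def subscriptions_near_limit(payload, threshold=0.8):
    """Return the clientIds of subscriptions whose buffered-message
    count exceeds the given fraction of their limit - i.e., candidates
    for scaling out before the buffer fills and messages are lost."""
    data = json.loads(payload)
    return [s["clientId"] for s in data["subscriptions"]
            if s["bufferedMsgs"] >= threshold * s["maxMsgs"]]
```

Polling this regularly (and alerting on the result) turns a silent buffer overflow into an actionable scaling signal.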

 

Pattern P9d: ClientId choice

Be careful of using random clientIds in your applications, especially a new one for each connection. This can make debugging issues difficult (your client’s identifier changes with each connection), and can lead to orphaned state in MAS (e.g., orphaned subscriptions). Predictable and reusable clientIds are generally much more successful, especially when supporting Pattern P9a.

Summary

Nine patterns have been introduced to help plan for a successful IoT solution within IBM Maximo Application Suite. By considering the tenets and challenges of connectivity up front, these patterns help you prepare for, and handle, connectivity issues in a large-scale IoT solution.

[1] https://www.ibm.com/products/maximo

[2] https://www.ibm.com/uk-en/products/maximo/remote-monitoring

[3] http://docs.oasis-open.org/mqtt/mqtt/v5.0/mqtt-v5.0.html

[4] https://en.wikipedia.org/wiki/Millennium_Bridge,_London#Resonance


#Maximo
#AssetandFacilitiesManagement
