Organizations in sectors such as healthcare and finance increasingly want to leverage large AI models without exposing sensitive data to cloud servers. Researchers have turned to a cryptographic technique known as Secure Multi-Party Computation (MPC), which splits data into randomized secret shares distributed across multiple servers. The servers can then jointly compute an AI model's output on those shares while no single server ever sees the raw input data.
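The share-and-compute idea can be illustrated with additive secret sharing, a minimal sketch under assumed parameters (the modulus and share counts here are for illustration; production MPC frameworks fix their own protocols and field sizes):

```python
import secrets

# Illustrative prime modulus for the field; real frameworks choose their own.
PRIME = 2**61 - 1

def share(value: int, n_servers: int) -> list[int]:
    """Split `value` into additive shares; any n-1 shares reveal nothing."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_servers - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares: list[int]) -> int:
    """Only the sum of ALL shares recovers the original value."""
    return sum(shares) % PRIME

secret_input = 123456789
server_shares = share(secret_input, 3)   # one share per server
assert reconstruct(server_shares) == secret_input

# Each server can add its shares of two secrets locally; reconstructing
# the local sums yields the sum of the secrets -- the basic building
# block MPC uses to evaluate model arithmetic without decrypting.
a, b = share(10, 3), share(20, 3)
local_sums = [(x + y) % PRIME for x, y in zip(a, b)]
assert reconstruct(local_sums) == 30
```

The overhead the article describes comes from the many rounds of communication needed when this idea is extended from addition to the multiplications and non-linear operations inside a neural network.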
However, the use of MPC comes with a significant drawback: speed. While a conventional mid-sized language model can generate results in under a second, processing the same data through MPC can extend the time required to over 60 seconds due to the encryption overhead.
Limitations of Existing Solutions
Previous efforts aimed at enhancing private inference have often focused on redesigning AI models to reduce operational costs under encryption. Although these initiatives offer some improvements, they are limited by a common structural issue: every query, irrespective of its complexity, incurs the same computational cost when processed through the same model.
In typical AI applications, a common optimization strategy is to direct simpler queries to smaller, faster models while reserving larger, resource-intensive models for more complex queries. Implementing this routing mechanism in an encrypted environment is challenging because such decisions typically necessitate access to the input data, which must remain encrypted.
Introducing SecureRouter
A team of researchers from the University of Central Florida has developed SecureRouter, a system that integrates input-adaptive routing with encrypted AI inference. The system manages a diverse pool of models, ranging from a compact model with approximately 4.4 million parameters to a considerably larger model with around 340 million parameters. A lightweight routing component evaluates each incoming encrypted query and selects the most suitable model from the pool, all while the data stays encrypted; even the routing decision itself is never revealed in plaintext.
The routing mechanism is designed to balance accuracy against computational expenses, where cost is defined in terms of encrypted execution time rather than traditional parameter counts. Additionally, a load-balancing objective prevents the routing component from favoring a single model for all incoming queries.
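The kind of objective described above can be sketched in plaintext as follows. This is a hypothetical illustration, not the authors' implementation: the model names, cost table, accuracy estimates, and weights are invented, and in SecureRouter the comparison itself would run under MPC rather than in the clear.

```python
from collections import Counter

# Hypothetical encrypted-runtime costs (seconds per query) for two models
# in the pool; values are illustrative only.
COSTS = {"tiny-4.4M": 4.0, "large-340M": 60.0}

usage = Counter()  # how often each model has been selected so far

def route(acc_estimates: dict[str, float],
          cost_weight: float = 0.004,
          balance_weight: float = 0.05) -> str:
    """Pick the model maximizing predicted accuracy minus penalties for
    encrypted execution cost and for over-use of any single model."""
    total = sum(usage.values()) or 1
    def score(name: str) -> float:
        return (acc_estimates[name]
                - cost_weight * COSTS[name]          # cost in encrypted time
                - balance_weight * usage[name] / total)  # load balancing
    choice = max(acc_estimates, key=score)
    usage[choice] += 1
    return choice

# A near-tie on predicted accuracy favors the cheap model; a large
# accuracy gap justifies paying for the big one.
route({"tiny-4.4M": 0.90, "large-340M": 0.91})
route({"tiny-4.4M": 0.50, "large-340M": 0.91})
```

The key design point the article highlights is that cost is measured in encrypted execution time, not parameter count, so the penalty term directly reflects what the client actually waits for.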

Performance Improvements
In comparative tests against SecFormer, a private inference system that relies on a single fixed large model, SecureRouter ran inference 1.95 times faster on average across five language understanding tasks. Speedups ranged from 1.83 times on the most challenging task to 2.19 times on the simplest, showing the router's ability to match model size to query complexity.
Compared with deploying the large model for every query, SecureRouter averaged a 1.53 times speedup across eight benchmark tasks. In most cases, accuracy stayed within a small margin of the large-model baseline, though one task focused on grammatical analysis showed a more pronounced drop, suggesting that certain specialized tasks are more sensitive to being handled by smaller models.
Minimal Overhead
Introducing a routing layer into an encrypted inference system could itself become a bottleneck. In practice, however, the routing component consumes only about 39 MB of memory in a two-server configuration, comparable to the 38 MB required to run the smallest model alone. By contrast, the largest model in the pool demands approximately 3,100 MB. The router adds an estimated 4 seconds to inference time and approximately 1.86 GB of network communication, figures on par with operating the smallest model independently.
Practical Implications
The SecureRouter system is designed to integrate seamlessly with existing infrastructure, requiring no major overhauls. It operates atop current MPC frameworks and utilizes standard language model architectures that are readily accessible through common libraries. This allows for efficient resolution of straightforward queries with smaller models, while more demanding queries can be escalated to larger models. Clients submitting queries receive only the final output, with no visibility into which model was utilized for processing.
Source: Help Net Security News