Real-time rating system. A one manday prototype

This blog post is a companion to the real-time rating scenario placed on the Nussknacker demo site

Why Real-Time Rating Matters More Than Ever

Rating is the process of determining the cost or value of a service based on predefined rules. It plays a critical role in industries such as telecommunications, finance, transportation, and utilities, where services are consumed dynamically and often require immediate cost evaluation.

Traditionally, rating has been performed in batch mode - collecting usage data over a period and processing it in bulk to generate bills or reports. However, in many modern applications, batch processing is insufficient. Customers expect real-time feedback on their usage and costs, and businesses need to enforce limits, apply dynamic pricing, or trigger automated actions instantly.

To meet these needs, many organizations adopt an online synchronous architecture, where each transaction triggers a rating request that updates the customer’s balance or applies necessary charges. While this approach works up to a certain point, it eventually struggles under high transaction volumes. The core challenge is state management: real-time rating requires continuously tracking balances, quotas, or accumulated usage across millions of customers. Without efficient state handling, the system cannot scale, as each request needs access to the latest state before processing can continue. This is where stateful stream processing emerges as a necessary evolution.

You can read more about the evolution from synchronous architecture to stateful stream processing here.

 

Real-time rating example 

Telco rating introduction

The data and rating logic used in this scenario is typical for a telco mobile operator; however, the concepts presented in this example can be applied to any real-time rating problem. 

 

The processing is divided into several subgraphs; the subgraphs are connected using Kafka topics. In real case applications, each of these subgraphs would constitute a dedicated Nussknacker scenario. Please note that the term subgraph does not come from the “official” Nussknacker glossary - it is just a useful name for those disconnected algorithm parts of the scenario which you would edit with the Nussknacker Designer. The disconnected can be misleading here - the subgraphs are connected via Kafka topics. Presenting the overall rating “problem” as several subgraphs serve two purposes. First, it allows us to divide the processing into several smaller parts, which makes it easier to understand. Secondly, you can see events produced by each subgraph in a Kafka topic - just navigate to the Topics link in the top right corner of the scenario window. 

 

The following functionality needed in the rating is shown:

  • Enriching CDR with the customer profile
  • Simple rating - rate is proportional to the usage (call duration in our case)
  • Tiered rating - the cost of the call depends on the overall usage
  • Reacting in real-time to the change of the balance sign

 

The following functionalities are not shown:

  • Closed group rating - this is just the variation on the “enriching CDR with additional data”. If the customer profile had information on the closed group the given msisdn belongs to, an appropriate tariff would be applied.
  • Zones based rating. Typically the cost is different if the called person is in a different zone - being it an administrative region or county. This type of decision logic can be easily handled by Nussknacker’s decision table

 

Rating scenario overview

Real time rating on kafka

    1. Common preparations. This subgraph enriches the CDR record with customer profile information - notably the billing plan id. As it would be difficult to simulate and visualize here the effects of ever-changing customer profile information, this enrichment is implemented with a Nussknacker fragment which returns some faked data. See the next section for the discussion of how this would be implemented in a real-life scenario.
    2. Rate plan subgraphs. There are two such subgraphs - for rate plan A and for rate plan B. The rate plan A subgraph is pretty straightforward, there is probably nothing worth explaining.
      In the case of rate plan B (tiered), the endOfPeriod events are used to close the time window. Closing the time window effectively means that the aggregation is reset (to zero in the summation case). Using endOfPeriod events allows tiered tariffs applied to time windows of arbitrary length. You can read more about time window processing with Nussknacker here.
      The tiered rate plan is implemented with the decision table component - perfect for cases where complex decision trees cannot be easily modelled using filter/switch/split nodes.
    3. Aggregate and Act subgraph. The event rates are summarized in the session window. The reason behind choosing a session window is that it gives a lot of control on when and how it is closed. If we want the customer's positive balance should expire after some inactivity period (the customer does not make any calls) just set the session timeout accordingly. If you want the session window to never close, set the timeout to maximum value (106000 days =~ 290 years). Alternatively, you can formulate an end-session condition, based on the date and time, the content of the data, etc.
      The session window enriches the CDRs with the balance information.  If the balance changes sign, an action is generated - enable or disable outgoing calls. 

 

How stateful stream processing makes real-time rating superior to traditional approaches?

 

  • Customer balance is kept in memory all the time. This makes the updates to the balance very fast, effectively reducing load and improving throughput. 
  • The customer profile is kept in memory all the time. In a more realistic scenario customer profile (or the information needed for rating) would be kept in memory by Flink to avoid continuous round trips to the external system where customer profile is kept. The connection to the system which keeps the customer profile would be configured as the changelog connection - this would allow it to automatically update the customer profile kept in Flink memory. 
  • Uniform architecture. The overall architecture is fairly simple and very uniform - just Flink, Kafka and the system keeping the customer profile data. Having Apache Kafka on the input and output of the rating system serves well the continuous stream of data nature of the processing. 
  • Processing parallelization is handled by Flink. The application does need any logic for sharding (dealing with parallel instances of servers, processors, queues, etc), or routing to appropriate shards - everything is handled by Flink. Just state your parallelization level. 

 

Real-time rating solution based on Nussknacker

Even though I have been using Nussknacker for years, the first thought is always the same - it is just so easy to build Flink-based applications with Nussknacker. The implementation of the subgraphs implementing rating took just a couple of hours. I spent much more time on the subgraphs generating test data and dealing with imperfections of the test data generation: after some time all the msisdn’s were ending up with large negative balances and nothing interesting was happening. In the process of development I deployed over 100 versions - it is just so easy to modify the logic and deploy. The counts functionality helped me to understand where my logic did work as I wanted.  One way or another, in a matter of a couple hours I created a basic real-time rating system.

 

The data transformations required were very simple - just basic mathematical operations. The power of the approach where an expression language is used was more evident when it came to generating test data. Anybody who uses spreadsheets knows that you can do any data transformations with spreadsheets (other than programming loops). The same with Nussknacker.

 

The core requirements specific to real-time rating - low latency, high throughput, time-windows-based logic (session windows in our case), handling state (“customer balances”), and recovery from failure were provided by Flink via Nussknacker. The Flink “connection” was felt very lightly - 95% of my thinking and implementation effort was devoted to data and their transformations rather than Flink internals, programming API, Kafka, etc.

 

Please let me know if there are other rating requirements that you like to see prototyped with Nussknacker.

Zbigniew Malachowski