How Has TEKTELIC Implemented the Usage Statistics Service? Explained
4 min reading
18 April 2023
TEKTELIC provides LoRaWAN Network services daily, and with the rapid scaling of our network and user base, we require flexible Usage Statistics of our service for analysis and customer billing. In this article, we will tell you about our search for the most suitable solution for the Usage Statistics Service and the challenges we faced along the way. We want to share our experience in building highly loaded data acquisition services, to help you understand why we chose this particular solution and what its benefits and limitations are.
Problem to Address
The TEKTELIC technical team processes data from customers’ IoT deployments and delivers the parsed payloads to applications of their choice via different channels around the clock. We receive approximately 50 messages per second from 500K devices belonging to our 1,000 customers. Such a huge scale requires flexible Usage Statistics for:
Better data analysis
Optimized operation processes
With the high scale of operation and data turnover, there is a range of requirements for a reliable Statistics Service.
TEKTELIC business domain includes:
Devices (communicate with Gateways via LoRaWAN protocol)
Gateways (pass data between Devices and the Server).
Integrations (connections between the Server and customer Applications).
Hierarchy of users (Providers, Customers, Sub-customers).
With such a number of entities, the solution should be able to provide usage statistics for every entity listed. For instance, it should be able to answer questions such as:
How many active devices did Customer A have during this day?
How many messages were delivered for a specific Device Group during this month?
This data should be stored for at least one year, so the technical team can check it and refer to it when needed.
Previous Technology Stack
Before implementing the Usage Statistics Service, our technology stack included:
Postgres (mostly stores domain objects, like Device, Gateway).
Redis (is mostly used as a cache for the objects stored in Database (DB), but some frequently updated data is stored in Redis only, like the device’s last online timestamps).
Kafka (exchanges messages between services)
AWS Timestream (saves historical time-related data, e.g. device packets log).
Understanding that TEKTELIC already has quite a lot of tools and platforms for statistics tracking and analysis, we started analyzing the options for implementing the feature using the technologies we already had at hand, while decreasing their number.
Possible Solutions Considered
REST API script (existing solution)
TEKTELIC already had an ad-hoc script for analyzing service usage for some customers. It uses the REST API to fetch all the required data and is executed manually once per week. However, this solution is hard to scale to all customers, since it takes a long time to finish and adds a significant load on the DB.
The major drawbacks are as follows:
Takes too long to finish the script
For large customers, it takes several days for the script to finish due to the number of REST requests needed.
DB load is too high
For the normal data processing flow, the DB is rarely accessed directly, since all the entities are cached by Redis. However, for the usage analysis script, the DB has to be queried intensively, since not all use cases can be served from the cache (e.g. fetching every Device of a Device Group). This script alone causes the DB load to grow to five times the usual production level, putting the main operations at risk.
Requires additional query costs
Some of the data needed for the script is located in a pay-per-usage AWS service (Timestream). The script execution makes our monthly Timestream bill 2.5 times higher.
Blocks DB cost optimization
Due to periodic load spikes, we would not be able to downscale our DB, so significant cost reduction would be impossible.
Operates only with the current data
The script has access only to the snapshot of data at the moment of execution. Thus, it cannot calculate, for example, the maximum number of devices during the month.
TEKTELIC already stores device packet logs in Timestream, which could be used to calculate some usage statistics, but we had several concerns about scaling it as a solution.
The main concerns were as follows:
Slow for historical data
Timestream is a time-series database that stores recent data in memory and older data on disk. The disk queries, in turn, can be very slow (up to 5 seconds for a single request). Even though it can be configured to contain more data in memory, that would increase the price ludicrously.
Queries can be costly
As mentioned in the previous point, even with a seemingly low price ($0.01 per GB), it can result in a noticeable bill if you store enough data in memory rather than on disk.
Lack of experience in production
This technology was introduced fairly recently to our stack and we are still gathering experience on using it in production.
Hard to add new columns with historic data
Unlike SQL databases, where you can add a column and populate it by joining on other tables, with Timestream it is practically impossible. From our experience, when we tried inserting historical data into Timestream, we quickly ran into throttling due to AWS limits, and we would expect similar behavior when updating old data.
TEKTELIC also considered using Postgres to create the Usage Statistics Service. However, as a rule, we try to avoid storing real-time data in Postgres, so this option was quickly rejected.
Another of our options was Redis, and considering that one of the recommended Redis use cases is gathering statistics/scoreboards, we’ve decided to analyze it further.
We’ve taken a closer look at all the potential benefits and drawbacks of this platform before making any final decisions.
The advantages of the platform include:
Since Redis operates in RAM, it can handle the additional load without increasing the CPU too much. This includes both write and read operations. It allows us to use the existing Redis server without increasing the infrastructure budget.
We can store the data in the exact same way we are going to fetch it – just using key-value (counters) would be enough. Since Redis seems to cope with additional load, there is no need for additional in-memory aggregation. It means that the data could be pushed to Redis as soon as it is received, resulting in a straightforward implementation.
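Since the data is stored exactly the way it is fetched, the write path reduces to incrementing a key. Below is a minimal Python sketch of this counter pattern; a plain dict stands in for Redis (with a real client, the increment would be a single `INCRBY`), and the key scheme is our illustrative assumption, not TEKTELIC's actual format:

```python
from datetime import date

# Illustrative key scheme (an assumption, not TEKTELIC's actual format).
def counter_key(metric: str, day: date, entity_id: str) -> str:
    return f"stats:{metric}:{day.isoformat()}:{entity_id}"

# A plain dict stands in for the Redis key space in this sketch.
store: dict[str, int] = {}

def record(metric: str, day: date, entity_id: str, amount: int = 1) -> None:
    # With redis-py this would be: r.incrby(counter_key(...), amount)
    key = counter_key(metric, day, entity_id)
    store[key] = store.get(key, 0) + amount

record("messages", date(2023, 4, 18), "customer:42")
record("messages", date(2023, 4, 18), "customer:42", amount=3)
```

Because reads use the very same key, fetching a counter is a single GET, with no aggregation step in between.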
At the same time, we have noticed some drawbacks as well. For example, it could take additional space in Redis memory. RAM is a valuable resource and it could take a noticeable amount of memory if we decide to store the historic data in Redis as well.
All things considered, this option seemed to be more promising than others, so we’ve decided to go with it.
One of the scenarios for which we considered Redis to be a great fit is calculating the number of unique active devices. It can easily be implemented with Redis sets (adding a unique device identifier on write and fetching the set size on read).
However, such sets take up a lot of space (hundreds of MBs of RAM per day). Thus, we had to introduce cleanup logic that replaces the Redis set with an integer containing its size, once the set population is finished (i.e. a new day or month has started).
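A rough Python sketch of this set-then-compact approach follows; dicts stand in for the Redis key space, comments name the corresponding Redis commands, and the key scheme is an illustrative assumption:

```python
from datetime import date

# Dicts stand in for Redis in this sketch.
active_sets: dict[str, set] = {}   # "hot" per-day sets of device IDs
compacted: dict[str, int] = {}     # integers left after cleanup

def actives_key(day: date, customer_id: str) -> str:
    # Illustrative key scheme, not TEKTELIC's actual format.
    return f"stats:active_devices:{day.isoformat()}:{customer_id}"

def mark_active(day: date, customer_id: str, device_eui: str) -> None:
    # Redis: SADD key device_eui -- duplicate reports are ignored automatically.
    active_sets.setdefault(actives_key(day, customer_id), set()).add(device_eui)

def active_count(day: date, customer_id: str) -> int:
    key = actives_key(day, customer_id)
    if key in compacted:
        return compacted[key]                   # period already compacted
    return len(active_sets.get(key, set()))     # Redis: SCARD key

def compact(day: date, customer_id: str) -> None:
    # Once the day is over, replace the bulky set with its size only.
    key = actives_key(day, customer_id)
    compacted[key] = len(active_sets.pop(key, set()))

day = date(2023, 4, 18)
for eui in ("dev-a", "dev-b", "dev-a"):   # "dev-a" reports twice
    mark_active(day, "customer:42", eui)
compact(day, "customer:42")
```

The set absorbs duplicate reports for free, and after compaction only a small integer remains in memory for the closed period.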
Also, we had to deal with empty periods (e.g. a month when a customer added no new gateways). In order to calculate the correct Max Gateways per such period, some initialization logic had to be added.
In general, we wanted to make this service as agnostic as possible, so it can be reused for other possible use cases. As such, the created service was only calculating statistics on the provided data, without knowing the meaning of it.
As a final solution, we’ve decided to combine Redis and Kafka as core platforms for the Usage Statistics Service.
The process is now as follows:
Statistics data is sent to the Kafka topic and fetched by the Statistics service.
Each message contains the metric name, the date and period, an identifier (e.g. customer ID), and the statistic to calculate (e.g. Sum or Max). As mentioned earlier, such a design would help us add new metrics or statistics in the future (for example, Average or Min).
Redis key is calculated using the provided date so that on the next day there will be a new key to aggregate statistics and the old key will not be touched anymore.
These could be called “hot” keys for the current day and “cold” for the previous ones. Basically, there is no need to leave the cold keys in Redis, since they just take up space (e.g. they could be migrated to Postgres after the period is over). But to keep the service simple, we’ve decided to leave the cold keys in Redis for now. Removing such keys is super easy as well – just add a TTL during the key creation and it will be removed by Redis as specified.
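Putting the steps above together, here is a hedged sketch of the consumer side. The message field names and key layout are our assumptions rather than the service's actual schema, and a dict again stands in for Redis:

```python
import json
from dataclasses import dataclass

@dataclass
class StatMessage:
    # Field names are illustrative assumptions about the Kafka payload.
    metric: str      # e.g. "messages_received"
    period: str      # e.g. "2023-04-18" (a day) or "2023-04" (a month)
    entity_id: str   # e.g. "customer:42"
    statistic: str   # "sum" or "max"; "min"/"avg" could be added later
    value: int

store: dict[str, int] = {}  # stand-in for the Redis key space

def consume(raw: str) -> None:
    msg = StatMessage(**json.loads(raw))
    # The period is part of the key, so yesterday's "cold" keys are never
    # touched again; a TTL set on first write would evict them on schedule.
    key = f"stats:{msg.metric}:{msg.period}:{msg.entity_id}"
    if msg.statistic == "sum":
        store[key] = store.get(key, 0) + msg.value          # Redis: INCRBY
    elif msg.statistic == "max":
        # In real Redis, read-modify-write max needs a Lua script or WATCH/MULTI.
        store[key] = max(store.get(key, 0), msg.value)

consume('{"metric": "messages", "period": "2023-04-18", "entity_id": "customer:42", "statistic": "sum", "value": 2}')
consume('{"metric": "max_devices", "period": "2023-04", "entity_id": "customer:42", "statistic": "max", "value": 17}')
consume('{"metric": "max_devices", "period": "2023-04", "entity_id": "customer:42", "statistic": "max", "value": 9}')
```

Because the consumer only dispatches on the statistic name, it stays agnostic of what each metric actually means, as described above.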
Currently, the service is used for calculating the following metrics:
The Number of Active Devices/Gateways
The Max Number of Devices/Gateways
The Number of Messages sent/received by the Server
The Total Volume of data sent/received by the Server
Each metric, in turn, is available for each day/month and specific Provider/Customer/Integration/Gateway or Device Group.
Objectively assessing the current solution, we can outline some pros and cons, as well as areas for improvement.
The main advantage is that there is no additional cost: the service runs on the Redis and Kafka infrastructure we already had.
The service uses a noticeable amount of memory, since we keep historic data in Redis. A possible improvement here is moving the cold data into cheaper storage.
Redis has no SQL joins, so a separate HTTP request is needed to fetch each specific counter. A possible improvement here is creating batch APIs that return the data in a single response (e.g. the number of Active Devices for every Sub-customer of the provided Customer).
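The batch-API idea could look like the following sketch: Redis's MGET fetches many counters in one round trip, so a single HTTP response can cover every Sub-customer. The names and key scheme are illustrative assumptions, and a pre-populated dict stands in for Redis:

```python
# Pre-populated stand-in for the Redis key space ("cold" daily counters).
store = {
    "stats:active_devices:2023-04-18:sub:1": 10,
    "stats:active_devices:2023-04-18:sub:2": 4,
}

def batch_active_devices(day: str, sub_ids: list[str]) -> dict[str, int]:
    keys = [f"stats:active_devices:{day}:sub:{s}" for s in sub_ids]
    # With redis-py this is one round trip: values = r.mget(keys)
    # (MGET returns None for missing keys; we map those to 0 here.)
    values = [store.get(k, 0) for k in keys]
    return dict(zip(sub_ids, values))

result = batch_active_devices("2023-04-18", ["1", "2", "3"])
```

One such call replaces N separate HTTP requests, which is exactly the improvement outlined above.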
After a thorough consideration of the different options for the Usage Statistics Service and their functionality and capacity for scaling, TEKTELIC chose the best variant available in terms of functions, scalability, and price. We are still looking into ways of improving this solution, namely finding cheaper storage for the “cold” data and creating batch APIs that return data in a single response. But in general, the technical team and the customers are satisfied with the proposed solution so far, as it addresses all the primary needs of statistical analytics.