General Profiling for improving Iroha2 TPS

Status	IN PROGRESS
Stakeholders	Aleksandr Petrosyan Marin Versic
Outcome
Due date
Owner	Sam H. Smith

Background

CPU time and memory cost money. Faster software is better because it can do more with the same hardware. Adding new functionality is only useful so long as users are able to use it, and performance enables more use. It is therefore important that we start focusing on improving iroha2 performance. We expect that a 50x performance increase is an achievable goal.

Problem

Informed decisions require metrics

It is not hard to walk towards a goal on the horizon. But if you are blindfolded, that task becomes impossible: you will drift off course without something to guide you. The same goes for the iroha2 core team. We have a good idea of where we would like to go, but we will not get there unless we have a good guide, that is, good metrics. Specifically, we need real-world measurements of a network's throughput: its transactions per second (TPS).

Death by a thousand cuts

Currently the iroha2 codebase suffers from general slowness that is very hard to pin down to any particular place in the code. There is no obvious bottleneck we can address. Instead, we are faced with needing lots of small improvements in many places. The measurable performance difference due to any specific change is negligible, but combined they are what will get us to 50x. Even though we cannot use metrics to decide whether an individual change is good, they are still useful for making sure we have not made things worse. If you know that you haven't introduced a performance regression, you can refactor and simplify with confidence. This will allow us to do the necessary optimization and simplification faster.

Solution

I will regularly run TPS benchmarks on a set of four machines. This will give the iroha2 core team consistent insight into how code changes affect performance. I will establish a baseline TPS for the LTS release. That way we can make sure all our codebase simplifications are improvements and not regressions. Some of this work can be handled by devops once the routine has been established.
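The baseline comparison described above can be sketched roughly as follows. This is only an illustration: the function names, the 5% tolerance and all the numbers are hypothetical placeholders, not measurements from the benchmarking stand.

```rust
use std::time::Duration;

/// Transactions per second for one benchmark run.
/// (Hypothetical helper, not an iroha2 API.)
fn tps(committed: u64, elapsed: Duration) -> f64 {
    committed as f64 / elapsed.as_secs_f64()
}

/// True when `measured` falls below `baseline` by more than `tolerance`
/// (a fraction, e.g. 0.05 for 5%). The tolerance absorbs run-to-run noise;
/// its value here is an assumption.
fn is_regression(baseline: f64, measured: f64, tolerance: f64) -> bool {
    measured < baseline * (1.0 - tolerance)
}

fn main() {
    // Illustrative inputs only, not real results.
    let baseline = tps(10_000, Duration::from_secs(20));
    let candidate = tps(9_400, Duration::from_secs(20));
    println!("baseline {baseline} TPS, candidate {candidate} TPS");
    println!("regression: {}", is_regression(baseline, candidate, 0.05));
}
```

A check like this would not tell us which change is good, only that a change has not made things measurably worse than the LTS baseline.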

As for what changes we should make, there are a few key patterns that, in my opinion, we need to clean up. The first is synchronous code written as asynchronous code. In many places we have async functions whose only purpose is to asynchronously call another function. It is not uncommon for chains of these functions to be 3 or 4 calls deep. This code executes synchronously. Therefore it should be written like normal single-threaded synchronous Rust code.
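A minimal sketch of this first pattern, using only the standard library. All names (`accept_async`, `validate_async`, `check_size_async`, `MAX_PAYLOAD`) are hypothetical and not taken from the iroha2 codebase. The small `poll_count` helper stands in for a real async runtime and is only there to show that such a chain completes on its first poll, that is, it never actually suspends.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; sufficient for futures that never suspend.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Drives a future to completion and reports how many polls it took.
/// (Stand-in for a real runtime; assumption for this sketch.)
fn poll_count<F: Future>(fut: F) -> (F::Output, usize) {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}

const MAX_PAYLOAD: usize = 4096; // hypothetical limit

// The pattern: a chain of async functions that only forward to each other.
async fn check_size_async(len: usize) -> bool {
    len <= MAX_PAYLOAD
}
async fn validate_async(len: usize) -> bool {
    check_size_async(len).await
}
async fn accept_async(len: usize) -> bool {
    validate_async(len).await
}

// The same logic written as plain synchronous code.
fn accept(len: usize) -> bool {
    len <= MAX_PAYLOAD
}

fn main() {
    let (ok, polls) = poll_count(accept_async(100));
    println!("async chain: result {ok}, polls {polls}");
    println!("sync call:   result {}", accept(100));
}
```

Since nothing in the chain ever waits on I/O or another task, the async layers add state machines and indirection without adding any concurrency.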

The second pattern to be changed is "call and response" actor messages. We currently send node-internal messages and await their responses. The thing to realise about this pattern is that we are doing exactly what asynchronous functions were invented to do, except that we call ten times as many asynchronous functions as we need to. We also heap-allocate memory to create a oneshot channel and do lots of other pointless work. Node-internal call and response messages should simply be asynchronous functions.
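The difference can be sketched as below. To keep the example free of external crates, a standard-library thread and `std::sync::mpsc` channels stand in for the async actor and the oneshot reply channel, so this is an analogy rather than the actual iroha2 code. `Ledger`, `Msg` and the function names are hypothetical. In the real codebase the direct version would be an `async fn` where it needs to await something.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical piece of node state.
struct Ledger {
    height: u64,
}

impl Ledger {
    /// Direct call: no message, no reply channel, no extra allocation.
    fn height(&self) -> u64 {
        self.height
    }
}

/// "Call and response" message carrying its own reply channel.
enum Msg {
    GetHeight { reply: mpsc::Sender<u64> },
}

/// Spawns an actor that owns the ledger and answers messages.
fn spawn_actor(ledger: Ledger) -> mpsc::Sender<Msg> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The loop ends when every sender has been dropped.
        for msg in rx {
            match msg {
                Msg::GetHeight { reply } => {
                    let _ = reply.send(ledger.height());
                }
            }
        }
    });
    tx
}

/// The pattern in question: build a reply channel, send, then wait.
fn height_via_message(actor: &mpsc::Sender<Msg>) -> u64 {
    let (reply_tx, reply_rx) = mpsc::channel(); // fresh channel per request
    actor
        .send(Msg::GetHeight { reply: reply_tx })
        .expect("actor stopped");
    reply_rx.recv().expect("actor dropped the reply")
}

fn main() {
    let actor = spawn_actor(Ledger { height: 42 });
    println!("via message: {}", height_via_message(&actor));

    let ledger = Ledger { height: 42 };
    println!("direct call: {}", ledger.height());
}
```

Both paths return the same value, but the message path allocates a channel per request and crosses a queue twice, while the direct path is a single function call.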

There are many more patterns like these yet to be discovered. I am confident we can reach 50x; the question is simply how quickly we will get there. Either way, performance testing and profiling are essential.

Decisions

Sam H. Smith will do performance testing on a 4-node benchmarking stand.

Alternatives

We could try to set up automated performance testing, but this is currently quite difficult.

Have devops do some of this work.

Concerns

There is a concern that time spent optimizing will not be fruitful, or that the feature requirements will change so that the optimization work is made redundant.

Assumptions

We have assumed iroha2 is at least 50-100x slower than it needs to be.
