...

Operating as an execution layer (EL) client, Besu is responsible for validating and executing each block sent by the consensus layer (CL) client every slot (12 seconds). Fast and effective block execution and validation are key roles of an EL. Moreover, each validator has a window of 1/3 * SECONDS_PER_SLOT (4 seconds) at the start of its assigned slot to offer its local view of the chain and attest, once per epoch (6.4 minutes). In essence, the validator node (CL+EL) has a 4-second interval from the slot's beginning to receive the block (gossip time), validate and execute each transaction on the EL, and confirm the correct head on the CL side. As such, the faster the EL can execute the block, the better, since some blocks may arrive late, even 3.5 seconds after the slot's start, requiring the EL to execute the block as quickly as possible to earn the full attestation reward. It's important to highlight that about 85% of validators' rewards come from making attestations. With these considerations in mind, this article will focus on Besu's improvements to block processing performance.
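As a quick illustration of the timing arithmetic (a standalone sketch using the consensus constants mentioned above; SLOTS_PER_EPOCH is 32 on mainnet):

    // Sketch: the consensus timing constants behind the 4-second attestation window.
    final class SlotTiming {
        static final int SECONDS_PER_SLOT = 12;
        static final int SLOTS_PER_EPOCH = 32; // mainnet value

        public static void main(String[] args) {
            int attestationWindow = SECONDS_PER_SLOT / 3;                        // 4 seconds
            double minutesPerEpoch = SECONDS_PER_SLOT * SLOTS_PER_EPOCH / 60.0;  // 6.4 minutes
            System.out.println("Attestation window: " + attestationWindow + " s");
            System.out.println("Epoch length: " + minutesPerEpoch + " min");
        }
    }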


We're eager to share that since the Merge in September 2022, the Besu team has improved block processing performance threefold, reaching a 95th percentile of around 250 ms and a 99th percentile of around 500 ms on a solo-staking machine (AMD Ryzen 5 5600G, 32 GB DDR4-3200, 2 TB WD Black SN850 NVMe).

A TL;DR is provided towards the bottom with our recommended staking configuration.

Besu block processing performance Profile

To boost a client's performance, it's essential to collect metrics, evaluate the existing state, and then plan enhancements based on these insights. Extra data is also needed to identify which parts of the code consume the most time during the program's execution. Besu offers a plethora of metrics covering different facets of the execution layer, including block processing (newPayload calls), p2p, RPC, the transaction pool, syncing, and the database layer (RocksDB), among others. These metrics can be gathered by any monitoring system, like Prometheus, and can be displayed using a visualization tool such as Grafana with the existing Besu Full dashboard.
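For example, exposing the metrics endpoint and pointing Prometheus at it looks roughly like this (a sketch; --metrics-enabled, --metrics-host, and --metrics-port are existing Besu options, and 9545 is the default metrics port):

    # Expose Besu's Prometheus metrics endpoint (default port 9545).
    besu --metrics-enabled --metrics-host=0.0.0.0 --metrics-port=9545

    # Corresponding prometheus.yml scrape job (sketch):
    # scrape_configs:
    #   - job_name: 'besu'
    #     static_configs:
    #       - targets: ['localhost:9545']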

...

Since the Merge in September 2022, we've been committed to improving Besu's block processing performance, following a significant number of user reports about missed attestations on their validators. We've succeeded in boosting the performance by three times, lowering the median time from 1.71 seconds to 0.49 seconds on the m6a.xlarge AWS VM, and the 95th percentile from 2.98 seconds to 0.81 seconds. It's important to note that the m6a.xlarge AWS instance comes with 4 vCPUs and 16 GiB of RAM, and lacks NVMe storage. Most of the improvements have been made specifically to the Bonsai data layer implementation. If you're using Besu with Forest and experiencing performance problems, we suggest switching to Bonsai.


On better instances, such as a mid-spec Azure VM (Standard_D8as_v5: 8 vCPUs, 32 GiB of RAM, with a remote SSD disk), Besu's performance is even better. As shown in the screenshot below, both instances exhibit a median time of approximately 250 milliseconds and a 95th percentile of around 410 milliseconds.

...

One of the biggest improvements to the SLOAD operation was the implementation of the healing mechanism for the Bonsai flat database over accounts and storage. Bonsai is a feature in the Besu Ethereum client that maintains both a flat database and a trie structure simultaneously. This combination allows for faster and more efficient SLOAD operations, as sketched below.
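Conceptually, the read path looks like this (a minimal sketch with hypothetical FlatDatabase and StateTrie interfaces, not Besu's actual classes): while healing is incomplete, misses in the flat database fall back to the slower trie walk.

    import java.util.Optional;

    // Hypothetical sketch of a flat-database-first read path (not Besu's classes).
    interface FlatDatabase { Optional<byte[]> get(byte[] key); }  // single key-value read
    interface StateTrie { Optional<byte[]> get(byte[] key); }     // multi-node trie walk

    final class StorageReader {
        private final FlatDatabase flatDb;
        private final StateTrie trie;

        StorageReader(FlatDatabase flatDb, StateTrie trie) {
            this.flatDb = flatDb;
            this.trie = trie;
        }

        Optional<byte[]> sload(byte[] slotKey) {
            Optional<byte[]> flat = flatDb.get(slotKey);
            if (flat.isPresent()) {
                return flat; // the fast path: one key-value lookup
            }
            // Fall back to the trie while the flat database is still being healed.
            return trie.get(slotKey);
        }
    }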

...

[Image: Besu running without a complete flat database]

[Image: Besu running with a complete flat database]

Another improvement we introduced was turning off checksum verification during reads on RocksDB, because there are already different validations in place as per the Ethereum client specifications. To make this flag work, which it didn't initially, the Besu team had to contribute to the RocksDB project through this PR.
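In RocksDB's Java API, the equivalent setting looks like this (a minimal sketch, not Besu's actual code):

    import org.rocksdb.ReadOptions;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    final class FastRead {
        // Skip per-read checksum verification; block data is already validated
        // at the protocol level, as described above.
        static byte[] get(RocksDB db, byte[] key) throws RocksDBException {
            try (ReadOptions opts = new ReadOptions().setVerifyChecksums(false)) {
                return db.get(opts, key);
            }
        }
    }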

We also focused on fine-tuning RocksDB by activating Bloom filters. When a read request comes in, RocksDB uses the Bloom filter to check whether the key might be present in an SST (Sorted String Table) file without having to scan the entire file, which helps to speed up read operations. Additionally, we introduced a new flag for high-spec machines (with RAM > 16 GiB) to utilize more block cache, thereby reducing IO activity during SLOAD. The high-spec flag can be enabled with the Besu option --Xplugin-rocksdb-high-spec-enabled.
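Both tunings map to RocksDB's Java options roughly like this (a sketch with illustrative cache sizes, not Besu's actual configuration):

    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.BloomFilter;
    import org.rocksdb.LRUCache;
    import org.rocksdb.Options;

    final class TunedRocksDb {
        static Options create(boolean highSpec) {
            BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                // ~10 bits per key is a common Bloom filter sizing trade-off.
                .setFilterPolicy(new BloomFilter(10))
                // Illustrative sizes only: a bigger block cache on high-spec
                // machines keeps more hot SST blocks in RAM, reducing disk IO.
                .setBlockCache(new LRUCache(highSpec ? 1L << 30 : 1L << 27));
            return new Options().setTableFormatConfig(tableConfig);
        }
    }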


Besu has also benefited from the updates and releases of RocksDB itself, which have led to further performance enhancements.

SSTORE operation improvements

A lot of the time spent on the SSTORE operation is used to figure out the storage gas cost and the refund amount. This is done by looking at the current value, the original value, and the new one. The entire algorithm is outlined in the original EIP-1283.
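That decision tree can be sketched as follows (a simplified illustration using the constants and rules from EIP-1283, not Besu's implementation):

    // Simplified sketch of EIP-1283 net gas metering (constants from the EIP).
    final class Eip1283 {
        static final long SLOAD_GAS = 200;
        static final long SSTORE_SET_GAS = 20_000;
        static final long SSTORE_RESET_GAS = 5_000;
        static final long SSTORE_CLEAR_REFUND = 15_000;

        // Returns the gas cost and accumulates the refund into refund[0]
        // (a one-element array keeps the sketch self-contained).
        static long sstoreCost(long original, long current, long newValue, long[] refund) {
            if (current == newValue) {
                return SLOAD_GAS; // no-op write
            }
            if (original == current) { // slot untouched so far in this transaction
                if (original == 0) {
                    return SSTORE_SET_GAS; // writing a fresh slot
                }
                if (newValue == 0) {
                    refund[0] += SSTORE_CLEAR_REFUND; // clearing an existing slot
                }
                return SSTORE_RESET_GAS;
            }
            // Dirty slot: cheap write, but adjust refunds for how the value moved.
            if (original != 0) {
                if (current == 0) refund[0] -= SSTORE_CLEAR_REFUND;
                if (newValue == 0) refund[0] += SSTORE_CLEAR_REFUND;
            }
            if (original == newValue) { // value restored to its original
                refund[0] += (original == 0)
                    ? SSTORE_SET_GAS - SLOAD_GAS    // 19,800
                    : SSTORE_RESET_GAS - SLOAD_GAS; // 4,800
            }
            return SLOAD_GAS;
        }
    }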

...

The second improvement involved caching empty slots in the Bonsai accumulator. This optimization positively impacted both SLOAD and SSTORE operations. However, the improvement was more noticeable in SSTORE operations, as some of the original and current values can be empty. This PR had a huge impact on SSTORE execution performance, as we can see in the CPU profiling below; a simplified sketch of the caching idea follows the profiles.

[Image: CPU profile before the optimization]

[Image: CPU profile after the optimization]
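The caching idea itself is simple (a hypothetical sketch, not the Bonsai accumulator's actual code): record the fact that a slot was empty so repeated reads of the same missing slot never hit the database again.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;
    import java.util.function.Function;

    // Hypothetical sketch: cache Optional.empty() results too, so "slot is empty"
    // is remembered instead of being re-queried from the database.
    final class SlotCache {
        private final Map<String, Optional<byte[]>> cache = new HashMap<>();

        Optional<byte[]> read(String slotKey, Function<String, Optional<byte[]>> loader) {
            return cache.computeIfAbsent(slotKey, loader);
        }
    }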

General EVM improvements 

In addition, the Besu team has continued to work on enhancing the overall performance of the EVM. Engineering efforts from Hedera have focused mostly on non-storage-related EVM performance. Most of these performance improvements came from re-writing core portions of the EVM with performance in mind instead of using idiomatic Java patterns.

EVM Benchmarking

Improving performance without a consistent reference is akin to wandering in the desert at night: given enough time and space, you will wind up where you started. To provide that reference, a simple benchmarking suite called CartEVM (cartesian join + EVM) was written. For every operation, it creates a state test that exercises the same operation as many times as possible with a minimum of other supporting operations. Execution with large contracts, at large gas limits, and inside a profiler has provided a wealth of optimization opportunities. Notably, CartEVM can be configured to exercise all combinations of multiple opcodes exhaustively (the "cartesian join" part), which has unveiled some unexpected performance impacts.
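As a rough illustration of the approach (a hypothetical sketch, not CartEVM itself), a benchmark for a binary opcode such as ADD repeats the operation with only the minimal supporting pushes and pops:

    // Hypothetical sketch: generate bytecode that exercises one opcode repeatedly
    // (CartEVM generates full state tests; this only shows the core idea).
    final class OpcodeLoop {
        static final byte PUSH1 = 0x60;
        static final byte ADD = 0x01;
        static final byte POP = 0x50;

        static byte[] addBenchmark(int repetitions) {
            // Each iteration: PUSH1 1, PUSH1 2, ADD, POP -> 6 bytes of code.
            byte[] code = new byte[repetitions * 6];
            int i = 0;
            for (int r = 0; r < repetitions; r++) {
                code[i++] = PUSH1; code[i++] = 1;
                code[i++] = PUSH1; code[i++] = 2;
                code[i++] = ADD;
                code[i++] = POP;
            }
            return code;
        }
    }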

Below is a summary graph of benchmark runs with Java 21 on an M1 Mac Pro. Java 21 offers a 10-20% boost just for using it. Java 17 runs are flatter after 22.10.0 and are also lower than Java 21 runs in all cases. Five runs of each operation were executed, and the median and max values are shown. The results across all operations, plus a few select problematic cases, are averaged together to provide the numbers in this graph. There are two notable bumps on this graph, at 22.1.0 and 22.10.0, while the work after 22.10.0 has been focused on fixing some worst-case performance scenarios.

[Image: averaged EVM benchmark results across Besu releases]


Native Types Transition

The first bump, at 22.1.0, is mostly due to transitioning from rich Java types for items such as Gas, and moving stack items, to simpler data types. This netted out to nearly doubling the non-storage gas per second of the Besu EVM.
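The shape of the change looks roughly like this (an illustrative sketch, not Besu's actual classes):

    // Before: a rich wrapper type allocates a new object on every operation.
    final class Gas {
        private final long value;
        Gas(long value) { this.value = value; }
        Gas minus(Gas other) { return new Gas(this.value - other.value); }
    }

    // After: plain primitives keep gas accounting allocation-free in the hot loop.
    final class GasAccounting {
        static long charge(long gasRemaining, long cost) {
            return gasRemaining - cost;
        }
    }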

Inline EVM Operations Loop 

The next large bump came in 22.10.0 with the introduction of an inline operations loop. Prior to this, all operations were subclasses of a single Operation class, and the loop selected an Operation object (following the Gang of Four's Strategy pattern) and then called an overridden method to execute the operation against the current VM state. This "megamorphic method" is a known performance problem in "hot" Java code. Java JITs work best against "monomorphic methods", where there are only one or two possible receivers per call site.

To address this dispatch problem, all of the high-frequency operations were unwrapped into a single large switch statement, with each case calling a static form of the operation, as sketched below. The operations can still be used as virtual objects to build a custom VM based on the old pick-and-execute engine. Note that not all operations were unwrapped: more expensive and less frequently used operations didn't exhibit much performance improvement from unwrapping their dispatch into the switch.
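Conceptually, the change looks like this (a simplified sketch, not the actual Besu operations loop):

    // Before: a megamorphic virtual call with hundreds of possible receivers.
    interface Operation {
        void execute(long[] stack); // VM state reduced to a stack for the sketch
    }

    // After: hot opcodes dispatched through one switch to static, inlinable
    // methods; rare opcodes still go through the virtual Operation table.
    final class OperationLoop {
        static void step(int opcode, long[] stack, Operation[] table) {
            switch (opcode) {
                case 0x01 -> StaticOps.add(stack);
                case 0x02 -> StaticOps.mul(stack);
                // ... other high-frequency opcodes ...
                default -> table[opcode].execute(stack);
            }
        }
    }

    final class StaticOps {
        static void add(long[] stack) { /* pop two operands, push their sum */ }
        static void mul(long[] stack) { /* pop two operands, push their product */ }
    }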

Not hard-wiring in all operations also preserved important functionality for Hedera, Linea, and other downstream users of the EVM library. Being able to override non-performance-critical operations with operations more suitable for their use case increases the utility of EVM equivalence in non-mainnet use cases.

Security-related Optimizations

Most of the jitter seen in the past three quarterly releases is related to worst-case performance scenarios predominantly found by Guido Vranken (not all findings are published yet). Some of the best findings have come from his work fuzzing multiple EVM implementations and sharing the slow-executing cases. While these cases are not typical in mainnet use (otherwise we would have seen them before), they could easily be used as an attack to slow down Besu nodes and, in extreme cases, stall networks. This work also helped fill out missing features in the Besu EVM's fluent APIs.

Commit phase improvements

...

In our efforts to decrease memory usage in Besu, we initially compared various system memory allocators: malloc (the default), tcmalloc (Google), mimalloc (Microsoft), and jemalloc. Based on the metrics we gathered, we found that jemalloc outperformed the others for Besu's workload, reducing Besu's memory usage by over 40%. This significant reduction is primarily attributed to superior memory defragmentation and enhanced multithreading.

We also observed that setting MALLOC_ARENA_MAX to 1 or 2 significantly reduced resident memory usage. At present, the official Besu Docker image ships with jemalloc and MALLOC_ARENA_MAX set to 2. If you're running Besu natively, it will indicate in the logs whether it's using jemalloc. If jemalloc is not installed, you will see a message in the Besu logs stating, 'jemalloc library not found, memory usage may be reduced by installing it'.
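On a typical Debian or Ubuntu host, installing and preloading jemalloc looks like this (a sketch; the package name and library path vary by distribution):

    # Install jemalloc and preload it for the Besu process (path varies by distro).
    sudo apt install libjemalloc2
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
    export MALLOC_ARENA_MAX=2   # the value used in the official Docker image
    besu [options]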


We also observed that the OpenJ9 JVM implementation exhibits a better memory footprint, primarily due to its garbage collection (GC) implementation (GenCon), which frees up memory more frequently.


[Image: Besu memory footprint with OpenJ9]

Given this behavior with OpenJ9, we opted to tune the garbage collector (G1GC) of the HotSpot JVM implementation to mimic this behavior without affecting performance. We introduced three flags, -XX:G1ConcRefinementThreads=2 -XX:G1HeapWastePercent=15 -XX:MaxGCPauseMillis=100, which are now incorporated into the besu.sh script, and the outcome is similar to the OpenJ9 JVM implementation.

[Image: Besu memory footprint with the tuned G1GC flags]
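On releases that predate this change, or to experiment with different values, the same flags can be passed through the BESU_OPTS environment variable, which the besu.sh launcher reads (a sketch):

    # Pass the tuned G1GC flags to the JVM via BESU_OPTS.
    export BESU_OPTS="-XX:G1ConcRefinementThreads=2 -XX:G1HeapWastePercent=15 -XX:MaxGCPauseMillis=100"
    besu [options]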

With the default JVM installed (HotSpot), users simply need to check the logs to determine whether jemalloc is installed. If it's not, we recommend installing it to reduce Besu's memory footprint.

...

  • Enable the flat database healing with --Xsnapsync-synchronizer-flat-db-healing-enabled=true
  • Enable the high spec flag if your machine has more than 16 GiB RAM --Xplugin-rocksdb-high-spec-enabled
  • Install jemalloc to replace the default system allocator (malloc); Besu will automatically detect whether it is installed and preloaded as the system memory allocator (see the combined example after this list).
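Putting the recommendations together (a sketch; combine with your usual network, sync, and engine API options, and note the jemalloc path varies by distribution):

    # Sketch: recommended staking-related options combined.
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
    besu \
      --Xsnapsync-synchronizer-flat-db-healing-enabled=true \
      --Xplugin-rocksdb-high-spec-enabled \
      [other options]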

If you need to troubleshoot performance, check out our documentation here.

Here are some other best practices:

...

  • 95th percentile around 250 ms
  • 99th percentile around 500 ms

[Image: block processing time percentiles]

Future work around performance

...