Status
Stakeholders	Makoto Takemiya, Aleksandr Denisov
Outcome	What did you decide?
Due date	When does this decision need to be made by?
Owner	Ivan Rybin

Background

In decentralized and distributed systems, a rich toolkit for collecting information about execution is the only way to notice failures and react to them in time.

Core requirements

1. The distributed design requires continuous collection of information from all nodes of the system, regardless of the system's integrity (e.g. connection issues or desync).
2. A single solution must work regardless of the final configuration (a local developer network of several nodes, a distributed business network with different administrators, or even a public network).
3. Public and private information must be isolated so that private data is never disclosed.

Overview

In accordance with requirement #1, nodes must transmit their telemetry data to a configured endpoint. Aggregating and caching the data for later analysis is the responsibility of the end service.

Depending on the nature of the data, a node selects either the public or the private endpoint as the destination. All transmission follows a push model, from the node to the end service.
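The endpoint selection described above could look roughly like the sketch below. It is only an illustration: `TelemetryConfig` and `route` are hypothetical names, the endpoint values are placeholders, and the private field names are taken from the "Private" column of the Data section. The actual push (for example an HTTP request) is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Field names listed in the "Private" column of the Data table.
PRIVATE_FIELDS = frozenset({
    "os_name", "ip_address", "port", "modules",
    "available_ram", "available_disk", "cpu_consumption", "data",
})


@dataclass
class TelemetryConfig:
    # Hypothetical config shape; None means the node does not report there.
    public_endpoint: Optional[str] = None
    private_endpoint: Optional[str] = None


def route(message: Dict[str, object], config: TelemetryConfig) -> Dict[str, Dict[str, object]]:
    """Split one raw message into per-endpoint payloads to be pushed."""
    public = {k: v for k, v in message.items() if k not in PRIVATE_FIELDS}
    private = {k: v for k, v in message.items() if k in PRIVATE_FIELDS}
    routed: Dict[str, Dict[str, object]] = {}
    if config.public_endpoint and public:
        routed[config.public_endpoint] = public
    if config.private_endpoint and private:
        routed[config.private_endpoint] = private
    return routed
```

A node that opts out of telemetry would simply leave both endpoints unset, in which case `route` returns an empty mapping and nothing is pushed.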

Pic #1:

As shown in the picture, participation in the collection of telemetry data is optional, and Company C opts out. To respect this choice, the publicly available data should not contain any details about Company C's nodes.

End service

To support various monitoring and analytics tools, the end service can run in one of three modes:
1) As a simple proxy for incoming data, to support tools that use a push model. The service wraps incoming data with the node id and forwards it.
2) As a queue. The service caches all incoming data until the consumer(s) request it.
3) As an aggregation layer. The service stores all incoming data and calculates summaries for future requests.
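The three modes can be sketched as three small classes sharing one `receive` method. This is a minimal in-memory illustration, not a proposed implementation: the class names, the `{"node_id": ..., "data": ...}` envelope, and the per-node message count used as a summary are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def wrap(node_id: str, data: dict) -> dict:
    """Attach the sending node's id to a raw message (assumed envelope shape)."""
    return {"node_id": node_id, "data": data}


class ProxyService:
    """Mode 1: wrap each message with the node id and forward it at once."""

    def __init__(self, forward: Callable[[dict], None]) -> None:
        self._forward = forward

    def receive(self, node_id: str, data: dict) -> None:
        self._forward(wrap(node_id, data))


class QueueService:
    """Mode 2: cache messages until a consumer requests them."""

    def __init__(self) -> None:
        self._pending: Deque[dict] = deque()

    def receive(self, node_id: str, data: dict) -> None:
        self._pending.append(wrap(node_id, data))

    def drain(self) -> List[dict]:
        # Hand over everything cached so far and empty the queue.
        items = list(self._pending)
        self._pending.clear()
        return items


class AggregationService:
    """Mode 3: store messages and compute summaries on request."""

    def __init__(self) -> None:
        self._store: List[dict] = []

    def receive(self, node_id: str, data: dict) -> None:
        self._store.append(wrap(node_id, data))

    def summary(self) -> dict:
        per_node: Dict[str, int] = {}
        for item in self._store:
            per_node[item["node_id"]] = per_node.get(item["node_id"], 0) + 1
        return {"total_messages": len(self._store), "messages_per_node": per_node}
```

Because all three expose the same `receive` signature, a node would not need to know which mode the end service runs in.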

Data

Raw messages from nodes

Category	Public	Private
Initial	network_id: string; chain: string; name: string; version: string; location?: string; public_key: pubkey; startup_time: number	os_name: string; ip_address: string; port: string; modules: actor_slug[]
Tick	connected_peers: integer; tx_count: integer; bandwidth_upload: float; bandwidth_download: float; finalized_height: integer; finalized_hash: hash; latest_height: integer; latest_hash: hash	available_ram: integer; available_disk: integer; cpu_consumption: float
Block	type: proposal | vote | commit | finalized; height: integer; hash: hash; validators: pubkey[]; transactions: hash[]
Transaction	hash: hash; size: integer; signatories: pubkey[]	data?: blob
Connection	type: discovered | connected | disconnected; ping: integer; public_key: pubkey; karma?: integer	ip_address: string; port: string
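As an illustration of how one row of the table might map to typed structures, the sketch below models the Connection message with its public and private parts kept in separate types. The encoding of `pubkey` is not specified here, so a plain string is assumed; the class names and the `to_wire` helper are hypothetical.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Optional


class ConnectionType(Enum):
    DISCOVERED = "discovered"
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"


@dataclass
class ConnectionPublic:
    type: ConnectionType
    ping: int
    public_key: str              # assumed string encoding of a pubkey
    karma: Optional[int] = None  # optional in the table ("karma?")


@dataclass
class ConnectionPrivate:
    ip_address: str
    port: str


def to_wire(part) -> dict:
    """Serialize one part of a message, omitting unset optional fields."""
    out = {}
    for f in fields(part):
        value = getattr(part, f.name)
        if value is None:
            continue
        out[f.name] = value.value if isinstance(value, Enum) else value
    return out
```

Keeping the two parts in distinct types makes it harder to send a private field to the public endpoint by accident, since each endpoint only ever receives one of the two types.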

Aggregations

Peer-based:
1. Uptime
2. Avg bandwidth
3. Avg ping
State-based:
1. Finalized block
2. Average block time
3. Time since last block
4. Block propagation time
5. Avg number of transactions in blocks
6. Number of pending transactions
7. Number of active users (signatories)
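A few of the state-based aggregations above could be computed as in the following sketch. The raw Block message carries no timestamp, so the example assumes the end service records the time (in seconds) at which it received each finalized-block message; the function names and that assumption are illustrative only.

```python
from statistics import mean
from typing import Dict, List, Sequence


def average_block_time(finalized_at: Sequence[float]) -> float:
    """Mean gap between consecutive finalized-block receipt times."""
    gaps = [later - earlier for earlier, later in zip(finalized_at, finalized_at[1:])]
    return mean(gaps) if gaps else 0.0


def time_since_last_block(finalized_at: Sequence[float], now: float) -> float:
    """Seconds elapsed since the most recent finalized block was seen."""
    return now - max(finalized_at)


def avg_transactions_per_block(blocks: List[Dict]) -> float:
    """Average length of the `transactions` list over Block messages."""
    return mean(len(b["transactions"]) for b in blocks) if blocks else 0.0


def active_users(transactions: List[Dict]) -> int:
    """Number of distinct signatories across Transaction messages."""
    return len({key for tx in transactions for key in tx["signatories"]})
```

For example, receipt times of `[0.0, 2.0, 6.0]` give gaps of 2 and 4 seconds, so the average block time is 3 seconds, and with `now=10.0` the time since the last block is 4 seconds.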

Decisions

Minimize middleware count
Split private data (used for administration purposes) and public data

Additional Information

Prometheus metric types: https://prometheus.io/docs/concepts/metric_types/

Substrate telemetry types and open-source monitor: https://github.com/paritytech/substrate-telemetry/blob/master/backend/src/types.rs