Comment: Add overview of telemetry design and define initial data structures

...

Page properties

Status: NOT STARTED
Stakeholders:
Outcome:
Due date:
Owner: Ivan Rybin


Background

Observability is a key concern in any application development. Apart from logs, it can be achieved with other tools such as telemetry.

Problem

The main question for now is which telemetry we want to collect. A second problem is which metrics we can expose at the Prometheus level and which of them cannot be represented by Prometheus metric types.

Solution

There are two types of telemetry: peer-local info and global info. Both are listed here, together with the Prometheus metric types that can be used for them.

Peer info

  • Peer name
  • Peer location
  • Iroha version (counter?)
  • Peer uptime (counter?)
  • Peer current role (gauge with integer as role?)
  • Network speed (gauge)
  • Latency (gauge)
  • Last available block on peer (gauge)
  • Last finalized block by peer (gauge)
  • Block time (gauge)
  • Block propagation time (gauge)
  • Number of pending transactions (gauge)
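
The gauge-typed items in this list could be exposed in the Prometheus text exposition format roughly as follows. This is only a sketch using the Python standard library: the metric name `iroha_peer_pending_transactions`, the `peer` label and the `Gauge`/`render` helpers are hypothetical and not an agreed naming scheme or an existing Iroha API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Gauge:
    """A value that can go up or down, e.g. the number of pending transactions."""
    name: str
    help: str
    value: float = 0

    def set(self, value: float) -> None:
        self.value = value


def render(gauges: Iterable[Gauge], labels: Dict[str, str]) -> str:
    """Render gauges as Prometheus text exposition lines (label values are not escaped)."""
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    lines = []
    for gauge in gauges:
        lines.append(f"# HELP {gauge.name} {gauge.help}")
        lines.append(f"# TYPE {gauge.name} gauge")
        lines.append(f"{gauge.name}{{{label_str}}} {gauge.value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and label, for illustration only.
pending = Gauge("iroha_peer_pending_transactions", "Number of pending transactions")
pending.set(12)
print(render([pending], {"peer": "peer-1"}), end="")
```

String-valued items such as peer name or location do not fit a numeric sample value, so in a scheme like this they would more likely be carried as labels.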

Global info

  • Finalized block (gauge)
  • Average block time (histogram/summary)
  • Time since last block (gauge)
  • Block propagation time (histogram/summary)
  • Number of transactions in block (gauge)
  • Info about gas
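
For the histogram-typed items (average block time, block propagation time), the sketch below shows how a Prometheus-style histogram accumulates observations into cumulative buckets and how an average can be derived from its sum and count. The bucket bounds and the class itself are illustrative assumptions, not a proposal for the actual bucket layout.

```python
import bisect
from typing import List, Tuple


class Histogram:
    """Prometheus-style histogram: cumulative buckets plus sum and count."""

    def __init__(self, name: str, bounds: List[float]) -> None:
        self.name = name
        self.bounds = sorted(bounds)
        # One counter per upper bound, plus a final one for the +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Index of the first bound that is >= value (the "le" bucket).
        index = bisect.bisect_left(self.bounds, value)
        self.counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self) -> List[Tuple[str, int]]:
        """Return (le, cumulative count) pairs, as Prometheus exposes them."""
        labels = [str(bound) for bound in self.bounds] + ["+Inf"]
        running, result = 0, []
        for label, count in zip(labels, self.counts):
            running += count
            result.append((label, running))
        return result

    def mean(self) -> float:
        return self.sum / self.count if self.count else 0.0


# Hypothetical bucket bounds in seconds.
block_time = Histogram("iroha_block_time_seconds", [1.0, 2.0, 5.0])
for seconds in (0.5, 2.0, 3.5, 7.0):
    block_time.observe(seconds)
```

With this shape, "average block time" does not need its own metric: it falls out of the histogram's sum divided by its count.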

Decisions

Alternatives

Concerns

Assumptions

...

A rich toolkit for collecting information about the execution of decentralized and distributed systems is essential for noticing failures and reacting to them in time.

Core requirements

  1. The distributed design requires continuous collection of information from all nodes of the system, regardless of the system's integrity (e.g. connection issues or desync)
  2. A single solution must work regardless of the final configuration (a local developer network of several nodes, a distributed business network with different administrators, or even a public network)
  3. Public and private information must be kept isolated, without disclosing private data

Overview

In accordance with requirement 1, nodes must transmit their telemetry data to the configured endpoint. Aggregating and caching the data for later analysis is the responsibility of the end service.

Based on the nature of the data, the node selects either the public or the private endpoint as the destination. All transmission follows a push model, from the node to the end service.
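
The endpoint selection described above can be sketched as follows. The `TelemetryConfig` type, the `push` function, the injected `send` callable and the example URLs are all hypothetical; the sketch assumes that a node which does not participate simply has no endpoint configured for that kind of data.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TelemetryConfig:
    """Endpoints configured on a node; None means this kind of data is not sent."""
    public_endpoint: Optional[str] = None
    private_endpoint: Optional[str] = None


def push(
    config: TelemetryConfig,
    payload: dict,
    private: bool,
    send: Callable[[str, dict], None],
) -> bool:
    """Push one message to the endpoint matching its nature.

    Returns False when no matching endpoint is configured (the node opted out).
    """
    endpoint = config.private_endpoint if private else config.public_endpoint
    if endpoint is None:
        return False
    send(endpoint, payload)
    return True


# Usage with an in-memory transport standing in for the real one.
sent = []
config = TelemetryConfig(public_endpoint="https://telemetry.example/public")
pushed_public = push(config, {"finalized_height": 42}, False, lambda url, msg: sent.append((url, msg)))
pushed_private = push(config, {"ip_address": "10.0.0.1"}, True, lambda url, msg: sent.append((url, msg)))
```

Injecting `send` keeps the routing decision independent of the transport, which is left open here.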

Pic #1: [image not included]

As shown in the picture, participation in telemetry collection is optional, and Company C opts out. To respect this choice, publicly available data should not contain any details about Company C's nodes.

End service

To support various monitoring and analytics tools, the end service can run in three different modes:
1) As a simple proxy for incoming data, to support tools with a push model. The service wraps incoming data with the node id and pushes it forward.
2) As a queue. The service caches all incoming data until the consumer(s) request it.
3) As an aggregation layer. The service stores all incoming data and calculates summaries for future requests.
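
The three modes above might share one `receive` entry point and differ only in what they do with the data. The class names, the envelope shape and the per-node message count used as a "summary" below are assumptions made for illustration only.

```python
from collections import Counter, deque
from typing import Callable, Deque, Dict, List, Tuple


class ProxyService:
    """Mode 1: wrap incoming data with the node id and push it forward."""

    def __init__(self, forward: Callable[[dict], None]) -> None:
        self.forward = forward

    def receive(self, node_id: str, data: dict) -> None:
        self.forward({"node_id": node_id, "data": data})


class QueueService:
    """Mode 2: cache incoming data until a consumer asks for it."""

    def __init__(self) -> None:
        self.pending: Deque[Tuple[str, dict]] = deque()

    def receive(self, node_id: str, data: dict) -> None:
        self.pending.append((node_id, data))

    def drain(self) -> List[Tuple[str, dict]]:
        items = list(self.pending)
        self.pending.clear()
        return items


class AggregationService:
    """Mode 3: store incoming data and compute summaries on request."""

    def __init__(self) -> None:
        self.store: List[Tuple[str, dict]] = []

    def receive(self, node_id: str, data: dict) -> None:
        self.store.append((node_id, data))

    def summary(self) -> Dict[str, int]:
        # Placeholder summary: number of messages received per node.
        return dict(Counter(node_id for node_id, _ in self.store))


forwarded: List[dict] = []
proxy, queue, aggregator = ProxyService(forwarded.append), QueueService(), AggregationService()
for service in (proxy, queue, aggregator):
    service.receive("node-a", {"tx_count": 3})
    service.receive("node-b", {"tx_count": 5})
```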

Data

Raw messages from nodes

Category | Public | Private
Initial

network_id: string

chain: string

name: string

version: string

location?: string

public_key: pubkey

startup_time: number

os_name: string

ip_address: string

port: string

modules: actor_slug[]

Tick

connected_peers: integer

tx_count: integer

bandwidth_upload: float

bandwidth_download: float

finalized_height: integer

finalized_hash: hash

latest_height: integer

latest_hash: hash

available_ram: integer

available_disk: integer

cpu_consumption: float



Block

type: proposal | vote | commit | finalized

height: integer

hash: hash

validators: pubkey[]

transactions: hash[]


Transaction

hash: hash

size: integer

signatories: pubkey[]

data?: blob


Connection

type: discovered | connected | disconnected

ping: integer

public_key: pubkey

karma?: integer

ip_address: string

port: string
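
As an illustration of how one of these raw messages could be typed and serialized, here is a sketch of the Block message. It assumes JSON as the wire format and hex strings for the `hash` and `pubkey` types; neither choice is specified on this page.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import List


class BlockEventType(Enum):
    PROPOSAL = "proposal"
    VOTE = "vote"
    COMMIT = "commit"
    FINALIZED = "finalized"


@dataclass
class BlockMessage:
    type: BlockEventType
    height: int
    hash: str               # assumption: hex-encoded block hash
    validators: List[str]   # assumption: hex-encoded public keys
    transactions: List[str] # assumption: hex-encoded transaction hashes

    def to_json(self) -> str:
        payload = asdict(self)
        # Enums are not JSON serializable, so send the plain string value.
        payload["type"] = self.type.value
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "BlockMessage":
        data = json.loads(raw)
        data["type"] = BlockEventType(data["type"])
        return cls(**data)


message = BlockMessage(
    type=BlockEventType.FINALIZED,
    height=10,
    hash="ab12",
    validators=["pk1", "pk2"],
    transactions=["tx1"],
)
encoded = message.to_json()
decoded = BlockMessage.from_json(encoded)
```

The other categories (Initial, Tick, Transaction, Connection) could follow the same pattern, with optional fields such as `location?` mapped to `Optional[...]` with a default of `None`.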

Aggregations

  1. Peer based:
    1. Uptime
    2. Avg bandwidth
    3. Avg ping
  2. State based:
    1. Finalized block
    2. Average block time
    3. Time since last block
    4. Block propagation time
    5. Avg number of transactions in blocks
    6. Number of pending transaction
    7. Number of active users (signatories)
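
Some of the state-based aggregations could be computed from finalized Block messages as sketched below. The Block message defined above carries no timestamp, so this sketch assumes the end service records a `received_at` time (in seconds) for each message; that field and the `FinalizedBlock` type are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FinalizedBlock:
    height: int
    received_at: float  # assumption: receipt time at the end service, in seconds
    tx_count: int


def average_block_time(blocks: List[FinalizedBlock]) -> Optional[float]:
    """Mean time between consecutive heights, or None if it cannot be computed."""
    if len(blocks) < 2:
        return None
    ordered = sorted(blocks, key=lambda block: block.height)
    height_span = ordered[-1].height - ordered[0].height
    if height_span == 0:
        return None
    return (ordered[-1].received_at - ordered[0].received_at) / height_span


def time_since_last_block(blocks: List[FinalizedBlock], now: float) -> Optional[float]:
    if not blocks:
        return None
    latest = max(blocks, key=lambda block: block.height)
    return now - latest.received_at


def average_tx_count(blocks: List[FinalizedBlock]) -> Optional[float]:
    if not blocks:
        return None
    return sum(block.tx_count for block in blocks) / len(blocks)


blocks = [
    FinalizedBlock(height=1, received_at=100.0, tx_count=4),
    FinalizedBlock(height=2, received_at=103.0, tx_count=0),
    FinalizedBlock(height=3, received_at=107.0, tx_count=2),
]
```

Returning `None` for an empty or single-block history keeps "no data yet" distinct from a real value of zero.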

Decisions

  1. Minimize middleware count
  2. Split private data (used for administration purposes) and public data

Additional Information

Prometheus metric types: https://prometheus.io/docs/concepts/metric_types/

Substrate telemetry types and open-source monitor: https://github.com/paritytech/substrate-telemetry/blob/master/backend/src/types.rs