Spaxiom Technical Series - Part 14

Federated Sensor-RL Gym

Distributed Event Ordering, Consensus Protocols, and Multi-Site Reinforcement Learning

Joe Scanlin

November 2025

About This Section

This section demonstrates how Spaxiom deployments naturally form federated RL environments where multiple sites can train local models and share learnings without centralizing raw sensor data. Each deployment becomes a local RL environment with observations (Spaxiom events), actions (actuator commands), and rewards (KPIs).

You'll learn about event-sufficient statistics for privacy-preserving federated learning, distributed consensus protocols (Raft, Paxos), event ordering strategies (wall-clock, Lamport clocks, vector clocks), network partition handling, deduplication and idempotency, hierarchical aggregation for scaling to 10,000+ sites, and consistency/availability tradeoffs across the CAP theorem spectrum. Includes complete architecture diagrams and implementation examples.

8. Federated Sensor-RL Gym

8.1 Local environments, shared semantics

Each Spaxiom deployment naturally forms a local RL environment: observations are Spaxiom events, actions are actuator commands, and rewards are site-level KPIs.

Let there be N sites, where site i collects experience from a distribution 𝒟_i. Define the shared objective:

$$\max_{\theta} \; \sum_{i=1}^{N} w_i \, \mathbb{E}_{\tau \sim \mathcal{D}_i}\!\left[ R_i(\tau; \theta) \right]$$

where:

  - θ are the shared policy parameters,
  - w_i is the weight given to site i (for example, proportional to its experience volume),
  - 𝒟_i is the distribution of trajectories τ collected at site i, and
  - R_i(τ; θ) is the return of trajectory τ under site i's reward definition.

Crucially, Spaxiom ensures that event schemas and reward semantics are aligned across sites. That makes it possible to pool experience and model updates across deployments, optimize a single global policy against the weighted objective above, and compare rewards across sites without per-site translation layers.
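
As an illustration of how the weighted objective translates into aggregation, the sketch below averages per-site parameter updates with normalized weights w_i; the aggregate helper and the NumPy parameter representation are assumptions for this example, not Spaxiom APIs.

import numpy as np

def aggregate(updates, weights):
    """Weighted average of per-site parameter updates (FedAvg-style sketch)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize so the weights sum to 1
    stacked = np.stack(updates)          # shape: (N_sites, n_params)
    return (w[:, None] * stacked).sum(axis=0)

# Example: three sites, each reporting a local policy-parameter vector,
# weighted by the amount of experience collected at that site.
theta_global = aggregate(
    updates=[np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.1, 1.2])],
    weights=[100, 80, 120],
)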

8.2 Event-sufficient statistics

Instead of shipping raw trajectories τ (which might be huge time series of sensor values), each site can ship event-sufficient statistics: compact summaries derived from the event stream (e.g., event counts, durations, and reward totals) rather than the raw sensor signals.

For many control and planning tasks, these event-level statistics retain enough information to improve policies globally, while significantly reducing bandwidth and privacy risk.
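
As a sketch of what such statistics might look like in practice, the helper below compresses a local event log into counts, mean durations, and a reward total; the field names ("type", "duration_s", "reward") and the choice of statistics are assumptions for illustration.

from collections import Counter, defaultdict

def event_statistics(events):
    """Compress a local event log into compact, shareable statistics."""
    counts = Counter()
    durations = defaultdict(list)
    total_reward = 0.0

    for e in events:
        counts[e["type"]] += 1                       # how often each event type fired
        if "duration_s" in e:
            durations[e["type"]].append(e["duration_s"])
        total_reward += e.get("reward", 0.0)         # accumulated KPI signal

    return {
        "event_counts": dict(counts),
        "mean_durations": {t: sum(d) / len(d) for t, d in durations.items()},
        "total_reward": total_reward,
    }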

[Figure: a global aggregator connected to four sites (Hospital-A, Warehouse-B, Retail-C, Office-D), each running Spaxiom + RL locally; sites send model updates or event statistics upstream and receive the global model in return.]

Figure 5: Multiple sites each run Spaxiom + RL locally. Periodically, they send model updates or compressed event statistics to an aggregator. The aggregator updates a global model and sends it back. Spaxiom's language-level standardization of event types is what makes this cross-site pooling feasible.


8.3 Distributed Consensus and Event Ordering

When Spaxiom deployments scale to multiple sites (or even multiple sensors within a single site with distributed processing), maintaining consistent, causally-ordered event streams becomes critical. Familiar distributed-systems challenges arise: clock skew between sites, out-of-order and duplicate event delivery, network partitions, and the need for agreement on safety-critical events.

This section describes Spaxiom's approach to distributed event ordering, consensus, and consistency guarantees.

Timestamp semantics: wall-clock vs logical clocks

Spaxiom events include timestamps, but what do these timestamps mean?

Wall-clock timestamps (default)

By default, events use wall-clock timestamps (UTC, via NTP or PTP synchronization):

{
    "type": "DoorOpened",
    "site_id": "hospital-5f",
    "zone": "ward-b-door-2",
    "timestamp": "2025-11-06T14:23:45.123456Z",  # ISO 8601 UTC
    "sensor_id": "door_sensor_42"
}

Wall-clock timestamps work well when clocks are kept tightly synchronized (NTP/PTP), events are compared at time scales much coarser than the residual clock skew, and the workload is analytics-oriented rather than causality-sensitive.

However, wall-clock timestamps have limitations: clocks drift and skew between sites, events can arrive out of order relative to their timestamps, and equal or nearby timestamps say nothing about which event caused which.

Logical clocks (Lamport timestamps)

For causal ordering, Spaxiom supports Lamport logical clocks. Each event carries a counter that is incremented on every event and synchronized on message exchange:

{
    "type": "DoorOpened",
    "site_id": "hospital-5f",
    "zone": "ward-b-door-2",
    "lamport_clock": 1247,  # Logical timestamp
    "wall_timestamp": "2025-11-06T14:23:45.123456Z"
}

Lamport clock rules:

  1. Each site/process maintains a local counter L, initially 0.
  2. Before emitting an event, increment: L := L + 1.
  3. Attach current L to the event.
  4. On receiving an event with timestamp L', update: L := max(L, L') + 1.

Lamport clocks guarantee: if event A causally precedes event B (A → B), then L_A < L_B.

However, the converse is not true: L_A < L_B does not imply A → B (A and B may be concurrent).
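
A minimal Python rendering of rules 1-4, purely illustrative rather than part of spaxiom.distributed:

class LamportClock:
    """Logical clock implementing the four update rules above."""

    def __init__(self):
        self.counter = 0                       # rule 1: local counter starts at 0

    def emit(self, event):
        self.counter += 1                      # rule 2: increment before emitting
        event["lamport_clock"] = self.counter  # rule 3: attach the current value
        return event

    def receive(self, event):
        # rule 4: jump past the sender's clock if it is ahead of ours
        self.counter = max(self.counter, event["lamport_clock"]) + 1
        return event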

Vector clocks (for full causality)

For applications requiring full causal ordering (e.g., distributed debugging, conflict-free replicated data types), Spaxiom supports vector clocks:

{
    "type": "DoorOpened",
    "site_id": "hospital-5f",
    "zone": "ward-b-door-2",
    "vector_clock": {
        "hospital-5f": 1247,
        "hospital-3a": 892,
        "cloud-aggregator": 5643
    }
}

A vector clock V is a dictionary mapping site IDs to counters. Comparison: V_A ≤ V_B iff every component of V_A is less than or equal to the corresponding component of V_B; A → B iff V_A ≤ V_B and not V_B ≤ V_A; if neither dominates the other, A and B are concurrent.

Vector clocks provide full causality but scale as O(N) in the number of sites (each event carries a vector of size N). For large federations (1000+ sites), this becomes impractical.
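
The comparison rules take only a few lines; the sketch below is illustrative and not the spaxiom.distributed implementation (missing entries are treated as 0):

def vc_leq(va, vb):
    """True if vector clock va <= vb component-wise."""
    return all(va.get(s, 0) <= vb.get(s, 0) for s in set(va) | set(vb))

def happens_before(va, vb):
    """A -> B iff V_A <= V_B component-wise and V_B is not <= V_A."""
    return vc_leq(va, vb) and not vc_leq(vb, va)

def concurrent(va, vb):
    """Neither event causally precedes the other."""
    return not happens_before(va, vb) and not happens_before(vb, va)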

Event ordering strategies

Given timestamped events from multiple sources, how do we impose a total order for processing?

Strategy 1: Wall-clock ordering with buffering

Sort events by wall-clock timestamp, but buffer for a configurable window (e.g., 1 second) to tolerate clock skew:

from spaxiom.distributed import EventBuffer

buffer = EventBuffer(
    window_s=1.0,  # Buffer events for 1 second
    clock_type="wall"
)

# Events arrive out-of-order
buffer.add(event_A)  # timestamp: 14:23:45.500
buffer.add(event_C)  # timestamp: 14:23:46.000
buffer.add(event_B)  # timestamp: 14:23:45.800

# After 1 second, flush sorted events
ordered_events = buffer.flush()  # [event_A, event_B, event_C]

This strategy works well for soft real-time analytics (e.g., dashboards, BI queries) where 1-5 second latency is acceptable.
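
One way such a buffer can be realized is a min-heap keyed by event timestamp, holding each event for the configured window before release; this is a simplified stand-in for spaxiom.distributed.EventBuffer, with the (event, timestamp) calling convention assumed for the sketch.

import heapq
import itertools
import time

class SimpleEventBuffer:
    """Hold events for window_s seconds, then release them in timestamp order."""

    def __init__(self, window_s=1.0):
        self.window_s = window_s
        self._heap = []                  # (event_ts, arrival_time, seq, event)
        self._seq = itertools.count()    # tie-breaker so event dicts are never compared

    def add(self, event, event_ts):
        heapq.heappush(self._heap, (event_ts, time.monotonic(), next(self._seq), event))

    def flush(self):
        """Release events whose buffering window has expired, in timestamp order."""
        now = time.monotonic()
        ready = []
        while self._heap and now - self._heap[0][1] >= self.window_s:
            ready.append(heapq.heappop(self._heap)[3])
        return ready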

Strategy 2: Lamport ordering for causal consistency

Sort events by Lamport clock, breaking ties by site ID (lexicographic):

import functools

def compare_lamport(event_a, event_b):
    if event_a["lamport_clock"] < event_b["lamport_clock"]:
        return -1
    elif event_a["lamport_clock"] > event_b["lamport_clock"]:
        return 1
    else:
        # Tie-break by site_id (deterministic but arbitrary)
        if event_a["site_id"] < event_b["site_id"]:
            return -1
        elif event_a["site_id"] > event_b["site_id"]:
            return 1
        return 0

ordered = sorted(events, key=functools.cmp_to_key(compare_lamport))

This ensures causally-related events are processed in order, but concurrent events may be ordered arbitrarily (deterministically).

Strategy 3: Vector clock ordering with conflict detection

For critical applications (e.g., financial transactions, safety decisions), use vector clocks to detect concurrent events and handle conflicts explicitly:

from spaxiom.distributed import VectorClockOrdering

ordering = VectorClockOrdering()

for event in incoming_stream:
    ordering.add(event)

# Process causally-ready events
while ordering.has_ready():
    event = ordering.pop_next_causal()
    process(event)

# Detect conflicts
conflicts = ordering.get_concurrent_events()
for (event_a, event_b) in conflicts:
    resolve_conflict(event_a, event_b)  # Application-specific logic

Consensus protocols for critical events

Some events require distributed consensus: all sites must agree on whether an event occurred and on its ordering relative to other events. Examples: a global emergency stop that must halt actuators at every site, allocation of a shared resource that only one site may hold at a time, and safety-critical decisions that must land in a replicated audit log.

Spaxiom integrates with Raft and Paxos consensus libraries:

from spaxiom.distributed import RaftCluster

# Initialize Raft cluster with 5 sites
cluster = RaftCluster(
    sites=["hospital-5f", "warehouse-b", "retail-c", "office-d", "datacenter-e"],
    leader="hospital-5f"
)

# Propose critical event (requires majority vote)
event = {"type": "GlobalEmergencyStop", "reason": "Fire detected", "site": "hospital-5f"}
success = cluster.propose(event, timeout_s=5.0)

if success:
    # Event committed to replicated log, all sites notified
    broadcast_estop()
else:
    # Consensus failed (network partition, timeout)
    log.error("Failed to reach consensus on emergency stop")

Raft guarantees: a single elected leader per term, a replicated log that every site applies in the same order (linearizability for committed entries), and durability of committed entries as long as a majority of sites remains reachable.

However, consensus has costs: each proposal needs at least one round trip to a majority of sites (100-500 ms in the summary table below), throughput is bounded by the leader, and during a partition the minority side cannot commit anything.

Therefore, consensus is used sparingly for critical events only (e.g., safety violations, resource contention). Normal sensor events use weaker ordering (wall-clock or Lamport).
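
In practice this becomes a routing decision: only a small set of event types pays the consensus round trip, while everything else flows through the cheaper ordering paths. The dispatch sketch below is illustrative; CRITICAL_TYPES and local_pipeline are assumptions, while cluster.propose mirrors the RaftCluster call shown earlier.

CRITICAL_TYPES = {"GlobalEmergencyStop"}   # illustrative set of consensus-worthy events

def dispatch(event, cluster, local_pipeline):
    """Route critical events through Raft; everything else through cheap ordering."""
    if event["type"] in CRITICAL_TYPES:
        if not cluster.propose(event, timeout_s=5.0):   # requires a majority vote
            raise RuntimeError("consensus failed for critical event")
    else:
        local_pipeline.add(event)                       # wall-clock or Lamport ordering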

Handling network partitions

Network partitions are inevitable in distributed systems. Spaxiom provides partition-tolerant modes:

Mode 1: Local autonomy (AP in CAP)

Edge sites continue operating independently during partition, accepting that global state may diverge. When partition heals, use merge strategies:

from spaxiom.distributed import PartitionTolerantStore

store = PartitionTolerantStore(
    consistency="eventual",  # AP in CAP
    merge_strategy="lww"  # Last-write-wins
)

# During partition, each site writes locally
store.put("energy_used", 150.5, site="hospital-5f", lamport=1247)
store.put("energy_used", 98.3, site="warehouse-b", lamport=1248)

# After partition heals, merge
store.sync()  # Uses LWW: energy_used = 98.3 (higher Lamport clock)
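
The last-write-wins rule itself is simple to state; a sketch of the merge decision (not the PartitionTolerantStore internals), using the Lamport clock and site ID as the ordering key:

def lww_merge(write_a, write_b):
    """Pick the write with the higher Lamport clock; tie-break by site ID."""
    key_a = (write_a["lamport"], write_a["site"])
    key_b = (write_b["lamport"], write_b["site"])
    return write_a if key_a > key_b else write_b

# With the two writes above, the warehouse-b write (Lamport 1248) wins:
winner = lww_merge(
    {"value": 150.5, "site": "hospital-5f", "lamport": 1247},
    {"value": 98.3, "site": "warehouse-b", "lamport": 1248},
)
assert winner["value"] == 98.3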

Mode 2: Consistency-first (CP in CAP)

For safety-critical operations, sites halt if they lose contact with consensus leader (sacrificing availability for consistency):

from spaxiom.distributed import ConsistentStore, QuorumUnreachable

store = ConsistentStore(
    consistency="strong",  # CP in CAP
    quorum_size=3  # Requires 3/5 sites reachable
)

try:
    store.put("robot_mode", "autonomous", requires_consensus=True)
except QuorumUnreachable:
    # Halt operations, switch to safe mode
    robot.safe_mode()
    alert("Partition detected, robot halted")

Event deduplication and idempotency

Network retries and partition healing can cause duplicate events. Spaxiom ensures idempotent processing:

  1. Event IDs: each event has a globally unique ID (UUID or site_id + sequence number).
  2. Deduplication window: runtime maintains a bloom filter or hash table of recently-seen event IDs (e.g., last 1 hour).
  3. Idempotent handlers: callbacks are written to be idempotent (safe to call multiple times).

{
    "type": "DoorOpened",
    "event_id": "550e8400-e29b-41d4-a716-446655440000",  # UUID
    "site_id": "hospital-5f",
    "timestamp": "2025-11-06T14:23:45.123456Z"
}

# Runtime deduplicates
@on(door_opened)
def handle_door_opened(event):
    # This will only be called once per unique event_id, even if
    # the event is received multiple times due to network retries
    log_entry_exit(event)
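
A minimal deduplication cache with time-based eviction is sketched below; it uses a plain hash map of recently seen IDs rather than the bloom filter the runtime may use, and the class is illustrative, not a Spaxiom API.

import time

class DedupCache:
    """Remember event IDs for window_s seconds and reject repeats."""

    def __init__(self, window_s=3600.0):
        self.window_s = window_s
        self._seen = {}                  # event_id -> first-seen time

    def is_new(self, event_id):
        now = time.monotonic()
        # Evict entries older than the deduplication window
        self._seen = {eid: t for eid, t in self._seen.items()
                      if now - t < self.window_s}
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True

# Only the first delivery of a given event_id reaches the handler
cache = DedupCache()
if cache.is_new("550e8400-e29b-41d4-a716-446655440000"):
    pass  # invoke the idempotent handler here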

Scalability: hierarchical aggregation

For very large deployments (1000+ sites), flat architectures (all sites → single aggregator) don't scale. Spaxiom supports hierarchical aggregation:

[Figure: hierarchical aggregation topology. Individual sites (hospitals, warehouses, retail, factories) report to regional aggregators (North America: 100 sites, Europe: 80 sites, Asia: 120 sites) with 10-100 ms latency; the 10-100 regional aggregators feed a global cloud aggregator used for global analytics and training, with 100-500 ms latency. The topology scales to 10,000+ sites with sub-second latency.]

Regional aggregators collect event statistics and model updates from on the order of 100 sites each and forward a single compact regional summary upstream, so the global aggregator sees tens of inputs instead of thousands.

This architecture scales to 10,000+ sites while maintaining sub-second end-to-end latency for non-critical events.
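
A sketch of the regional tier, assuming sites send the kind of event statistics described in Section 8.2; the RegionalAggregator class is illustrative, not a Spaxiom API.

class RegionalAggregator:
    """Combine per-site statistics and forward one summary to the global tier."""

    def __init__(self, region):
        self.region = region
        self._site_stats = {}            # site_id -> latest statistics dict

    def ingest(self, site_id, stats):
        self._site_stats[site_id] = stats

    def summarize(self):
        """Collapse ~100 site reports into a single upstream message."""
        summary = {"region": self.region, "sites": len(self._site_stats),
                   "total_reward": 0.0, "event_counts": {}}
        for stats in self._site_stats.values():
            summary["total_reward"] += stats.get("total_reward", 0.0)
            for etype, n in stats.get("event_counts", {}).items():
                summary["event_counts"][etype] = summary["event_counts"].get(etype, 0) + n
        return summary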

Summary: distributed event ordering guarantees

Spaxiom provides a spectrum of consistency/availability tradeoffs:

| Mode | Clock Type | Ordering Guarantee | Latency | Use Case |
| --- | --- | --- | --- | --- |
| Best-effort | Wall-clock | Eventual consistency | 10-100 ms | Analytics, dashboards |
| Causal | Lamport | Causal consistency | 10-100 ms | Federated RL, forensics |
| Causal+ | Vector clock | Full causality + concurrency detection | 50-200 ms | Debugging, conflict resolution |
| Consensus | Raft/Paxos | Linearizability | 100-500 ms | Safety-critical events, resource allocation |
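
The rows of this table correspond to constructs shown earlier in the section, gathered in one place for reference; the arguments echo the earlier examples, and the clock_type="lamport" value is an assumption rather than a documented option.

from spaxiom.distributed import (
    EventBuffer, VectorClockOrdering, RaftCluster,
    PartitionTolerantStore, ConsistentStore,
)

# Best-effort: wall-clock ordering with a skew-tolerance buffer
best_effort = EventBuffer(window_s=1.0, clock_type="wall")

# Causal: Lamport ordering (clock_type value assumed for illustration)
causal = EventBuffer(window_s=1.0, clock_type="lamport")

# Causal+: vector clocks with explicit concurrency detection
causal_plus = VectorClockOrdering()

# Consensus: linearizable commits for safety-critical events
cluster = RaftCluster(
    sites=["hospital-5f", "warehouse-b", "retail-c", "office-d", "datacenter-e"],
    leader="hospital-5f",
)

# Availability-vs-consistency choices for shared state during partitions
ap_store = PartitionTolerantStore(consistency="eventual", merge_strategy="lww")
cp_store = ConsistentStore(consistency="strong", quorum_size=3)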

By making clock semantics and ordering strategies explicit and configurable, Spaxiom enables developers to make principled tradeoffs between consistency, availability, and latency based on application requirements.