Distributed Event Ordering, Consensus Protocols, and Multi-Site Reinforcement Learning
Joe Scanlin
November 2025
This section demonstrates how Spaxiom deployments naturally form federated RL environments where multiple sites can train local models and share learnings without centralizing raw sensor data. Each deployment becomes a local RL environment with observations (Spaxiom events), actions (actuator commands), and rewards (KPIs).
You'll learn about event-sufficient statistics for privacy-preserving federated learning, distributed consensus protocols (Raft, Paxos), event ordering strategies (wall-clock, Lamport clocks, vector clocks), network partition handling, deduplication and idempotency, hierarchical aggregation for scaling to 10,000+ sites, and consistency/availability tradeoffs across the CAP theorem spectrum. Includes complete architecture diagrams and implementation examples.
Each Spaxiom deployment naturally forms a local RL environment (a minimal sketch follows the list below):

- Observations: the site's Spaxiom event stream.
- Actions: actuator commands issued by the local policy.
- Rewards: site-level KPIs.
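As a minimal sketch of this view, the wrapper below treats one deployment as a plain reset/step environment. `SpaxiomClient`, `poll_events`, `send_actuator_command`, and `read_kpi` are hypothetical placeholder names used only for illustration, not part of any documented Spaxiom API.

```python
class SiteRLEnv:
    """Minimal sketch: one Spaxiom deployment viewed as an RL environment.

    The client methods used here (poll_events, send_actuator_command,
    read_kpi) are hypothetical placeholders for illustration.
    """

    def __init__(self, client):
        self.client = client  # connection to the local Spaxiom runtime

    def reset(self):
        # Observation = the most recent batch of Spaxiom events at this site
        return self.client.poll_events()

    def step(self, action):
        # Action = an actuator command chosen by the local policy
        self.client.send_actuator_command(action)
        obs = self.client.poll_events()   # next observation
        reward = self.client.read_kpi()   # reward = site-level KPI
        done = False                      # continuing control task
        return obs, reward, done, {}
```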
Let there be N sites. For site i, with experience distribution 𝒟_i, define the shared objective

    J(θ) = Σ_{i=1}^{N} w_i · 𝔼_{τ ∼ 𝒟_i}[R(τ)],

where:

- θ are the parameters of the shared global policy,
- w_i ≥ 0 are per-site weights with Σ_i w_i = 1 (for example, proportional to each site's volume of experience), and
- R(τ) is the cumulative reward of a trajectory τ, computed from the site's KPIs.
Crucially, Spaxiom ensures that event schemas and reward semantics are aligned across sites, which makes it possible to pool experience and model updates across deployments without per-site schema translation.
Instead of shipping raw trajectories τ (which might be huge time series of sensor values), each site can ship event-sufficient statistics, i.e., compact summaries of its local event stream.
For many control and planning tasks, these event-level statistics retain enough information to improve policies globally, while significantly reducing bandwidth and privacy risk.
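As a purely illustrative sketch of such statistics, assuming events are plain dicts with "type" and "duration_s" fields and rewards are logged per episode, a site might compute:

```python
from collections import Counter

def summarize_events(events, episode_rewards):
    """Compress a site's raw event stream into event-sufficient statistics.

    Assumes events carry "type" and "duration_s" fields and that per-episode
    returns are available; both are illustrative, not a fixed Spaxiom schema.
    """
    counts = Counter(e["type"] for e in events)
    mean_duration = {
        etype: sum(e["duration_s"] for e in events if e["type"] == etype) / n
        for etype, n in counts.items()
    }
    return {
        "event_counts": dict(counts),      # how often each event type fired
        "mean_duration_s": mean_duration,  # average duration per event type
        "mean_return": sum(episode_rewards) / max(len(episode_rewards), 1),
        "n_episodes": len(episode_rewards),
    }

stats = summarize_events(
    [{"type": "DoorOpened", "duration_s": 3.0},
     {"type": "DoorOpened", "duration_s": 5.0}],
    episode_rewards=[12.0, 9.5],
)
# {'event_counts': {'DoorOpened': 2}, 'mean_duration_s': {'DoorOpened': 4.0}, ...}
```

Only this small dictionary, rather than the raw sensor time series, needs to leave the site.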
Figure 5: Multiple sites each run Spaxiom + RL locally. Periodically, they send model updates or compressed event statistics to an aggregator. The aggregator updates a global model and sends it back. Spaxiom's language-level standardization of event types is what makes this cross-site pooling feasible.
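To make the aggregation loop in Figure 5 concrete, here is a minimal federated-averaging sketch; the flat NumPy parameter vectors and the weighting by sample count are illustrative assumptions, not Spaxiom's prescribed aggregation rule.

```python
import numpy as np

def federated_average(site_updates):
    """Combine per-site model parameters into a global model.

    site_updates maps site_id -> (parameter_vector, n_samples); weighting by
    sample count is one common (FedAvg-style) choice, used here as an example.
    """
    total = sum(n for _, n in site_updates.values())
    return sum(params * (n / total) for params, n in site_updates.values())

# Usage: each site ships its locally trained parameters plus its sample count.
updates = {
    "hospital-5f": (np.array([0.2, 1.1, -0.3]), 5000),
    "warehouse-b": (np.array([0.4, 0.9, -0.1]), 3000),
}
global_model = federated_average(updates)  # weighted toward hospital-5f
```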
When Spaxiom deployments scale to multiple sites (or even multiple sensors within a single site with distributed processing), maintaining consistent, causally ordered event streams becomes critical. Familiar distributed-systems challenges arise: clock skew between sensors and sites, out-of-order and duplicate event delivery, network partitions, and the need for agreement on critical events.
This section describes Spaxiom's approach to distributed event ordering, consensus, and consistency guarantees.
Spaxiom events include timestamps, but what do these timestamps mean?
By default, events use wall-clock timestamps (UTC, via NTP or PTP synchronization):
{
  "type": "DoorOpened",
  "site_id": "hospital-5f",
  "zone": "ward-b-door-2",
  "timestamp": "2025-11-06T14:23:45.123456Z",  # ISO 8601 UTC
  "sensor_id": "door_sensor_42"
}
Wall-clock timestamps work well when:

- clocks are kept synchronized (NTP typically to within milliseconds, PTP to microseconds),
- events are far apart in time relative to the residual clock skew, and
- consumers only need approximate ordering (analytics, dashboards).

However, wall-clock timestamps have limitations:

- clocks at different sensors and sites drift and can disagree by more than the gap between events,
- NTP step corrections can make timestamps jump backward, and
- nearby timestamps say nothing about causality: they cannot tell you whether one event could have influenced another.
For causal ordering, Spaxiom supports Lamport logical clocks. Each event carries a counter that is incremented on every event and synchronized on message exchange:
{
  "type": "DoorOpened",
  "site_id": "hospital-5f",
  "zone": "ward-b-door-2",
  "lamport_clock": 1247,  # Logical timestamp
  "wall_timestamp": "2025-11-06T14:23:45.123456Z"
}
Lamport clock rules (implemented in the sketch after this list):

- Before a node emits a local event, it increments its counter: C := C + 1.
- The emitted event carries the sender's counter value.
- On receiving an event with counter C_msg, the receiver sets C := max(C, C_msg) + 1.
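A minimal, self-contained sketch of these rules (the class name and locking choice are illustrative, not Spaxiom's internal implementation):

```python
import threading

class LamportClock:
    """Logical clock implementing the standard Lamport rules."""

    def __init__(self):
        self._counter = 0
        self._lock = threading.Lock()  # events may be emitted from many threads

    def tick(self):
        """Call before emitting a local event; returns the event's timestamp."""
        with self._lock:
            self._counter += 1
            return self._counter

    def observe(self, received):
        """Call when an event arrives carrying a remote Lamport timestamp."""
        with self._lock:
            self._counter = max(self._counter, received) + 1
            return self._counter

# Usage: stamp outgoing events, fold in incoming ones.
clock = LamportClock()
event = {"type": "DoorOpened", "lamport_clock": clock.tick()}  # counter = 1
clock.observe(1500)  # after a remote event with counter 1500, local counter = 1501
```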
Lamport clocks guarantee: if event A causally precedes event B (A → B), then L_A < L_B.
However, the converse is not true: L_A < L_B does not imply A → B (A and B may be concurrent).
For applications requiring full causal ordering (e.g., distributed debugging, conflict-free replicated data types), Spaxiom supports vector clocks:
{
  "type": "DoorOpened",
  "site_id": "hospital-5f",
  "zone": "ward-b-door-2",
  "vector_clock": {
    "hospital-5f": 1247,
    "hospital-3a": 892,
    "cloud-aggregator": 5643
  }
}
Vector clock V is a dictionary mapping site IDs to counters. Comparison (sketched in code after this list):

- V_A ≤ V_B iff V_A[s] ≤ V_B[s] for every site s (missing entries count as 0).
- A → B (A happened before B) iff V_A ≤ V_B and V_A ≠ V_B.
- A and B are concurrent iff neither V_A ≤ V_B nor V_B ≤ V_A.
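These comparison rules are small enough to sketch directly; the function names below are illustrative:

```python
def vc_leq(va, vb):
    """True if vector clock va <= vb component-wise (missing entries = 0)."""
    return all(count <= vb.get(site, 0) for site, count in va.items())

def compare_vector_clocks(va, vb):
    """Return 'equal', 'before', 'after', or 'concurrent' for two vector clocks."""
    if va == vb:
        return "equal"
    if vc_leq(va, vb):
        return "before"      # the event carrying va causally precedes the other
    if vc_leq(vb, va):
        return "after"
    return "concurrent"      # neither dominates: no causal relationship

# Usage with clocks shaped like the example above (site IDs are illustrative).
a = {"hospital-5f": 1247, "hospital-3a": 892}
b = {"hospital-5f": 1250, "hospital-3a": 892, "cloud-aggregator": 5643}
print(compare_vector_clocks(a, b))  # "before": b dominates a in every entry
```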
Vector clocks provide full causality but scale O(N) with number of sites (each event carries a vector of size N). For large federations (1000+ sites), this becomes impractical.
Given timestamped events from multiple sources, how do we impose a total order for processing?
Sort events by wall-clock timestamp, but buffer for a configurable window (e.g., 1 second) to tolerate clock skew:
from spaxiom.distributed import EventBuffer

buffer = EventBuffer(
    window_s=1.0,      # Buffer events for 1 second
    clock_type="wall"
)

# Events arrive out-of-order
buffer.add(event_A)  # timestamp: 14:23:45.500
buffer.add(event_C)  # timestamp: 14:23:46.000
buffer.add(event_B)  # timestamp: 14:23:45.800

# After 1 second, flush sorted events
ordered_events = buffer.flush()  # [event_A, event_B, event_C]
This strategy works well for soft real-time analytics (e.g., dashboards, BI queries) where 1-5 second latency is acceptable.
Sort events by Lamport clock, breaking ties by site ID (lexicographic):
import functools

def compare_lamport(event_a, event_b):
    if event_a["lamport_clock"] < event_b["lamport_clock"]:
        return -1
    elif event_a["lamport_clock"] > event_b["lamport_clock"]:
        return 1
    else:
        # Tie-break by site_id (deterministic but arbitrary)
        if event_a["site_id"] < event_b["site_id"]:
            return -1
        elif event_a["site_id"] > event_b["site_id"]:
            return 1
        return 0

ordered = sorted(events, key=functools.cmp_to_key(compare_lamport))
This ensures causally-related events are processed in order, but concurrent events may be ordered arbitrarily (deterministically).
For critical applications (e.g., financial transactions, safety decisions), use vector clocks to detect concurrent events and handle conflicts explicitly:
from spaxiom.distributed import VectorClockOrdering

ordering = VectorClockOrdering()

for event in incoming_stream:
    ordering.add(event)

    # Process causally-ready events
    while ordering.has_ready():
        event = ordering.pop_next_causal()
        process(event)

    # Detect conflicts
    conflicts = ordering.get_concurrent_events()
    for (event_a, event_b) in conflicts:
        resolve_conflict(event_a, event_b)  # Application-specific logic
Some events require distributed consensus: all sites must agree on whether an event occurred and its ordering relative to other events. Examples:

- a global emergency stop that every site must honor,
- allocation of a shared, contended resource to exactly one site, and
- safety-mode transitions (e.g., switching a robot fleet out of autonomous mode).
Spaxiom integrates with Raft and Paxos consensus libraries:
from spaxiom.distributed import RaftCluster

# Initialize Raft cluster with 5 sites
cluster = RaftCluster(
    sites=["hospital-5f", "warehouse-b", "retail-c", "office-d", "datacenter-e"],
    leader="hospital-5f"
)

# Propose critical event (requires majority vote)
event = {"type": "GlobalEmergencyStop", "reason": "Fire detected", "site": "hospital-5f"}
success = cluster.propose(event, timeout_s=5.0)

if success:
    # Event committed to replicated log, all sites notified
    broadcast_estop()
else:
    # Consensus failed (network partition, timeout)
    log.error("Failed to reach consensus on emergency stop")
Raft guarantees:

- safety: once an event is committed, no site will ever see a conflicting value at that position in the replicated log,
- a single agreed total order over all committed events, and
- progress as long as a majority of sites can communicate with the leader.

However, consensus has costs:

- every committed event requires at least one round trip to a majority of sites, pushing latency into the hundreds of milliseconds (see the table at the end of this section),
- throughput is bounded by the leader, and
- during a partition, the minority side cannot commit anything at all.
Therefore, consensus is used sparingly for critical events only (e.g., safety violations, resource contention). Normal sensor events use weaker ordering (wall-clock or Lamport).
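One way to apply this policy is to route events by type. The sketch below reuses the `RaftCluster` and `EventBuffer` interfaces from the examples above; the `CRITICAL_TYPES` set and the error handling are illustrative assumptions.

```python
# Route only critical event types through consensus; everything else goes
# through the cheaper Lamport/wall-clock ordered path.
CRITICAL_TYPES = {"GlobalEmergencyStop", "ResourceAllocation"}  # illustrative

def route_event(event, cluster, buffer):
    if event["type"] in CRITICAL_TYPES:
        # Pay the consensus cost only where agreement is mandatory.
        if not cluster.propose(event, timeout_s=5.0):
            raise RuntimeError("consensus unavailable for critical event")
    else:
        # Normal sensor events tolerate weaker ordering guarantees.
        buffer.add(event)
```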
Network partitions are inevitable in distributed systems. Spaxiom provides partition-tolerant modes:
In the availability-preferring (AP) mode, edge sites continue operating independently during a partition, accepting that global state may diverge. When the partition heals, merge strategies reconcile the divergent replicas:
from spaxiom.distributed import PartitionTolerantStore

store = PartitionTolerantStore(
    consistency="eventual",  # AP in CAP
    merge_strategy="lww"     # Last-write-wins
)

# During partition, each site writes locally
store.put("energy_used", 150.5, site="hospital-5f", lamport=1247)
store.put("energy_used", 98.3, site="warehouse-b", lamport=1248)

# After partition heals, merge
store.sync()  # Uses LWW: energy_used = 98.3 (higher Lamport clock)
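The `lww` strategy itself reduces to a small merge function. The sketch below assumes each stored version carries the value, Lamport clock, and writing site from the `put()` calls above, with ties broken by site ID as in the Lamport ordering rule earlier in this section; these field names are illustrative.

```python
def lww_merge(local, remote):
    """Last-write-wins merge of two versions of the same key.

    Each version is assumed to be a dict with "value", "lamport", and "site"
    fields; ties on the Lamport clock are broken by site ID.
    """
    if local["lamport"] != remote["lamport"]:
        return local if local["lamport"] > remote["lamport"] else remote
    return local if local["site"] > remote["site"] else remote

# The two divergent writes from the example above:
a = {"value": 150.5, "lamport": 1247, "site": "hospital-5f"}
b = {"value": 98.3, "lamport": 1248, "site": "warehouse-b"}
print(lww_merge(a, b)["value"])  # 98.3 wins: higher Lamport clock
```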
In the consistency-preferring (CP) mode, used for safety-critical operations, sites halt if they lose contact with the consensus leader (sacrificing availability for consistency):
from spaxiom.distributed import ConsistentStore, QuorumUnreachable

store = ConsistentStore(
    consistency="strong",  # CP in CAP
    quorum_size=3          # Requires 3/5 sites reachable
)

try:
    store.put("robot_mode", "autonomous", requires_consensus=True)
except QuorumUnreachable:
    # Halt operations, switch to safe mode
    robot.safe_mode()
    alert("Partition detected, robot halted")
Network retries and partition healing can cause duplicate events. Spaxiom ensures idempotent processing:
{
  "type": "DoorOpened",
  "event_id": "550e8400-e29b-41d4-a716-446655440000",  # UUID
  "site_id": "hospital-5f",
  "timestamp": "2025-11-06T14:23:45.123456Z"
}
# Runtime deduplicates
@on(door_opened)
def handle_door_opened(event):
    # This will only be called once per unique event_id, even if
    # the event is received multiple times due to network retries
    log_entry_exit(event)
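Under the hood, deduplication can be as simple as a bounded set of recently seen event IDs. The sketch below is an illustrative mechanism, not Spaxiom's actual runtime code.

```python
from collections import OrderedDict

class DedupCache:
    """Remember recently seen event_ids so duplicate deliveries can be dropped."""

    def __init__(self, max_entries=100_000):
        self._seen = OrderedDict()
        self._max = max_entries

    def is_duplicate(self, event_id):
        if event_id in self._seen:
            return True
        self._seen[event_id] = True
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict the oldest entry
        return False

# Usage: drop retransmitted copies before handlers run.
cache = DedupCache()
for event in [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "b"}]:
    if cache.is_duplicate(event["event_id"]):
        continue  # the second "a" is skipped
    # handle(event) would run here exactly once per unique event_id
```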
For very large deployments (1000+ sites), flat architectures (all sites → single aggregator) don't scale. Spaxiom supports hierarchical aggregation: sites report to regional aggregators, which consolidate event statistics and model updates before forwarding the combined summaries to a global aggregator.
This architecture scales to 10,000+ sites while maintaining sub-second end-to-end latency for non-critical events.
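A toy two-level combination of the event statistics introduced earlier in this section; the function, field names, and regions below are illustrative assumptions.

```python
def combine_stats(stat_list):
    """Merge event-count summaries from many children into one summary."""
    combined = {"event_counts": {}, "n_episodes": 0}
    for stats in stat_list:
        for etype, count in stats["event_counts"].items():
            combined["event_counts"][etype] = (
                combined["event_counts"].get(etype, 0) + count
            )
        combined["n_episodes"] += stats["n_episodes"]
    return combined

# Illustrative per-site summaries.
west_sites = [
    {"event_counts": {"DoorOpened": 40}, "n_episodes": 10},
    {"event_counts": {"DoorOpened": 25, "ZoneOccupied": 7}, "n_episodes": 8},
]
east_sites = [
    {"event_counts": {"DoorOpened": 60}, "n_episodes": 12},
]

# Level 1: each regional aggregator combines only its own sites.
region_west = combine_stats(west_sites)
region_east = combine_stats(east_sites)

# Level 2: the global aggregator sees a handful of regional summaries,
# so fan-in stays small even with 10,000+ sites.
global_summary = combine_stats([region_west, region_east])
```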
Spaxiom provides a spectrum of consistency/availability tradeoffs:
| Mode | Clock Type | Ordering Guarantee | Latency | Use Case |
|---|---|---|---|---|
| Best-effort | Wall-clock | Eventual consistency | 10-100 ms | Analytics, dashboards |
| Causal | Lamport | Causal consistency | 10-100 ms | Federated RL, forensics |
| Causal+ | Vector clock | Full causality + concurrency detection | 50-200 ms | Debugging, conflict resolution |
| Consensus | Raft/Paxos | Linearizability | 100-500 ms | Safety-critical events, resource allocation |
By making clock semantics and ordering strategies explicit and configurable, Spaxiom enables developers to make principled tradeoffs between consistency, availability, and latency based on application requirements.
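As a closing illustration of "explicit and configurable," a deployment might declare an ordering mode per event class. The `ORDERING_CONFIG` mapping and most event type names below are hypothetical, not a documented Spaxiom configuration schema; they simply map the table above onto concrete choices.

```python
# Hypothetical per-event-class ordering configuration.
ORDERING_CONFIG = {
    "OccupancyChanged":    {"mode": "best-effort", "clock": "wall"},     # dashboards
    "DoorOpened":          {"mode": "causal",      "clock": "lamport"},  # federated RL features
    "ConflictAudit":       {"mode": "causal+",     "clock": "vector"},   # debugging
    "GlobalEmergencyStop": {"mode": "consensus",   "clock": "raft"},     # safety-critical
}

def ordering_for(event_type):
    # Default to the cheapest mode when an event type is not listed explicitly.
    return ORDERING_CONFIG.get(event_type, {"mode": "best-effort", "clock": "wall"})
```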