This directory contains sample network data for demonstrating CrossCheck.
- topology.json: Network topology (3 nodes, linear topology A-B-C)
- telemetry.pkl: Telemetry data (5 snapshots with perturbed counters)
- paths/: Directory containing routing paths for each timestamp
The topology file (topology.json) defines the network structure:
{
  "nodes": [
    {"id": 0, "name": "NodeA"},
    {"id": 1, "name": "NodeB"},
    {"id": 2, "name": "NodeC"}
  ],
  "links": [
    {"source": 0, "target": 1},
    {"source": 1, "target": 2}
  ],
  "external_nodes": ["NodeA", "NodeB", "NodeC"]
}

Fields:
- nodes: List of network nodes with unique integer IDs and string names
- links: List of bidirectional links between nodes (specified by node IDs)
- external_nodes (optional): Nodes with external traffic (ingress/egress to/from the network)
Notes:
- All links are treated as bidirectional
- Node IDs must be unique integers starting from 0
- Node names must be unique strings
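The notes above can be checked mechanically before handing a topology to CrossCheck. A minimal validation sketch (the `validate_topology` helper is illustrative, not part of the CrossCheck API):

```python
def validate_topology(topology):
    """Check the uniqueness rules listed above (illustrative helper)."""
    ids = [n["id"] for n in topology["nodes"]]
    names = [n["name"] for n in topology["nodes"]]
    assert sorted(ids) == list(range(len(ids))), "IDs must be unique integers starting from 0"
    assert len(set(names)) == len(names), "node names must be unique"
    for link in topology["links"]:
        assert link["source"] in ids and link["target"] in ids, "link endpoint is not a valid node ID"
    return True

# The 3-node sample topology from above
topology = {
    "nodes": [{"id": 0, "name": "NodeA"}, {"id": 1, "name": "NodeB"},
              {"id": 2, "name": "NodeC"}],
    "links": [{"source": 0, "target": 1}, {"source": 1, "target": 2}],
    "external_nodes": ["NodeA", "NodeB", "NodeC"],
}
print(validate_topology(topology))  # True
```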
The telemetry file (telemetry.pkl) is a pandas DataFrame where each row is a network snapshot.
- timestamp (str): Snapshot timestamp in the format "YYYY/MM/DD HH:MM UTC"
- telemetry_perturbed_type (str): Type of perturbation applied to counters (e.g., "scaled", "NONE")
- input_perturbed_type (str): Type of perturbation applied to demands (e.g., "none", "NONE")
- true_detect_inconsistent (bool): Ground-truth flag indicating whether an inconsistency exists
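For example, the metadata columns make it easy to select the snapshots with known injected errors. A small sketch using toy rows in the layout described above:

```python
import pandas as pd

# Toy DataFrame with the metadata columns described above (values are hypothetical)
df = pd.DataFrame([
    {"timestamp": "2004/03/01 00:00 UTC",
     "telemetry_perturbed_type": "scaled",
     "input_perturbed_type": "NONE",
     "true_detect_inconsistent": True},
    {"timestamp": "2004/03/01 00:05 UTC",
     "telemetry_perturbed_type": "NONE",
     "input_perturbed_type": "NONE",
     "true_detect_inconsistent": False},
])

# Select the snapshots whose ground truth flags an inconsistency
inconsistent = df[df["true_detect_inconsistent"]]
print(inconsistent["timestamp"].tolist())
```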
Format: high_{src_node}_{dst_node}
Example: high_NodeA_NodeB = traffic demand from NodeA to NodeB
Value: Float representing traffic volume
Format: low_{node}_{interface_type}_{neighbor} or low_{node}_{external_type}
Examples:
- low_NodeA_egress_to_NodeB: Traffic sent from NodeA to NodeB
- low_NodeB_ingress_from_NodeA: Traffic received by NodeB from NodeA
- low_NodeA_origination: External traffic entering the network at NodeA
- low_NodeA_termination: External traffic leaving the network at NodeA
Value: Dictionary with keys:
{
  'ground_truth': float,   # Original correct value (for evaluation)
  'perturbed': float,      # Measured value with errors
  'corrected': float,      # Repaired value (added by CrossCheck)
  'confidence': float      # Repair confidence 0-1 (added by CrossCheck)
}

For each bidirectional link between NodeA and NodeB, there are 4 counter columns:
- low_NodeA_egress_to_NodeB: Traffic leaving NodeA toward NodeB (reported by NodeA)
- low_NodeB_ingress_from_NodeA: Traffic entering NodeB from NodeA (reported by NodeB)
- low_NodeB_egress_to_NodeA: Traffic leaving NodeB toward NodeA (reported by NodeB)
- low_NodeA_ingress_from_NodeB: Traffic entering NodeA from NodeB (reported by NodeA)
These represent the same physical traffic flows but measured from different perspectives, which may be inconsistent due to measurement errors.
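Because each physical flow is measured twice, the two perspectives can be compared directly. A minimal sketch with hypothetical counter values in the dict format above:

```python
# Two views of the same physical flow NodeA -> NodeB (hypothetical values)
egress = {"ground_truth": 100.0, "perturbed": 112.0}   # reported by NodeA
ingress = {"ground_truth": 100.0, "perturbed": 99.5}   # reported by NodeB

# Relative disagreement between the two measurements of the same flow
diff = abs(egress["perturbed"] - ingress["perturbed"])
rel = diff / max(egress["perturbed"], ingress["perturbed"])
print(f"relative mismatch: {rel:.1%}")
```

A large mismatch between the two views of one flow is exactly the kind of inconsistency a cross-checking tool can detect.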
For nodes with external traffic (specified in external_nodes):
- low_{node}_origination: Traffic entering the network at this node
- low_{node}_termination: Traffic leaving the network at this node
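Together with the per-link counters, these external counters allow a flow-conservation check at each node: traffic entering a node (ingress from neighbors plus origination) should equal traffic leaving it (egress to neighbors plus termination). A sketch with hypothetical values for a middle node like NodeB:

```python
# Flow conservation at NodeB (all values hypothetical)
ingress_total = 40.0 + 25.0   # low_NodeB_ingress_from_NodeA + low_NodeB_ingress_from_NodeC
origination = 10.0            # low_NodeB_origination
egress_total = 30.0 + 35.0    # low_NodeB_egress_to_NodeA + low_NodeB_egress_to_NodeC
termination = 10.0            # low_NodeB_termination

imbalance = (ingress_total + origination) - (egress_total + termination)
print(f"conservation residual: {imbalance}")  # ~0 for consistent counters
```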
Create a JSON file with your network structure:
import json
topology = {
    "nodes": [
        {"id": 0, "name": "Router1"},
        {"id": 1, "name": "Router2"},
        # ... more nodes
    ],
    "links": [
        {"source": 0, "target": 1},
        # ... more links
    ],
    "external_nodes": ["Router1", "Router2"]  # Optional
}

with open('my_topology.json', 'w') as f:
    json.dump(topology, f, indent=2)

Create a pandas DataFrame with the required columns:
import pandas as pd

# Create rows (one per snapshot)
data = []
for timestamp in timestamps:
    row = {
        'timestamp': timestamp,
        'telemetry_perturbed_type': 'measured',
        'input_perturbed_type': 'NONE',
        'true_detect_inconsistent': False,
    }
    # Add demands (high_*)
    for src, dst in node_pairs:
        row[f'high_{src}_{dst}'] = get_demand(src, dst, timestamp)
    # Add counters (low_*)
    for link in topology['links']:
        src = topology['nodes'][link['source']]['name']
        dst = topology['nodes'][link['target']]['name']
        # Add interface counters as dicts
        row[f'low_{src}_egress_to_{dst}'] = {
            'ground_truth': None,  # Optional for real data
            'perturbed': get_counter(src, dst, timestamp)
        }
        # ... add ingress and reverse direction counters
    # Add external interfaces if applicable
    for node in topology.get('external_nodes', []):
        row[f'low_{node}_origination'] = {
            'ground_truth': None,
            'perturbed': get_external_ingress(node, timestamp)
        }
        row[f'low_{node}_termination'] = {
            'ground_truth': None,
            'perturbed': get_external_egress(node, timestamp)
        }
    data.append(row)
df = pd.DataFrame(data)
df.to_pickle('my_telemetry.pkl')

- Complete coverage: All node pairs must have demand columns, and all links must have counter columns
- Consistent naming: Use exact node names from topology
- Valid values: All numeric values must be non-negative floats
- Bidirectional: For each link, include both directions (egress and ingress from both perspectives)
- External interfaces: If any node has *_origination or *_termination columns, include it in external_nodes
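The coverage requirements above can be enumerated directly from the topology. A sketch of a helper (illustrative, not part of the CrossCheck API) that lists every column the naming rules imply:

```python
from itertools import permutations

def required_columns(topology):
    """Enumerate the demand and counter columns implied by the naming rules above."""
    name = {n["id"]: n["name"] for n in topology["nodes"]}
    cols = []
    # One demand column per ordered node pair
    for src, dst in permutations(name.values(), 2):
        cols.append(f"high_{src}_{dst}")
    # Four counter columns per bidirectional link
    for link in topology["links"]:
        a, b = name[link["source"]], name[link["target"]]
        cols += [f"low_{a}_egress_to_{b}", f"low_{b}_ingress_from_{a}",
                 f"low_{b}_egress_to_{a}", f"low_{a}_ingress_from_{b}"]
    # Origination/termination for external nodes
    for node in topology.get("external_nodes", []):
        cols += [f"low_{node}_origination", f"low_{node}_termination"]
    return cols

# The 3-node sample: 6 demands + 8 link counters + 6 external counters
topology = {
    "nodes": [{"id": 0, "name": "NodeA"}, {"id": 1, "name": "NodeB"},
              {"id": 2, "name": "NodeC"}],
    "links": [{"source": 0, "target": 1}, {"source": 1, "target": 2}],
    "external_nodes": ["NodeA", "NodeB", "NodeC"],
}
print(len(required_columns(topology)))  # 20
```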
- Missing columns: Ensure all node pairs have demands and all links have counters
- Name mismatches: Node names in telemetry columns must exactly match topology
- Incorrect format: Counter columns must be dicts, demand columns must be floats
- Negative values: Network counters cannot be negative
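Several of these pitfalls are cheap to catch per row. A minimal checker sketch (hypothetical helper, following the type rules above):

```python
def check_row(row):
    """Flag the common pitfalls listed above in a single telemetry row (sketch)."""
    problems = []
    for col, val in row.items():
        if col.startswith("high_"):
            if not isinstance(val, float) or val < 0:
                problems.append(f"{col}: demands must be non-negative floats")
        elif col.startswith("low_"):
            if not isinstance(val, dict) or "perturbed" not in val:
                problems.append(f"{col}: counters must be dicts with a 'perturbed' key")
            elif val["perturbed"] is not None and val["perturbed"] < 0:
                problems.append(f"{col}: counters cannot be negative")
    return problems

# A row with one valid demand and one negative counter (hypothetical values)
row = {"high_NodeA_NodeB": 12.5,
       "low_NodeA_egress_to_NodeB": {"ground_truth": None, "perturbed": -3.0}}
print(check_row(row))  # flags the negative counter
```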
For more details, see the main README.md and API documentation.
The sample_data/ directory also contains a subset of real network telemetry from the Abilene academic backbone network.
- abilene_subset.pkl: 100 snapshots of real network telemetry (March 1-5, 2004)
- abilene_topology.json: Abilene network topology (12 nodes, 15 links)
- abilene_paths.json: Routing paths for the Abilene network
The Abilene network consists of 12 nodes representing major US cities:
- ATLAM5, ATLAng (Atlanta)
- CHINng (Chicago)
- DNVRng (Denver)
- HSTNng (Houston)
- IPLSng (Indianapolis)
- KSCYng (Kansas City)
- LOSAng (Los Angeles)
- NYCMng (New York)
- SNVAng (Sunnyvale/San Jose)
- STTLng (Seattle)
- WASHng (Washington DC)
The Abilene data follows the same format as the synthetic data (see above), with:
- Low-level interface counters (low_*)
- High-level traffic demands (high_*)
- Metadata columns
The main differences:
- Larger scale: 12 nodes vs 3 nodes, 15 links vs 2 links
- Real traffic patterns: Actual measured traffic from 2004
- More counters: 84 interface counters (vs 8 in simple example)
- 132 demands: All node pairs (vs 6 in simple example)
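The column counts follow directly from the topology, assuming every Abilene node carries external traffic (so the 84 interface counters include the origination/termination columns):

```python
nodes, links = 12, 15

demands = nodes * (nodes - 1)     # one high_* column per ordered node pair
link_counters = links * 4         # four low_* columns per bidirectional link
external_counters = nodes * 2     # origination + termination per node

print(demands, link_counters + external_counters)  # 132 84
```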
The Abilene dataset is a well-known network research dataset originally collected from the Internet2 Abilene network. The data has been preprocessed into the CrossCheck format with clean ground truth values.
The Abilene subset was extracted using generate_abilene_subset.py:
python3 generate_abilene_subset.py

This extracts the first 100 snapshots from the full dataset (4008 snapshots total), reducing the file size from 34 MB to 0.6 MB for easier distribution.
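The extraction logic amounts to a simple slice of the full DataFrame. A rough sketch of the idea (with a toy stand-in for the full dataset; the real script's details may differ):

```python
import pandas as pd

# Toy stand-in for the full dataset (4008 snapshots in the real file)
full = pd.DataFrame({"timestamp": [f"t{i}" for i in range(4008)]})

# Keep the first 100 snapshots and write a smaller pickle (demo file name)
subset = full.iloc[:100].reset_index(drop=True)
subset.to_pickle("abilene_subset_demo.pkl")
print(len(subset))  # 100
```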