Sample Data Format

This directory contains sample network data for demonstrating CrossCheck.

Files

topology.json: Network topology (3 nodes, linear topology A-B-C)
telemetry.pkl: Telemetry data (5 snapshots with perturbed counters)
paths/: Directory containing routing paths for each timestamp

Topology Format

The topology file (topology.json) defines the network structure:

{
  "nodes": [
    {"id": 0, "name": "NodeA"},
    {"id": 1, "name": "NodeB"},
    {"id": 2, "name": "NodeC"}
  ],
  "links": [
    {"source": 0, "target": 1},
    {"source": 1, "target": 2}
  ],
  "external_nodes": ["NodeA", "NodeB", "NodeC"]
}

Fields:

nodes: List of network nodes with unique integer IDs and string names
links: List of bidirectional links between nodes (specified by node IDs)
external_nodes (optional): Names of nodes that carry external traffic (traffic entering or leaving the network)

Notes:

All links are treated as bidirectional
Node IDs must be unique integers starting from 0
Node names must be unique strings
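
The rules above can be checked mechanically before running anything else. The following is a minimal sketch, not part of CrossCheck: `validate_topology` is a hypothetical helper name, and the check assumes node IDs are contiguous (0, 1, ..., n-1), which is stricter than what the notes state.

```python
import json


def validate_topology(topology):
    """Return a list of problems found in a topology dict (empty if none)."""
    problems = []
    nodes = topology.get("nodes", [])
    ids = [node["id"] for node in nodes]
    names = [node["name"] for node in nodes]

    # IDs must be unique integers starting from 0. Assumption of this
    # sketch: they are also contiguous, i.e. exactly 0..n-1.
    if sorted(ids) != list(range(len(nodes))):
        problems.append("node IDs must be the unique integers 0..n-1")
    if len(set(names)) != len(names):
        problems.append("node names must be unique")

    # Every link endpoint must refer to a known node ID.
    known_ids = set(ids)
    for link in topology.get("links", []):
        for end in ("source", "target"):
            if link[end] not in known_ids:
                problems.append(f"link references unknown node ID {link[end]}")

    # external_nodes is optional and lists node names, not IDs.
    for name in topology.get("external_nodes", []):
        if name not in names:
            problems.append(f"external node {name!r} is not a known node name")
    return problems


sample = json.loads("""
{
  "nodes": [{"id": 0, "name": "NodeA"}, {"id": 1, "name": "NodeB"}, {"id": 2, "name": "NodeC"}],
  "links": [{"source": 0, "target": 1}, {"source": 1, "target": 2}],
  "external_nodes": ["NodeA", "NodeB", "NodeC"]
}
""")
print(validate_topology(sample))  # prints [] for the sample topology
```

For a file on disk, the same function can be applied to the result of `json.load(open_file)`.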

Telemetry Format

The telemetry file (telemetry.pkl) is a pandas DataFrame where each row is a network snapshot.

Columns

Metadata Columns

timestamp (str): Snapshot timestamp in format "YYYY/MM/DD HH:MM UTC"
telemetry_perturbed_type (str): Type of perturbation applied to counters (e.g., "scaled", "NONE")
input_perturbed_type (str): Type of perturbation applied to demands (e.g., "none", "NONE")
true_detect_inconsistent (bool): Ground truth flag indicating if inconsistency exists
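
As an illustration of the timestamp format, the sketch below converts between the string form and a timezone-aware `datetime` using only the standard library. The helper names and the sample timestamp are made up for this example.

```python
from datetime import datetime, timezone

# strptime/strftime pattern for "YYYY/MM/DD HH:MM UTC"
TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M UTC"


def parse_timestamp(text):
    """Parse a snapshot timestamp string into a timezone-aware datetime."""
    # The trailing "UTC" is matched as literal text, so the timezone
    # has to be attached explicitly afterwards.
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def format_timestamp(moment):
    """Format a timezone-aware datetime as a snapshot timestamp string."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


parsed = parse_timestamp("2004/03/01 00:05 UTC")  # hypothetical value
print(parsed.isoformat())
print(format_timestamp(parsed))
```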

Demand Columns

Format: high_{src_node}_{dst_node}

Example: high_NodeA_NodeB = traffic demand from NodeA to NodeB

Value: Float representing traffic volume

Counter Columns

Format: low_{node}_{interface_type}_{neighbor} or low_{node}_{external_type}

Examples:

low_NodeA_egress_to_NodeB: Traffic sent from NodeA to NodeB
low_NodeB_ingress_from_NodeA: Traffic received by NodeB from NodeA
low_NodeA_origination: External traffic entering network at NodeA
low_NodeA_termination: External traffic leaving network at NodeA

Value: Dictionary with keys:

{
    'ground_truth': float,    # Original correct value (for evaluation)
    'perturbed': float,       # Measured value with errors
    'corrected': float,       # Repaired value (added by CrossCheck)
    'confidence': float       # Repair confidence 0-1 (added by CrossCheck)
}
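
The counter naming formats and the value dictionary can be handled with a few lines of standard-library Python. This is a sketch under assumptions: the helper names are hypothetical, node names are assumed not to contain `_egress_to_` or `_ingress_from_`, and preferring `corrected` over `perturbed` is just one reasonable policy, not something CrossCheck prescribes.

```python
import re

# Patterns mirror the two counter formats described above.
LINK_COUNTER = re.compile(
    r"^low_(?P<node>.+)_(?P<kind>egress_to|ingress_from)_(?P<neighbor>.+)$"
)
EXTERNAL_COUNTER = re.compile(
    r"^low_(?P<node>.+)_(?P<kind>origination|termination)$"
)


def parse_counter_column(column):
    """Split a low_* column name into (node, kind, neighbor).

    neighbor is None for origination/termination counters.
    """
    match = LINK_COUNTER.match(column)
    if match:
        return match["node"], match["kind"], match["neighbor"]
    match = EXTERNAL_COUNTER.match(column)
    if match:
        return match["node"], match["kind"], None
    raise ValueError(f"not a counter column: {column!r}")


def counter_value(cell):
    """Return the repaired value if present, else the measured one."""
    # 'corrected' only exists after CrossCheck has run on the data.
    if cell.get("corrected") is not None:
        return cell["corrected"]
    return cell["perturbed"]


print(parse_counter_column("low_NodeA_egress_to_NodeB"))
print(counter_value({"ground_truth": 10.0, "perturbed": 12.0}))
```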

Interface Naming Convention

For each bidirectional link between NodeA and NodeB, there are 4 counter columns:

low_NodeA_egress_to_NodeB: Traffic leaving NodeA toward NodeB (reported by NodeA)
low_NodeB_ingress_from_NodeA: Traffic entering NodeB from NodeA (reported by NodeB)
low_NodeB_egress_to_NodeA: Traffic leaving NodeB toward NodeA (reported by NodeB)
low_NodeA_ingress_from_NodeB: Traffic entering NodeA from NodeB (reported by NodeA)

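The first two counters measure the same NodeA-to-NodeB flow from the sending and receiving ends, and the last two do the same for the NodeB-to-NodeA flow. The two readings of each flow should agree but may be inconsistent due to measurement errors.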
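
To make the pairing concrete, the sketch below builds the four column names for a link and compares the sender-side and receiver-side readings of each direction. It is an illustration only and not CrossCheck's detection method; the function names, the sample values, and the `rel_tol` threshold are all assumptions of this example.

```python
def link_counter_columns(node_a, node_b):
    """Return the 4 counter column names for one link, paired by direction."""
    return {
        (node_a, node_b): (f"low_{node_a}_egress_to_{node_b}",
                           f"low_{node_b}_ingress_from_{node_a}"),
        (node_b, node_a): (f"low_{node_b}_egress_to_{node_a}",
                           f"low_{node_a}_ingress_from_{node_b}"),
    }


def link_discrepancies(row, node_a, node_b, rel_tol=0.02):
    """Map each direction to (relative gap, exceeds rel_tol).

    row is any mapping from column name to counter dict. rel_tol is a
    hypothetical threshold chosen for this sketch.
    """
    report = {}
    pairs = link_counter_columns(node_a, node_b)
    for direction, (egress_col, ingress_col) in pairs.items():
        sent = row[egress_col]["perturbed"]
        received = row[ingress_col]["perturbed"]
        scale = max(sent, received)
        gap = abs(sent - received) / scale if scale > 0 else 0.0
        report[direction] = (gap, gap > rel_tol)
    return report


# Made-up readings: the A-to-B direction disagrees, B-to-A agrees.
row = {
    "low_NodeA_egress_to_NodeB": {"perturbed": 100.0},
    "low_NodeB_ingress_from_NodeA": {"perturbed": 90.0},
    "low_NodeB_egress_to_NodeA": {"perturbed": 50.0},
    "low_NodeA_ingress_from_NodeB": {"perturbed": 50.0},
}
report = link_discrepancies(row, "NodeA", "NodeB")
for direction, (gap, flagged) in report.items():
    print(direction, round(gap, 3), flagged)
```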

External Interfaces

For nodes with external traffic (specified in external_nodes):

low_{node}_origination: Traffic entering the network at this node
low_{node}_termination: Traffic leaving the network at this node

Creating Your Own Data

1. Topology

Create a JSON file with your network structure:

import json

topology = {
    "nodes": [
        {"id": 0, "name": "Router1"},
        {"id": 1, "name": "Router2"},
        # ... more nodes
    ],
    "links": [
        {"source": 0, "target": 1},
        # ... more links
    ],
    "external_nodes": ["Router1", "Router2"]  # Optional
}

with open('my_topology.json', 'w') as f:
    json.dump(topology, f, indent=2)

2. Telemetry

Create a pandas DataFrame with the required columns:

import pandas as pd

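# Placeholders to supply yourself: timestamps, node_pairs (pairs of node
# names), and the get_*() functions, which stand in for your own data
# sources. `topology` is the dict built in step 1.

# Create rows (one per snapshot)
data = []
for timestamp in timestamps: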
    row = {
        'timestamp': timestamp,
        'telemetry_perturbed_type': 'measured',
        'input_perturbed_type': 'NONE',
        'true_detect_inconsistent': False,
    }

    # Add demands (high_*)
    for src, dst in node_pairs:
        row[f'high_{src}_{dst}'] = get_demand(src, dst, timestamp)

    # Add counters (low_*)
    for link in topology['links']:
        src = topology['nodes'][link['source']]['name']
        dst = topology['nodes'][link['target']]['name']

        # Add interface counters as dicts
        row[f'low_{src}_egress_to_{dst}'] = {
            'ground_truth': None,  # Optional for real data
            'perturbed': get_counter(src, dst, timestamp)
        }
        # ... add ingress and reverse direction counters

    # Add external interfaces if applicable
    for node in topology.get('external_nodes', []):
        row[f'low_{node}_origination'] = {
            'ground_truth': None,
            'perturbed': get_external_ingress(node, timestamp)
        }
        row[f'low_{node}_termination'] = {
            'ground_truth': None,
            'perturbed': get_external_egress(node, timestamp)
        }

    data.append(row)

df = pd.DataFrame(data)
df.to_pickle('my_telemetry.pkl')

Data Requirements

Complete coverage: All node pairs must have demand columns, all links must have counter columns
Consistent naming: Use exact node names from topology
Valid values: All numeric values must be non-negative floats
Bidirectional: For each link, include both directions (egress and ingress from both perspectives)
External interfaces: If any node has *_origination or *_termination, include it in external_nodes
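
The coverage requirement can be turned into a concrete column list. The sketch below derives every column name a topology implies and reports which ones are missing; `expected_columns` and `missing_columns` are hypothetical helper names, and "all node pairs" is read as all ordered pairs of distinct nodes.

```python
from itertools import permutations


def expected_columns(topology):
    """Return (demand_columns, counter_columns) implied by a topology dict."""
    names = {node["id"]: node["name"] for node in topology["nodes"]}

    # One demand column per ordered pair of distinct nodes.
    demands = [f"high_{src}_{dst}"
               for src, dst in permutations(names.values(), 2)]

    counters = []
    for link in topology["links"]:
        a, b = names[link["source"]], names[link["target"]]
        # Both directions, each seen from the sender and from the receiver.
        for src, dst in ((a, b), (b, a)):
            counters.append(f"low_{src}_egress_to_{dst}")
            counters.append(f"low_{dst}_ingress_from_{src}")
    for node in topology.get("external_nodes", []):
        counters.append(f"low_{node}_origination")
        counters.append(f"low_{node}_termination")
    return demands, counters


def missing_columns(topology, present):
    """List expected columns absent from an iterable of column names."""
    demands, counters = expected_columns(topology)
    present = set(present)
    return [col for col in demands + counters if col not in present]


topology = {
    "nodes": [{"id": 0, "name": "NodeA"}, {"id": 1, "name": "NodeB"},
              {"id": 2, "name": "NodeC"}],
    "links": [{"source": 0, "target": 1}, {"source": 1, "target": 2}],
    "external_nodes": ["NodeA", "NodeB", "NodeC"],
}
demands, counters = expected_columns(topology)
print(len(demands), len(counters))  # prints: 6 14
```

With a loaded DataFrame `df`, calling `missing_columns(topology, df.columns)` would list any gaps.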

Common Issues

Missing columns: Ensure all node pairs have demands and all links have counters
Name mismatches: Node names in telemetry columns must exactly match topology
Incorrect format: Counter columns must be dicts, demand columns must be floats
Negative values: Network counters cannot be negative
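
Several of these issues can be caught with a per-row check. The following is a rough sketch, assuming each row is available as a plain mapping from column name to value; `check_row` is a hypothetical helper and only inspects the `perturbed` entry of each counter.

```python
import math
from numbers import Real


def _is_non_negative_number(value):
    """True for real numbers that are not NaN and not negative."""
    return (isinstance(value, Real) and not isinstance(value, bool)
            and not math.isnan(value) and value >= 0)


def check_row(row):
    """Return a list of format problems in one snapshot row."""
    problems = []
    for column, value in row.items():
        if column.startswith("high_"):
            # Demands are plain non-negative numbers.
            if not _is_non_negative_number(value):
                problems.append(f"{column}: demand must be a non-negative float")
        elif column.startswith("low_"):
            # Counters are dicts carrying at least a 'perturbed' value.
            if not isinstance(value, dict):
                problems.append(f"{column}: counter must be a dict")
            elif not _is_non_negative_number(value.get("perturbed")):
                problems.append(f"{column}: 'perturbed' must be a non-negative float")
    return problems


bad_row = {
    "high_NodeA_NodeB": -1.0,                      # negative demand
    "low_NodeA_egress_to_NodeB": 5.0,              # counter is not a dict
    "low_NodeA_origination": {"perturbed": -3.0},  # negative counter
}
for problem in check_row(bad_row):
    print(problem)
```

For a pandas DataFrame, each row could be passed in as `row.to_dict()` while iterating with `df.iterrows()`.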

For more details, see the main README.md and API documentation.

Abilene Network Data

The sample_data/ directory also contains a subset of real network telemetry from the Abilene academic backbone network.

Files

abilene_subset.pkl: 100 snapshots of real network telemetry (March 1-5, 2004)
abilene_topology.json: Abilene network topology (12 nodes, 15 links)
abilene_paths.json: Routing paths for the Abilene network

Network Structure

The Abilene network consists of 12 nodes located in 11 major US cities (Atlanta hosts two):

ATLAM5, ATLAng (Atlanta)
CHINng (Chicago)
DNVRng (Denver)
HSTNng (Houston)
IPLSng (Indianapolis)
KSCYng (Kansas City)
LOSAng (Los Angeles)
NYCMng (New York)
SNVAng (Sunnyvale/San Jose)
STTLng (Seattle)
WASHng (Washington DC)

Data Format

The Abilene data follows the same format as the synthetic data (see above), with:

Low-level interface counters (low_*)
High-level traffic demands (high_*)
Metadata columns

The main differences:

Larger scale: 12 nodes vs 3 nodes, 15 links vs 2 links
Real traffic patterns: Actual measured traffic from 2004
More counters: 84 counter columns (vs 14 in the simple example: 8 link counters plus 6 external counters)
More demands: 132 demand columns, one per ordered node pair (vs 6 in the simple example)
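
The column counts follow from the naming rules: one demand per ordered pair of distinct nodes, four counters per link, and two counters per external node. The short calculation below reproduces the figures. Treating all 12 Abilene nodes as external is an assumption of this sketch; it is not stated here, but it is the value that yields 84. `column_counts` is a hypothetical helper.

```python
def column_counts(num_nodes, num_links, num_external):
    """Return (demand columns, counter columns) for a network of this size."""
    demands = num_nodes * (num_nodes - 1)        # ordered pairs of distinct nodes
    counters = 4 * num_links + 2 * num_external  # link counters + external counters
    return demands, counters


print(column_counts(3, 2, 3))     # simple example: (6, 14)
print(column_counts(12, 15, 12))  # Abilene, assuming 12 external nodes: (132, 84)
```

For the simple example, the 14 counters split into 8 link counters and 6 external counters.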

Dataset Origin

The Abilene dataset is a well-known network research dataset originally collected from the Internet2 Abilene network. The data has been preprocessed into the CrossCheck format with clean ground truth values.

Generating the Subset

The Abilene subset was extracted using generate_abilene_subset.py:

python3 generate_abilene_subset.py

This extracts the first 100 snapshots from the full dataset (4008 snapshots total), reducing the file size from 34 MB to 0.6 MB for easier distribution.
