Sample Data Format

This directory contains sample network data for demonstrating CrossCheck.

Files

  • topology.json: Network topology (3 nodes, linear topology A-B-C)
  • telemetry.pkl: Telemetry data (5 snapshots with perturbed counters)
  • paths/: Directory containing routing paths for each timestamp

Topology Format

The topology file (topology.json) defines the network structure:

{
  "nodes": [
    {"id": 0, "name": "NodeA"},
    {"id": 1, "name": "NodeB"},
    {"id": 2, "name": "NodeC"}
  ],
  "links": [
    {"source": 0, "target": 1},
    {"source": 1, "target": 2}
  ],
  "external_nodes": ["NodeA", "NodeB", "NodeC"]
}

Fields:

  • nodes: List of network nodes with unique integer IDs and string names
  • links: List of bidirectional links between nodes (specified by node IDs)
  • external_nodes (optional): Nodes with external traffic (ingress/egress to/from network)

Notes:

  • All links are treated as bidirectional
  • Node IDs must be unique integers starting from 0
  • Node names must be unique strings
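The notes above can be checked mechanically before running anything else. The following is a minimal sketch, not CrossCheck's own validation; the function name `validate_topology` and the wording of the messages are assumptions made for illustration.

```python
def validate_topology(topology):
    """Return a list of problems found in a topology dict (empty if none)."""
    problems = []
    ids = [node["id"] for node in topology["nodes"]]
    names = [node["name"] for node in topology["nodes"]]

    # Node IDs: unique and starting from 0
    if len(set(ids)) != len(ids):
        problems.append("node IDs are not unique")
    if ids and min(ids) != 0:
        problems.append("node IDs do not start from 0")

    # Node names: unique
    if len(set(names)) != len(names):
        problems.append("node names are not unique")

    # Links must reference existing node IDs
    known_ids = set(ids)
    for link in topology["links"]:
        for end in ("source", "target"):
            if link[end] not in known_ids:
                problems.append(f"link references unknown node ID {link[end]}")

    # external_nodes is optional and lists node names, not IDs
    for name in topology.get("external_nodes", []):
        if name not in names:
            problems.append(f"external node {name!r} is not a known node name")
    return problems


sample = {
    "nodes": [
        {"id": 0, "name": "NodeA"},
        {"id": 1, "name": "NodeB"},
        {"id": 2, "name": "NodeC"},
    ],
    "links": [{"source": 0, "target": 1}, {"source": 1, "target": 2}],
    "external_nodes": ["NodeA", "NodeB", "NodeC"],
}
print(validate_topology(sample))  # prints []
```

To check a file on disk, load it with `json.load` and pass the resulting dict to the function.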

Telemetry Format

The telemetry file (telemetry.pkl) is a pandas DataFrame where each row is a network snapshot.

Columns

Metadata Columns

  • timestamp (str): Snapshot timestamp in format "YYYY/MM/DD HH:MM UTC"
  • telemetry_perturbed_type (str): Type of perturbation applied to counters (e.g., "scaled", "NONE")
  • input_perturbed_type (str): Type of perturbation applied to demands (e.g., "none", "NONE")
  • true_detect_inconsistent (bool): Ground truth flag indicating if inconsistency exists
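The timestamp format above maps directly onto a `strptime` pattern. A small sketch, assuming the trailing `UTC` is always present as a literal; the helper name `parse_timestamp` and the sample timestamp are illustrative, not taken from the sample files.

```python
from datetime import datetime, timezone

# Matches "YYYY/MM/DD HH:MM UTC"; "UTC" is treated as literal text
TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M UTC"


def parse_timestamp(text):
    """Convert a snapshot timestamp string into a timezone-aware datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


parsed = parse_timestamp("2004/03/01 00:05 UTC")  # hypothetical value
print(parsed.isoformat())  # 2004-03-01T00:05:00+00:00
```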

Demand Columns

Format: high_{src_node}_{dst_node}

Example: high_NodeA_NodeB = traffic demand from NodeA to NodeB

Value: Float representing traffic volume
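As an illustration of the naming scheme, the expected demand columns can be generated from the node names. This sketch assumes one column per ordered pair of distinct nodes, which matches the counts quoted in this README (6 demands for 3 nodes); `expected_demand_columns` is a hypothetical helper.

```python
from itertools import permutations


def expected_demand_columns(node_names):
    """Build high_{src}_{dst} for every ordered pair of distinct nodes."""
    return [f"high_{src}_{dst}" for src, dst in permutations(node_names, 2)]


columns = expected_demand_columns(["NodeA", "NodeB", "NodeC"])
print(len(columns))  # 6
print(columns[0])    # high_NodeA_NodeB
```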

Counter Columns

Format: low_{node}_{interface_type}_{neighbor} or low_{node}_{external_type}

Examples:

  • low_NodeA_egress_to_NodeB: Traffic sent from NodeA to NodeB
  • low_NodeB_ingress_from_NodeA: Traffic received by NodeB from NodeA
  • low_NodeA_origination: External traffic entering network at NodeA
  • low_NodeA_termination: External traffic leaving network at NodeA

Value: Dictionary with keys:

{
    'ground_truth': float,    # Original correct value (for evaluation)
    'perturbed': float,       # Measured value with errors
    'corrected': float,       # Repaired value (added by CrossCheck)
    'confidence': float       # Repair confidence 0-1 (added by CrossCheck)
}
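Because each counter cell is a dictionary rather than a number, a common first step is to pull one field out of every `low_*` column. The sketch below works on a single snapshot represented as a plain dict; the helper name `counter_values` and the numbers in `row` are made up for illustration.

```python
def counter_values(row, field="perturbed"):
    """Map each low_* column in a row to one field of its counter dict."""
    return {
        name: cell.get(field)
        for name, cell in row.items()
        if name.startswith("low_") and isinstance(cell, dict)
    }


# Illustrative snapshot; values are invented
row = {
    "timestamp": "2004/03/01 00:00 UTC",
    "high_NodeA_NodeB": 10.0,
    "low_NodeA_egress_to_NodeB": {"ground_truth": 10.0, "perturbed": 12.5},
    "low_NodeB_ingress_from_NodeA": {"ground_truth": 10.0, "perturbed": 10.0},
}
print(counter_values(row))
print(counter_values(row, field="corrected"))  # None where the key is absent
```

With a DataFrame, one way to obtain such a dict for row `i` is `df.iloc[i].to_dict()`.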

Interface Naming Convention

For each bidirectional link between NodeA and NodeB, there are 4 counter columns:

  1. low_NodeA_egress_to_NodeB: Traffic leaving NodeA toward NodeB (reported by NodeA)
  2. low_NodeB_ingress_from_NodeA: Traffic entering NodeB from NodeA (reported by NodeB)
  3. low_NodeB_egress_to_NodeA: Traffic leaving NodeB toward NodeA (reported by NodeB)
  4. low_NodeA_ingress_from_NodeB: Traffic entering NodeA from NodeB (reported by NodeA)

Counters 1 and 2 measure the same physical flow (NodeA to NodeB) from its two ends, and counters 3 and 4 do the same for the reverse direction. The two readings in each pair may disagree because of measurement errors.
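To make the pairing concrete, the sketch below compares the sender's egress counter with the receiver's ingress counter for one direction of a link. It is only an illustration of the naming convention, not CrossCheck's detection logic; `link_mismatch` and the sample values are assumptions.

```python
def link_mismatch(row, sender, receiver, field="perturbed"):
    """Relative gap between what `sender` reports sending and `receiver` reports receiving."""
    sent = row[f"low_{sender}_egress_to_{receiver}"][field]
    received = row[f"low_{receiver}_ingress_from_{sender}"][field]
    largest = max(abs(sent), abs(received))
    # Two zero readings agree perfectly; avoid dividing by zero
    return 0.0 if largest == 0 else abs(sent - received) / largest


# Invented readings for the NodeA -> NodeB direction
row = {
    "low_NodeA_egress_to_NodeB": {"perturbed": 100.0},
    "low_NodeB_ingress_from_NodeA": {"perturbed": 80.0},
}
print(link_mismatch(row, "NodeA", "NodeB"))  # 0.2
```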

External Interfaces

For nodes with external traffic (specified in external_nodes):

  • low_{node}_origination: Traffic entering the network at this node
  • low_{node}_termination: Traffic leaving the network at this node
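Together with the link counters, the external counters allow a per-node balance check. The sketch below assumes traffic is conserved at a node (everything that enters from neighbors or from outside either leaves toward a neighbor or terminates there); that invariant is an assumption of this example, and `node_imbalance` is a hypothetical helper rather than part of CrossCheck.

```python
def node_imbalance(row, node, neighbors, field="perturbed"):
    """Inflow minus outflow at `node`; 0 means the counters balance."""
    inflow = row[f"low_{node}_origination"][field]   # enters from outside
    outflow = row[f"low_{node}_termination"][field]  # leaves to outside
    for neighbor in neighbors:
        inflow += row[f"low_{node}_ingress_from_{neighbor}"][field]
        outflow += row[f"low_{node}_egress_to_{neighbor}"][field]
    return inflow - outflow


# Invented readings for NodeA, whose only neighbor is NodeB
row = {
    "low_NodeA_origination": {"perturbed": 30.0},
    "low_NodeA_termination": {"perturbed": 20.0},
    "low_NodeA_ingress_from_NodeB": {"perturbed": 20.0},
    "low_NodeA_egress_to_NodeB": {"perturbed": 30.0},
}
print(node_imbalance(row, "NodeA", ["NodeB"]))  # 0.0
```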

Creating Your Own Data

1. Topology

Create a JSON file with your network structure:

import json

topology = {
    "nodes": [
        {"id": 0, "name": "Router1"},
        {"id": 1, "name": "Router2"},
        # ... more nodes
    ],
    "links": [
        {"source": 0, "target": 1},
        # ... more links
    ],
    "external_nodes": ["Router1", "Router2"]  # Optional
}

with open('my_topology.json', 'w') as f:
    json.dump(topology, f, indent=2)

2. Telemetry

Create a pandas DataFrame with the required columns:

import pandas as pd

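# Create rows (one per snapshot).
# `timestamps`, `node_pairs`, and the get_* functions are placeholders
# for your own data source; `topology` is the dict from step 1.
data = []
for timestamp in timestamps: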
    row = {
        'timestamp': timestamp,
        'telemetry_perturbed_type': 'measured',
        'input_perturbed_type': 'NONE',
        'true_detect_inconsistent': False,
    }

    # Add demands (high_*)
    for src, dst in node_pairs:
        row[f'high_{src}_{dst}'] = get_demand(src, dst, timestamp)

    # Add counters (low_*)
    # Indexing topology['nodes'] by ID assumes nodes are listed in ID order (0, 1, 2, ...)
    for link in topology['links']:
        src = topology['nodes'][link['source']]['name']
        dst = topology['nodes'][link['target']]['name']

        # Add interface counters as dicts
        row[f'low_{src}_egress_to_{dst}'] = {
            'ground_truth': None,  # Optional for real data
            'perturbed': get_counter(src, dst, timestamp)
        }
        # ... add ingress and reverse direction counters

    # Add external interfaces if applicable
    for node in topology.get('external_nodes', []):
        row[f'low_{node}_origination'] = {
            'ground_truth': None,
            'perturbed': get_external_ingress(node, timestamp)
        }
        row[f'low_{node}_termination'] = {
            'ground_truth': None,
            'perturbed': get_external_egress(node, timestamp)
        }

    data.append(row)

df = pd.DataFrame(data)
df.to_pickle('my_telemetry.pkl')

Data Requirements

  1. Complete coverage: All node pairs must have demand columns, and all links must have counter columns
  2. Consistent naming: Use exact node names from topology
  3. Valid values: All numeric values must be non-negative floats (ground_truth may be None for real data, as in the example above)
  4. Bidirectional: For each link, include both directions (egress and ingress from both perspectives)
  5. External interfaces: If any node has *_origination or *_termination, include it in external_nodes

Common Issues

  • Missing columns: Ensure all node pairs have demands and all links have counters
  • Name mismatches: Node names in telemetry columns must exactly match topology
  • Incorrect format: Counter columns must be dicts, demand columns must be floats
  • Negative values: Network counters cannot be negative
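Most of the issues above can be caught with a small pre-flight check. The following is a sketch under the naming rules described in this README, not CrossCheck's own validation; `expected_columns` and `check_row` are hypothetical helpers, and the sample row is deliberately broken to show the messages.

```python
from itertools import permutations


def expected_columns(topology):
    """List the demand and counter column names a topology implies."""
    names = {node["id"]: node["name"] for node in topology["nodes"]}
    demands = [f"high_{s}_{d}" for s, d in permutations(names.values(), 2)]
    counters = []
    for link in topology["links"]:
        a, b = names[link["source"]], names[link["target"]]
        for sender, receiver in ((a, b), (b, a)):
            counters.append(f"low_{sender}_egress_to_{receiver}")
            counters.append(f"low_{receiver}_ingress_from_{sender}")
    for node in topology.get("external_nodes", []):
        counters += [f"low_{node}_origination", f"low_{node}_termination"]
    return demands, counters


def check_row(row, topology):
    """Return a list of problems for one snapshot row (empty if none found)."""
    demands, counters = expected_columns(topology)
    problems = []
    for col in demands:
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], float) or row[col] < 0:
            problems.append(f"demand must be a non-negative float: {col}")
    for col in counters:
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], dict) or "perturbed" not in row[col]:
            problems.append(f"counter must be a dict with a 'perturbed' key: {col}")
        elif row[col]["perturbed"] < 0:
            problems.append(f"negative counter: {col}")
    return problems


topology = {
    "nodes": [{"id": 0, "name": "NodeA"}, {"id": 1, "name": "NodeB"}],
    "links": [{"source": 0, "target": 1}],
    "external_nodes": ["NodeA"],
}
row = {
    "high_NodeA_NodeB": 5.0,
    "low_NodeA_egress_to_NodeB": {"perturbed": 5.0},
    "low_NodeB_ingress_from_NodeA": {"perturbed": -1.0},  # negative value
    "low_NodeB_egress_to_NodeA": {"perturbed": 0.0},
    "low_NodeA_origination": 5.0,                         # float, not a dict
    "low_NodeA_termination": {"perturbed": 0.0},
}
for problem in check_row(row, topology):
    print(problem)
```

With a DataFrame, each row can be passed as `df.iloc[i].to_dict()`.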

For more details, see the main README.md and API documentation.


Abilene Network Data

The sample_data/ directory also contains a subset of real network telemetry from the Abilene academic backbone network.

Files

  • abilene_subset.pkl: 100 snapshots of real network telemetry (March 1-5, 2004)
  • abilene_topology.json: Abilene network topology (12 nodes, 15 links)
  • abilene_paths.json: Routing paths for the Abilene network

Network Structure

The Abilene network consists of 12 nodes in 11 major US cities (Atlanta hosts two):

  • ATLAM5, ATLAng (Atlanta)
  • CHINng (Chicago)
  • DNVRng (Denver)
  • HSTNng (Houston)
  • IPLSng (Indianapolis)
  • KSCYng (Kansas City)
  • LOSAng (Los Angeles)
  • NYCMng (New York)
  • SNVAng (Sunnyvale/San Jose)
  • STTLng (Seattle)
  • WASHng (Washington DC)

Data Format

The Abilene data follows the same format as the synthetic data (see above), with:

  • Low-level interface counters (low_*)
  • High-level traffic demands (high_*)
  • Metadata columns

The main differences:

  • Larger scale: 12 nodes vs 3 nodes, 15 links vs 2 links
  • Real traffic patterns: Actual measured traffic from 2004
  • More counters: 84 interface counters (vs 8 link counters in the simple example)
  • More demands: 132 demand columns, one per ordered node pair (vs 6 in the simple example)
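The counts above follow from the naming rules: one demand per ordered node pair, four counters per link, and two counters per external node. The sketch below reproduces them; note that the figure of 84 only works out if all 12 Abilene nodes are treated as external (60 link counters plus 24 external counters), which is an inference from the arithmetic rather than something stated in the data files.

```python
def column_counts(num_nodes, num_links, num_external):
    """Number of demand and counter columns implied by a topology's size."""
    return {
        "demands": num_nodes * (num_nodes - 1),  # ordered pairs of distinct nodes
        "link_counters": 4 * num_links,          # egress + ingress, both directions
        "external_counters": 2 * num_external,   # origination + termination
    }


print(column_counts(3, 2, 3))
# {'demands': 6, 'link_counters': 8, 'external_counters': 6}
print(column_counts(12, 15, 12))
# {'demands': 132, 'link_counters': 60, 'external_counters': 24}
```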

Dataset Origin

The Abilene dataset is a well-known network research dataset originally collected from the Internet2 Abilene network. The data has been preprocessed into the CrossCheck format with clean ground truth values.

Generating the Subset

The Abilene subset was extracted using generate_abilene_subset.py:

python3 generate_abilene_subset.py

This extracts the first 100 snapshots from the full dataset (4008 snapshots total), reducing the file size from 34 MB to 0.6 MB for easier distribution.