Configuration

The configuration file is the only argument passed to ElastiSim and defines the parameters determining the conditions of the simulation. Following the JSON file format, users define configuration parameters by their keys and values. The following table lists all available configuration options.

Key Description Value type Default value Mandatory
jobs_file Path to the jobs file string - Yes
platform_file Path to the platform file string - Yes
zmq_url URL to establish the connection between the simulator and the scheduler process string - Yes
schedule_on_job_submit Whether a job submission triggers a scheduling algorithm invocation bool false Yes, if scheduling_interval is false
schedule_on_job_finalize Whether a job finalization triggers a scheduling algorithm invocation bool false Yes, if scheduling_interval is false
schedule_on_scheduling_point Whether a job reaching a scheduling point triggers a scheduling algorithm invocation bool false No
schedule_on_reconfiguration Whether a job reconfiguration triggers a scheduling algorithm invocation bool false No
scheduling_interval Invocation interval of the scheduling algorithm integer (seconds) 0 (disabled) Yes, if schedule_on_job_submit or schedule_on_job_finalize is false
min_scheduling_interval Minimum time between two scheduling algorithm invocations integer (seconds) 0 (disabled) No
allow_oversubscription Whether the scheduler can oversubscribe compute nodes with multiple jobs bool false No
clip_evolving_requests Whether evolving requests are clipped to stay in the possible range of configurations (i.e., [num_nodes_min, num_nodes_max]) bool true No
forward_io_information Whether the scheduler receives I/O information (PFS read/write bandwidth and utilization) bool false No
job_kill_grace_period Time to wait to kill the job after exceeding its walltime integer (seconds) 0 No
show_progress_bar Whether the progress bar is shown (only shown when log level is higher than info) bool true No
sensing Whether the monitoring module is active bool false No
sensing_interval Interval of the monitoring module to sense platform utilization parameters integer (seconds) - Yes, if sensing is true
log_task_times Whether task time logging is active bool false No
pfs_read_links PFS read links to sense by the monitoring module array of strings empty array No
pfs_write_links PFS write links to sense by the monitoring module array of strings empty array No
job_statistics Output path to write the job statistics file string - Yes
node_utilization Output path to write the compute node utilization file string - Yes
cpu_utilization Output path to write the CPU utilization file string - Yes, if sensing is true
network_activity Output path to write the network activity file string - Yes, if sensing is true
pfs_utilization Output path to write the PFS utilization file string - Yes, if sensing is true
gpu_utilization Output path to write the GPU utilization file string - Yes, if sensing is true
task_times Output path to write the task times file string - Yes, if log_task_times is true

Setting the sensing interval too small can significantly increase simulation times, as the discrete-event simulation engine will fire an event at each sensing interval. Logging task times can also introduce a significant overhead.

ElastiSim supports node migration (i.e., transferring nodes from one job to another) in a single scheduling step when the decision is taken at an invocation triggered by a scheduling point (requires schedule_on_scheduling_point to be true) or evolving request.

schedule_on_reconfiguration triggers the scheduling algorithm after applying a pending resource reconfiguration but before executing a potential on_reconfiguration or on_expansion phase. This invocation trigger enables scheduling decisions when resources change their state, such as nodes becoming free after a shrink operation when a job reaches its next scheduling point. However, a job reconfigured during a scheduling point or evolving request will not trigger the algorithm again when the reconfiguration is applied.

Example configuration

{
  "jobs_file": "/path/to/jobs.json",
  "platform_file": "/path/to/platform.xml",
  "zmq_url": "ipc:///tmp/elastisim.ipc",
  "schedule_on_job_submit": true,
  "schedule_on_job_finalize": true,
  "schedule_on_scheduling_point": true,
  "scheduling_interval": 0,
  "min_scheduling_interval": 0,
  "allow_oversubscription": false,
  "clip_evolving_requests": true,
  "forward_io_information": true,
  "sensing": true,
  "sensing_interval": 1,
  "log_task_times": true,
  "pfs_read_links": ["PFS_read"],
  "pfs_write_links": ["PFS_write"],
  "job_statistics": "/path/to/job_statistics.csv",
  "cpu_utilization": "/path/to/cpu_utilization.csv",
  "node_utilization": "/path/to/node_utilization.csv",
  "network_activity": "/path/to/network_activity.csv",
  "pfs_utilization": "/path/to/pfs_utilization.csv",
  "gpu_utilization": "/path/to/gpu_utilization.csv",
  "task_times": "/path/to/task_times.csv"
}

Copyright © 2023, Technical University of Darmstadt.