46 changes: 46 additions & 0 deletions README.md
@@ -11,6 +11,24 @@ End-to-end tests for Deckhouse storage components.
5. Write your test in `tests/<your-test-name>/<your-test-name>_test.go` (Section marked `---=== TESTS START HERE ===---`)
6. Run the test: `go test -timeout=240m -v ./tests/<your-test-name> -count=1`

### Run using an existing cluster (no VM creation)

Use this mode to run tests against a cluster that is already running (faster iterations, no virtualization/VM setup).

1. Set the cluster creation mode to use an existing cluster:
```bash
export TEST_CLUSTER_CREATE_MODE=alwaysUseExisting
```
2. Point SSH to the **test cluster** (the Kubernetes API master you want to run tests on):
- **Direct access:** `SSH_HOST` = IP/hostname of the cluster master, `SSH_USER` = user that can run `sudo cat /etc/kubernetes/admin.conf` on that host.
- **Via jump host:** set `SSH_JUMP_HOST` and, optionally, `SSH_JUMP_USER` and `SSH_JUMP_KEY_PATH`; `SSH_HOST`/`SSH_USER` then refer to the target cluster master.
3. Source the rest of your test env (e.g. `source tests/<your-test-name>/test_exports`), then run:
```bash
go test -timeout=240m -v ./tests/<your-test-name> -count=1
```

The kubeconfig is written to `temp/<test-name>/` (e.g. `temp/sds_node_configurator_test/kubeconfig-<master-ip>.yml`). The framework acquires a cluster lock so that only one test run uses the cluster at a time. If a previous run left the lock behind (crash, Ctrl+C), set `TEST_CLUSTER_FORCE_LOCK_RELEASE=true` for the next run (do not use this if another test might be using the cluster).

The `-count=1` flag prevents Go from using cached test results.
The `240m` timeout is a global timeout for the entire testkit. Adjust it to your needs.

@@ -48,6 +66,23 @@ Designed to validate any CSI driver stability under high load with concurrent PV

Run the test: `go test -timeout=120m -v ./tests/csi-all-stress-tests -count=1`

### sds-node-configurator-stress-tests

Stress test for **sds-node-configurator**: ramps up independent **LVMVolumeGroups** on a single node (one VirtualDisk → one BlockDevice → one VG per slot). It empirically finds how many VGs the node and agent can reconcile, since LVM2 has no fixed VG count limit.

- Creates a nested cluster with `sds-node-configurator` and `sds-local-volume` (see `cluster_config.yml`)
- Implementation: `pkg/testkit/snc_max_vgs_stress.go` (`MaxVGsStressRunner`)
- Probe mode (default): passes if at least `STRESS_MAX_VG_MIN_READY` LVGs (default `1`) are `Ready`; see the report in the logs for the empirical ceiling
- Strict mode: `STRESS_MAX_VG_STRICT=true` requires all `STRESS_MAX_VG_TARGET` LVGs to become `Ready`

Run:

```bash
go test -timeout=240m -v ./tests/sds-node-configurator-stress-tests -count=1
```

Tuning: `STRESS_MAX_VG_TARGET` (default `30`), `STRESS_MAX_VG_BATCH_SIZE` (default `5`), `STRESS_MAX_VG_DISK_SIZE` (default `1Gi`), `STRESS_MAX_VG_MIN_READY`, `STRESS_MAX_VG_STRICT`.
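
The difference between probe and strict mode can be sketched in Go as follows. This is a hypothetical restatement of the pass criteria described in this section, not the `MaxVGsStressRunner` code; the function name `passes` and its parameters are assumptions made for illustration.

```go
package main

import "fmt"

// passes is a hypothetical restatement of the documented pass criteria.
// ready is the number of LVMVolumeGroups that reached Ready, target mirrors
// STRESS_MAX_VG_TARGET, minReady mirrors STRESS_MAX_VG_MIN_READY, and
// strict mirrors STRESS_MAX_VG_STRICT.
func passes(ready, target, minReady int, strict bool) bool {
	if strict {
		// Strict mode: every requested VG must be Ready.
		return ready >= target
	}
	// Probe mode: a minimum number of Ready VGs is enough.
	return ready >= minReady
}

func main() {
	fmt.Println(passes(12, 30, 1, false)) // probe mode, 12 of 30 Ready
	fmt.Println(passes(12, 30, 1, true))  // strict mode, 12 of 30 Ready
	fmt.Println(passes(30, 30, 1, true))  // strict mode, 30 of 30 Ready
}
```

In probe mode the interesting result is the Ready count reported in the logs rather than the pass/fail verdict itself.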


## Functions Glossary (exportable only)

@@ -71,6 +106,7 @@ See [pkg/FUNCTIONS_GLOSSARY.md](pkg/FUNCTIONS_GLOSSARY.md) for a full list of al
- `SSH_PUBLIC_KEY` -- Path to SSH public key file, or plain-text key content. Default: `~/.ssh/id_rsa.pub`
- `SSH_PASSPHRASE` -- Passphrase for the SSH private key. Required for non-interactive mode with encrypted keys
- `SSH_VM_USER` -- SSH user for connecting to VMs deployed inside the test cluster. Default: `cloud`
- `SSH_VM_PASSWORD` -- Password for SSH to VMs (e.g. `cloud`) when connecting from the jump host for `lsblk` checks. If set, `sshpass` is used; leave empty for key-based auth. Required when VMs accept only password auth.
- `SSH_JUMP_HOST` -- Jump host address for connecting to clusters behind a bastion
- `SSH_JUMP_USER` -- Jump host SSH user. Defaults to `SSH_USER` if jump host is set
- `SSH_JUMP_KEY_PATH` -- Jump host SSH key path. Defaults to `SSH_PRIVATE_KEY` if jump host is set
@@ -79,8 +115,10 @@ See [pkg/FUNCTIONS_GLOSSARY.md](pkg/FUNCTIONS_GLOSSARY.md) for a full list of al

- `YAML_CONFIG_FILENAME` -- Filename of the cluster definition YAML. Default: `cluster_config.yml`
- `TEST_CLUSTER_CLEANUP` -- Set to `true` to remove the test cluster after tests complete. Default: `false`
- `TEST_CLUSTER_RESUME` -- Set to `true` to continue from a previous failed run (only for `alwaysCreateNew`). If the test failed in the middle of cluster creation, re-run with `TEST_CLUSTER_RESUME=true`; the framework will load saved state from `temp/<test-name>/cluster-state.json` (written after step 6), restore VM hostnames, and run the remaining steps (connect to first master, add nodes, enable modules). Requires that step 6 (VMs created, VM info gathered) completed before the failure.
- `TEST_CLUSTER_NAMESPACE` -- Namespace for DKP cluster deployment. Default: `e2e-test-cluster`
- `KUBE_CONFIG_PATH` -- Path to a kubeconfig file. Used as fallback if SSH-based kubeconfig retrieval fails
- `KUBE_INSECURE_SKIP_TLS_VERIFY` -- Set to `true` to skip TLS certificate verification for the Kubernetes API (e.g. self-signed certs or tunnel to 127.0.0.1). Default: not set (verify TLS)
- `IMAGE_PULL_POLICY` -- Image pull policy for ClusterVirtualImages: `Always` or `IfNotExists`. Default: `IfNotExists`

### Logging
@@ -116,3 +154,11 @@ See [pkg/FUNCTIONS_GLOSSARY.md](pkg/FUNCTIONS_GLOSSARY.md) for a full list of al
- `STRESS_TEST_MAX_ATTEMPTS` -- Maximum attempts for waiting operations. Default: `360`
- `STRESS_TEST_INTERVAL` -- Interval between attempts in seconds. Default: `5`
- `STRESS_TEST_CLEANUP` -- Whether to cleanup resources after stress tests. Default: `true`

**sds-node-configurator max-VG stress** (`sds-node-configurator-stress-tests`):

- `STRESS_MAX_VG_TARGET` -- How many independent LVMVolumeGroups to attempt. Default: `30`
- `STRESS_MAX_VG_BATCH_SIZE` -- Ramp batch size. Default: `5`
- `STRESS_MAX_VG_DISK_SIZE` -- VirtualDisk size per slot. Default: `1Gi`
- `STRESS_MAX_VG_STRICT` -- If `true`, fail unless all targets are Ready. Default: `false` (probe)
- `STRESS_MAX_VG_MIN_READY` -- Minimum Ready count in probe mode. Default: `1`
36 changes: 19 additions & 17 deletions internal/cluster/cluster.go
@@ -183,7 +183,8 @@ func expandPath(path string) (string, error) {
 // and returns a rest.Config that can be used with Kubernetes clients, along with the path to the kubeconfig file.
 // If sshClient is provided, it will be used instead of creating a new connection.
 // If sshClient is nil, a new connection will be created and closed automatically.
-func GetKubeconfig(ctx context.Context, masterIP, user, keyPath string, sshClient ssh.SSHClient) (*rest.Config, string, error) {
+// If kubeconfigOutputDir is non-empty, the kubeconfig file is written there; otherwise temp/<caller-file-name>/ is used.
+func GetKubeconfig(ctx context.Context, masterIP, user, keyPath string, sshClient ssh.SSHClient, kubeconfigOutputDir string) (*rest.Config, string, error) {
 	// Create SSH client if not provided
 	shouldClose := false
 	if sshClient == nil {
@@ -198,23 +199,24 @@ func GetKubeconfig(ctx context.Context, masterIP, user, keyPath string, sshClien
 		defer sshClient.Close()
 	}

-	// Get the test file name from the caller
-	_, callerFile, _, ok := runtime.Caller(1)
-	if !ok {
-		return nil, "", fmt.Errorf("failed to get caller file information")
-	}
-	testFileName := strings.TrimSuffix(filepath.Base(callerFile), filepath.Ext(callerFile))
-
-	// Determine the temp directory path in the repo root
-	// callerFile is in tests/{test-dir}/, so we go up two levels to reach repo root
-	callerDir := filepath.Dir(callerFile)
-	repoRootPath := filepath.Join(callerDir, "..", "..")
-	// Resolve the .. parts to get absolute path
-	repoRoot, err := filepath.Abs(repoRootPath)
-	if err != nil {
-		return nil, "", fmt.Errorf("failed to resolve repo root path: %w", err)
+	var tempDir string
+	if kubeconfigOutputDir != "" {
+		tempDir = kubeconfigOutputDir
+	} else {
+		// Get the test file name from the caller (creates temp/cluster when called from pkg/cluster)
+		_, callerFile, _, ok := runtime.Caller(1)
+		if !ok {
+			return nil, "", fmt.Errorf("failed to get caller file information")
+		}
+		testFileName := strings.TrimSuffix(filepath.Base(callerFile), filepath.Ext(callerFile))
+		callerDir := filepath.Dir(callerFile)
+		repoRootPath := filepath.Join(callerDir, "..", "..")
+		repoRoot, err := filepath.Abs(repoRootPath)
+		if err != nil {
+			return nil, "", fmt.Errorf("failed to resolve repo root path: %w", err)
+		}
+		tempDir = filepath.Join(repoRoot, "temp", testFileName)
 	}
-	tempDir := filepath.Join(repoRoot, "temp", testFileName)

 	// Create temp directory if it doesn't exist
 	if err := os.MkdirAll(tempDir, 0755); err != nil {
2 changes: 1 addition & 1 deletion internal/config/config.go
@@ -53,7 +53,7 @@ const (
 	// Kubernetes operations
 	ModuleCheckTimeout = 10 * time.Second // Timeout for checking module status
 	NamespaceTimeout = 30 * time.Second // Timeout for creating namespace
-	NodeGroupTimeout = 3 * time.Second // Timeout for creating NodeGroup
+	NodeGroupTimeout = 2 * time.Minute // Timeout for creating NodeGroup (API can be slow right after bootstrap)
 	SecretsWaitTimeout = 2 * time.Minute // Timeout for waiting for bootstrap secrets to appear
 	ClusterHealthTimeout = 15 * time.Minute // Timeout for cluster health check
 	ModuleDeployTimeout = 15 * time.Minute // Timeout for waiting for ONE module to be ready
7 changes: 7 additions & 0 deletions internal/config/env.go
@@ -69,6 +69,8 @@ var (
 	// SSH credentials to deploy to VM
 	VMSSHUser = os.Getenv("SSH_VM_USER")
 	VMSSHUserDefaultValue = "cloud"
+	// VMSSHPassword when set is used to SSH from jump host to VMs (cloud@vmIP) via sshpass. Leave empty for key-based auth.
+	VMSSHPassword = os.Getenv("SSH_VM_PASSWORD")

 	// KubeConfigPath is the path to a kubeconfig file. If SSH retrieval fails (e.g., sudo requires password),
 	// this path will be used as a fallback. If not set and SSH fails, the user will be notified to download
@@ -87,6 +89,11 @@ var (
 	TestClusterNamespace = os.Getenv("TEST_CLUSTER_NAMESPACE")
 	TestClusterNamespaceDefaultValue = "e2e-test-cluster"

+	// TestClusterResume when set to "true" or "True" (only for alwaysCreateNew) tries to continue from a previous
+	// failed run: if state was saved after step 6 (VMs created, IPs gathered), connects to the first master and
+	// runs remaining steps (add nodes, enable modules). Set to "true" and re-run the test after a mid-deploy failure.
+	TestClusterResume = os.Getenv("TEST_CLUSTER_RESUME")
+
 	// TestClusterStorageClass specifies the storage class for DKP cluster deployment
 	TestClusterStorageClass = os.Getenv("TEST_CLUSTER_STORAGE_CLASS")
 	//TestClusterStorageClassDefaultValue = "rsc-test-r2-local"
25 changes: 25 additions & 0 deletions internal/kubernetes/storage/lvmvolumegroup.go
@@ -76,6 +76,31 @@ func (c *LVMVolumeGroupClient) Get(ctx context.Context, name string) (*snc.LVMVo
 	return &lvg, nil
 }

+// CreateWithMatchLabels creates an LVMVolumeGroup bound to block devices selected by label (typical: hostname + metadata.name).
+func (c *LVMVolumeGroupClient) CreateWithMatchLabels(ctx context.Context, name, nodeName, actualVGName string, matchLabels map[string]string) error {
+	lvg := &snc.LVMVolumeGroup{
+		TypeMeta: metav1.TypeMeta{
+			APIVersion: snc.SchemeGroupVersion.String(),
+			Kind: "LVMVolumeGroup",
+		},
+		ObjectMeta: metav1.ObjectMeta{Name: name},
+		Spec: snc.LVMVolumeGroupSpec{
+			Type: "Local",
+			Local: snc.LVMVolumeGroupLocalSpec{
+				NodeName: nodeName,
+			},
+			BlockDeviceSelector: &metav1.LabelSelector{
+				MatchLabels: matchLabels,
+			},
+			ActualVGNameOnTheNode: actualVGName,
+		},
+	}
+	if err := c.client.Create(ctx, lvg); err != nil {
+		return fmt.Errorf("failed to create LVMVolumeGroup %s: %w", name, err)
+	}
+	return nil
+}
+
 // Create creates a new LVMVolumeGroup for a specific node
 func (c *LVMVolumeGroupClient) Create(ctx context.Context, name, nodeName string, blockDeviceNames []string, actualVGName string) error {
 	lvg := &snc.LVMVolumeGroup{