# Scaling

This guide covers configuring Aveloxis for different workload sizes, from a few dozen repos to hundreds of thousands.

---

## Worker count recommendations

Workers are concurrent collection goroutines that each claim one repo at a time from the queue. The optimal worker count depends on how many API tokens you have.

### Rule of thumb

**1 worker per 2-3 API tokens.**

Each worker makes sustained API calls for its repo. With round-robin key rotation, each worker cycles through the available keys. Too many workers relative to keys means workers frequently hit rate limits and spend time waiting.

| Tokens | Recommended Workers | Maximum Throughput |
|---|---|---|
| 1 | 1 | ~4,985 req/hr |
| 2-3 | 1 | ~9,970-14,955 req/hr |
| 4-6 | 2 | ~19,940-29,910 req/hr |
| 8-12 | 4 | ~39,880-59,820 req/hr |
| 20-30 | 8 | ~99,700-149,550 req/hr |
| 50-74 | 16-24 | ~249,250-368,890 req/hr |

```bash
# Example: 8 tokens, 4 workers
aveloxis serve --workers 4 --monitor :5555
```
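The rule of thumb can be turned into a small calculation. The sketch below is a literal reading of "1 worker per 2-3 tokens" and returns a range rather than a single number; `recommended_workers` is a hypothetical helper, not part of Aveloxis. The table above is somewhat more conservative for larger token counts, so treat the upper end of the range as a ceiling.

```python
def recommended_workers(tokens: int) -> tuple[int, int]:
    """Return a (low, high) worker range for a given number of API tokens."""
    if tokens < 1:
        raise ValueError("at least one API token is required")
    low = max(1, tokens // 3)   # 1 worker per 3 tokens
    high = max(1, tokens // 2)  # 1 worker per 2 tokens
    return low, high


for tokens in (1, 3, 8, 20, 74):
    low, high = recommended_workers(tokens)
    print(f"{tokens:>3} tokens -> {low}-{high} workers")
```

Starting at the low end and raising `--workers` only while throughput keeps improving is probably the safer way to apply the result.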

### Too many workers

If you set workers higher than your token count can support:

- Workers will frequently encounter rate-limited keys (remaining < 15)
- Keys will be skipped until their reset window
- Effective throughput may be lower than with fewer workers
- No data loss or errors -- just slower than optimal

### Too few workers

If you have many tokens but few workers:

- Keys go underutilized (their rate limits are not fully consumed)
- Collection is slower than it could be
- Perfectly safe, just leaving throughput on the table

---

## Rate limit math

GitHub provides 5000 requests per hour per token. Aveloxis uses a buffer of 15 requests per token to avoid hitting the hard limit.

```
Effective requests per token per hour = 5000 - 15 = 4985
Total throughput = N tokens * 4985 req/hr
```
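The same arithmetic as a short Python sketch, useful for checking the throughput figures in this guide. The constant and function names are illustrative only.

```python
GITHUB_HOURLY_LIMIT = 5000  # requests per hour per token
SAFETY_BUFFER = 15          # requests held back per token


def effective_throughput(tokens: int) -> int:
    """Upper bound on requests per hour across all tokens."""
    return tokens * (GITHUB_HOURLY_LIMIT - SAFETY_BUFFER)


for tokens in (1, 8, 74):
    print(f"{tokens} tokens: {effective_throughput(tokens):,} req/hr")
```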

### Estimating collection time

A typical GitHub repo with moderate activity (~500 issues, ~200 PRs) requires approximately 2000-5000 API requests for full historical collection. Subsequent incremental collections require far fewer requests (only new/updated items).

| Repos | Tokens | Full Collection Time (estimate) |
|---|---|---|
| 100 | 4 | ~10-25 hours |
| 1,000 | 10 | ~1-3 days |
| 10,000 | 20 | ~1-2 weeks |
| 100,000 | 50 | ~2-4 months |
| 400,000 | 74 | ~6-12 months |

These are rough estimates. Actual time depends on repo sizes, API response times, and the facade/analysis phases.
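For a quick API-bound estimate, divide the total request count by the token throughput. The sketch below assumes the 2000-5000 requests-per-repo range quoted above and ignores the facade/analysis phases, so it gives a lower bound rather than a prediction. `estimate_collection_hours` is a hypothetical helper.

```python
EFFECTIVE_REQUESTS_PER_TOKEN = 5000 - 15  # per hour, after the safety buffer


def estimate_collection_hours(repos, tokens, requests_per_repo=(2000, 5000)):
    """Return (low, high) hours for a full historical collection, API time only."""
    throughput = tokens * EFFECTIVE_REQUESTS_PER_TOKEN
    low, high = requests_per_repo
    return repos * low / throughput, repos * high / throughput


low_h, high_h = estimate_collection_hours(repos=10_000, tokens=20)
print(f"{low_h / 24:.1f} to {high_h / 24:.1f} days")
```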

---

## Horizontal scaling

Multiple `aveloxis serve` instances can share the same queue for horizontal scaling. The Postgres-backed queue uses `SELECT ... FOR UPDATE SKIP LOCKED` for atomic job claiming, so no two instances will collect the same repo simultaneously.
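To illustrate the claiming pattern, here is a minimal sketch of a `SKIP LOCKED` claim query driven from Python. The table and column names (`collection_queue`, `locked_by`, `due_at`, and so on) are hypothetical and are not Aveloxis's actual schema; the cursor is assumed to be any DB-API 2.0 cursor that accepts `%(name)s` parameters, such as one from psycopg.

```python
# Hypothetical schema: one row per repo with a priority, a due time, and lock columns.
CLAIM_SQL = """
UPDATE collection_queue
   SET locked_by = %(instance)s, locked_at = now()
 WHERE repo_id = (
        SELECT repo_id
          FROM collection_queue
         WHERE locked_by IS NULL
           AND due_at <= now()
         ORDER BY priority, due_at
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING repo_id
"""


def claim_next_repo(cursor, instance: str):
    """Claim one due repo for `instance`, or return None if nothing is available."""
    cursor.execute(CLAIM_SQL, {"instance": instance})
    row = cursor.fetchone()
    return row[0] if row else None
```

Rows already locked by another transaction are skipped instead of waited on, which is why concurrent instances never block each other or claim the same repo.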

### Setup

1. All instances must point to the same PostgreSQL database (same `aveloxis.json` database settings).
2. Each instance should have its own `repo_clone_dir` on local storage (bare clones are not shared).
3. Start each instance normally:

```bash
# Instance 1 (on server A)
aveloxis serve --workers 4 --monitor :5555

# Instance 2 (on server B)
aveloxis serve --workers 4 --monitor :5556
```

### What is shared

| Resource | Shared? | Notes |
|---|---|---|
| PostgreSQL database | Yes | All data and queue state |
| API tokens | Yes | All instances draw from the same token pool in `worker_oauth` |
| Bare clones | No | Each instance needs its own clone directory |
| Dashboard | No | Each instance serves its own dashboard |

### Considerations

- **API tokens are shared:** All instances rotate through the same pool of tokens. The total throughput across all instances is still bounded by `N tokens * 4985 req/hr`.
- **Stale lock recovery:** If an instance crashes, its locked jobs are automatically re-queued after 1 hour by any running instance.
- **Materialized view rebuild:** Each instance triggers the Saturday rebuild independently. Because the views are refreshed with the `CONCURRENTLY` option, overlapping rebuilds are safe, but the rebuild may run once per instance.

---

## Database connection pool

Aveloxis automatically scales the database connection pool based on the worker count. The formula is `workers + 15`, with a minimum of 20. For example, `--workers 30` uses a pool of 45 connections. Non-scheduler commands (web, api, migrate) use the default pool of 20.

### PostgreSQL configuration

For multiple instances or high worker counts, ensure your PostgreSQL `max_connections` is sufficient:

```
max_connections = max(workers + 15, 20) * (number of Aveloxis instances) + connections for other clients
```

For example, 3 Aveloxis instances running 5 or fewer workers each (a pool of 20 per instance), plus 10 connections for psql and monitoring tools:

```
max_connections = 20 * 3 + 10 = 70
```

Adjust in `postgresql.conf`:

```ini
max_connections = 100
```
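When instances run different worker counts, it can help to compute the requirement per instance. The sketch below applies the pool formula (`workers + 15`, minimum 20) to each instance and adds headroom for other clients. The function names and the default of 10 extra connections are assumptions for illustration.

```python
def pool_size(workers: int) -> int:
    """Connection pool size for one `aveloxis serve` instance."""
    return max(workers + 15, 20)


def required_max_connections(workers_per_instance, other_clients=10):
    """Minimum PostgreSQL max_connections for the given instances."""
    return sum(pool_size(w) for w in workers_per_instance) + other_clients


print(required_max_connections([4, 4, 4]))  # three small instances
print(required_max_connections([30]))       # one large instance
```

Non-scheduler commands such as `web` or `api` use the default pool of 20, so count each of those processes under `other_clients` as well.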

### Shared buffers

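For large datasets (millions of rows across tables), raise PostgreSQL's memory settings. The values below assume a dedicated database host with 16 GB of RAM; scale them to your hardware. Note that `work_mem` applies to each sort or hash operation in each connection, so lower it if many connections run complex queries at once.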

```ini
shared_buffers = 4GB          # 25% of available RAM
effective_cache_size = 12GB   # 75% of available RAM
work_mem = 256MB              # for complex queries and matview refreshes
maintenance_work_mem = 1GB    # for VACUUM and index creation
```
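The percentage guidelines in the comments above can be applied to other host sizes. This is a minimal sketch that covers only `shared_buffers` and `effective_cache_size`, since the source gives no scaling rule for `work_mem` or `maintenance_work_mem`. `suggest_memory_settings` is a hypothetical helper.

```python
def suggest_memory_settings(ram_gb: int) -> dict[str, str]:
    """Apply the 25% / 75% guidelines to a host with `ram_gb` GB of RAM."""
    if ram_gb < 4:
        raise ValueError("this sketch assumes at least 4 GB of RAM")
    return {
        "shared_buffers": f"{ram_gb // 4}GB",            # 25% of RAM
        "effective_cache_size": f"{ram_gb * 3 // 4}GB",  # 75% of RAM
    }


for name, value in suggest_memory_settings(16).items():
    print(f"{name} = {value}")
```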

---

## Clone directory sizing

The `collection.repo_clone_dir` directory stores bare git clones that persist across collection cycles.

### Sizing estimates

| Repos | Estimated Disk Usage |
|---|---|
| 100 | 5-50 GB |
| 1,000 | 50-500 GB |
| 10,000 | 500 GB - 5 TB |
| 100,000 | 5-50 TB |
| 400,000 | 20-100+ TB |

Sizes vary enormously depending on repo history sizes. Large repos like `torvalds/linux` can be 5+ GB as a bare clone, while small repos are under 1 MB.
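For repo counts between the table rows, a rough estimate can be scaled linearly. The sketch below assumes an average bare clone of 50-500 MB, which is the range implied by the first rows of the table (5-50 GB for 100 repos). It is not a measured figure, and a handful of very large repos can push real usage well past it.

```python
def estimate_clone_disk_gb(repos: int, avg_mb=(50, 500)):
    """Return (low, high) disk usage in GB for `repos` bare clones."""
    low_mb, high_mb = avg_mb
    return repos * low_mb / 1000, repos * high_mb / 1000


low_gb, high_gb = estimate_clone_disk_gb(2_500)
print(f"{low_gb:.0f}-{high_gb:.0f} GB")
```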

### Recommendations

- Use **SSD or NVMe** storage for best performance. The facade phase reads git history heavily, and low-latency storage keeps it fast.
- Use a **dedicated mount point** so clone storage does not fill up your root filesystem.
- **NFS** works but may slow the facade phase due to latency on small random reads during `git log`.
- **Full clones** (temporary, used for analysis) are created inside the clone directory and deleted after each repo. They roughly double the disk usage of a bare clone temporarily.

```json
{
  "collection": {
    "repo_clone_dir": "/data/aveloxis-repos"
  }
}
```

---

## Queue behavior

### Many repos, few workers

When the queue has thousands of repos and only a few workers, repos are collected in priority order. Lower priority numbers are collected first. Repos at the same priority are collected in due-time order (oldest first).
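The ordering described above amounts to a two-key sort: priority first, then due time. The sketch below models it in memory with a hypothetical `QueuedRepo` record; the field names are illustrative and do not reflect Aveloxis's schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QueuedRepo:
    url: str
    priority: int      # lower numbers are collected first
    due_at: datetime   # older due times win within a priority


def collection_order(queue):
    """Return repos in the order workers would claim them."""
    return sorted(queue, key=lambda r: (r.priority, r.due_at))


queue = [
    QueuedRepo("https://github.com/example/b", 100, datetime(2024, 1, 2)),
    QueuedRepo("https://github.com/example/a", 100, datetime(2024, 1, 1)),
    QueuedRepo("https://github.com/example/c", 0, datetime(2024, 1, 3)),
]
for repo in collection_order(queue):
    print(repo.priority, repo.due_at.date(), repo.url)
```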

### Few repos, many workers

When the queue has fewer repos than workers, excess workers sit idle waiting for repos to become due for recollection (based on `days_until_recollect`).

### Priority override

At any time, you can push a specific repo to the front:

```bash
aveloxis prioritize https://github.com/critical/repo
```

You can also use the dashboard's Boost button or the REST API:

```bash
curl -X POST http://localhost:5555/api/prioritize/42
```

---

## Sizing summary

| Component | Small (100 repos) | Medium (10K repos) | Large (400K repos) |
|---|---|---|---|
| Tokens | 1-2 | 10-20 | 50-74+ |
| Workers | 1 | 4-8 | 16-24 |
| Clone disk | 50 GB | 5 TB | 50+ TB |
| DB connections | 20 | 20-23 | ~60-70 (3 instances) |
| PostgreSQL RAM | 2 GB | 8 GB | 32+ GB |

---

## Next steps

- [Configuration](../getting-started/configuration.md) -- set collection parameters
- [Monitoring](monitoring.md) -- track collection progress
- [Troubleshooting](troubleshooting.md) -- diagnose performance issues