Scaling

This guide covers configuring Aveloxis for different workload sizes, from a few dozen repos to hundreds of thousands.

Worker count recommendations

Workers are concurrent collection goroutines that each claim one repo at a time from the queue. The optimal worker count depends on how many API tokens you have.

Rule of thumb

1 worker per 2-3 API tokens.

Each worker makes sustained API calls for its repo, cycling through the available keys (tokens) in round-robin order. With too many workers relative to keys, workers frequently find the keys rate-limited and spend time waiting.

Tokens	Recommended Workers	Throughput
1	1	~4,985 req/hr
2-3	1	~9,970-14,955 req/hr
4-6	2	~19,940-29,910 req/hr
8-12	4	~39,880-59,820 req/hr
20-30	8	~99,700-149,550 req/hr
50-74	16-24	~249,250-368,890 req/hr

# Example: 8 tokens, 4 workers
aveloxis serve --workers 4 --monitor :5555
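
The rule of thumb can be turned into a small calculation. The sketch below is only an illustration of "1 worker per 2-3 API tokens"; the function name recommended_workers is hypothetical and is not part of Aveloxis. The table above rounds to convenient values, so treat the result as a starting range rather than an exact match.

```python
import math


def recommended_workers(tokens: int) -> tuple[int, int]:
    """Return a (low, high) worker range for a given number of API tokens.

    Hypothetical helper: applies the "1 worker per 2-3 tokens" rule of thumb.
    """
    if tokens < 1:
        raise ValueError("at least one API token is required")
    low = max(1, math.ceil(tokens / 3))  # 1 worker per 3 tokens
    high = max(low, tokens // 2)         # 1 worker per 2 tokens
    return low, high


if __name__ == "__main__":
    for n in (1, 3, 8, 20, 50):
        print(n, recommended_workers(n))
```

For 8 tokens this gives a range of 3 to 4 workers, which agrees with the --workers 4 example above.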

Too many workers

If you set workers higher than your token count can support:

Workers will frequently encounter rate-limited keys (remaining < 15)
Keys will be skipped until their reset window
Effective throughput may be lower than with fewer workers
No data loss or errors – just slower than optimal
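
The skipping behavior described above can be pictured with a minimal round-robin pool. This is a sketch under assumptions, not the Aveloxis implementation: the ApiKey and KeyPool names are hypothetical, and the only facts taken from this guide are the round-robin rotation and the threshold of 15 remaining requests.

```python
from dataclasses import dataclass
from typing import Optional

RATE_LIMIT_BUFFER = 15  # a key is skipped once fewer than this many requests remain


@dataclass
class ApiKey:
    name: str
    remaining: int  # requests left in the current rate-limit window


class KeyPool:
    """Hypothetical round-robin pool that skips rate-limited keys."""

    def __init__(self, keys: list[ApiKey]) -> None:
        self._keys = keys
        self._next = 0

    def next_key(self) -> Optional[ApiKey]:
        """Return the next usable key, or None if every key is rate-limited."""
        for _ in range(len(self._keys)):
            key = self._keys[self._next]
            self._next = (self._next + 1) % len(self._keys)
            if key.remaining >= RATE_LIMIT_BUFFER:
                return key
        return None  # caller would wait for a reset window
```

When next_key() returns None, every key is below the buffer and a worker has nothing to do but wait, which is the slowdown described in the list above.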

Too few workers

If you have many tokens but few workers:

Keys go underutilized (their rate limits are not fully consumed)
Collection is slower than it could be
Perfectly safe, just leaving throughput on the table

Rate limit math

GitHub provides 5,000 requests per hour per token. Aveloxis keeps a buffer of 15 requests per token (a key is skipped once fewer than 15 requests remain) to avoid hitting the hard limit.

Effective requests per token per hour = 5000 - 15 = 4985
Total throughput = N tokens * 4985 req/hr
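
As a quick check of the arithmetic, the two formulas above can be written as a short function. The constant and function names are illustrative only.

```python
GITHUB_LIMIT_PER_HOUR = 5000  # requests per hour per token
BUFFER_PER_TOKEN = 15         # requests held back per token


def total_throughput(tokens: int) -> int:
    """Effective requests per hour across all tokens."""
    return tokens * (GITHUB_LIMIT_PER_HOUR - BUFFER_PER_TOKEN)


if __name__ == "__main__":
    for n in (1, 8, 74):
        print(f"{n} tokens: {total_throughput(n):,} req/hr")
```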

Estimating collection time

A typical GitHub repo with moderate activity (~500 issues, ~200 PRs) requires approximately 2,000-5,000 API requests for full historical collection. Subsequent incremental collections require far fewer requests (only new/updated items).

Repos	Tokens	Full Collection Time (estimate)
100	4	~2-5 hours
1,000	10	~1-3 days
10,000	20	~1-2 weeks
100,000	50	~2-4 months
400,000	74	~6-12 months

These are rough estimates. Actual time depends on repo sizes, API response times, and the facade/analysis phases.
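
For a different repo count or token count, an API-bound figure can be worked out from the numbers in this guide. The sketch below is an assumption-laden lower bound: it counts only API requests at 4985 req/hr per token and assumes every repo needs 2,000-5,000 requests. It ignores the facade and analysis phases and real repo sizes, so it will not reproduce the table exactly. The function name is hypothetical.

```python
EFFECTIVE_REQ_PER_TOKEN_HOUR = 4985  # 5000 minus the 15-request buffer


def api_bound_hours(repos: int, tokens: int,
                    low_req: int = 2000, high_req: int = 5000) -> tuple[float, float]:
    """Return (low, high) hours needed for API requests alone."""
    per_hour = tokens * EFFECTIVE_REQ_PER_TOKEN_HOUR
    return repos * low_req / per_hour, repos * high_req / per_hour


if __name__ == "__main__":
    low, high = api_bound_hours(repos=10_000, tokens=20)
    print(f"{low / 24:.1f} to {high / 24:.1f} days")
```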

Horizontal scaling

Multiple aveloxis serve instances can share the same queue for horizontal scaling. The Postgres-backed queue uses SELECT ... FOR UPDATE SKIP LOCKED for atomic job claiming, so no two instances will collect the same repo simultaneously.
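
To show how SELECT ... FOR UPDATE SKIP LOCKED makes claiming atomic, here is a minimal sketch of a claim query. The table and column names (collection_queue, repo_id, status, priority, due_at, locked_at) are hypothetical and are not the Aveloxis schema; conn is assumed to be any Python DB-API connection to PostgreSQL.

```python
# Hypothetical schema: one row per repo in a table named collection_queue.
CLAIM_SQL = """
UPDATE collection_queue
   SET status = 'collecting', locked_at = now()
 WHERE repo_id = (
        SELECT repo_id
          FROM collection_queue
         WHERE status = 'queued' AND due_at <= now()
         ORDER BY priority, due_at
         LIMIT 1
           FOR UPDATE SKIP LOCKED   -- rows locked by other claimers are skipped
       )
RETURNING repo_id
"""


def claim_next_repo(conn):
    """Claim one due repo and return its id, or None if nothing is claimable."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()  # committing makes the status change visible to other instances
    finally:
        cur.close()
    return row[0] if row else None
```

Because a row locked by one transaction is skipped rather than waited on, two instances running this at the same moment receive different repos or, if only one row is due, one of them receives None.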

Setup

All instances must point to the same PostgreSQL database (same aveloxis.json database settings).
Each instance should have its own repo_clone_dir on local storage (bare clones are not shared).
Start each instance normally:

# Instance 1 (on server A)
aveloxis serve --workers 4 --monitor :5555

# Instance 2 (on server B)
aveloxis serve --workers 4 --monitor :5556

What is shared

Resource	Shared?	Notes
PostgreSQL database	Yes	All data and queue state
API tokens	Yes	All instances draw from the same token pool in `worker_oauth`
Bare clones	No	Each instance needs its own clone directory
Dashboard	No	Each instance serves its own dashboard

Considerations

API tokens are shared: All instances rotate through the same pool of tokens. The total throughput across all instances is still bounded by N tokens * 4985 req/hr.
Stale lock recovery: If an instance crashes, its locked jobs are automatically re-queued after 1 hour by any running instance.
Materialized view rebuild: Each instance triggers the Saturday rebuild independently, so the rebuild may run more than once. Because the refresh uses the CONCURRENTLY option, the repeated runs are safe.
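
The stale lock rule can be expressed as a simple time comparison. This is only a sketch of the 1-hour rule stated above; the function name and the idea of a locked_at timestamp are assumptions for illustration.

```python
from datetime import datetime, timedelta

STALE_LOCK_TIMEOUT = timedelta(hours=1)  # from the stale lock recovery rule above


def is_stale(locked_at: datetime, now: datetime) -> bool:
    """True if a job lock is old enough to be re-queued."""
    return now - locked_at >= STALE_LOCK_TIMEOUT
```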

Database connection pool

Aveloxis automatically scales the database connection pool based on the worker count. The formula is workers + 15, with a minimum of 20. For example, --workers 30 uses a pool of 45 connections. Non-scheduler commands (web, api, migrate) use the default pool of 20.
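
The pool formula is easy to check by hand or with a one-line function. The name pool_size is illustrative only.

```python
def pool_size(workers: int) -> int:
    """Connection pool size for a serve instance: workers + 15, minimum 20."""
    return max(workers + 15, 20)


if __name__ == "__main__":
    for w in (1, 4, 8, 30):
        print(f"--workers {w}: pool of {pool_size(w)}")
```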

PostgreSQL configuration

For multiple instances or high worker counts, ensure your PostgreSQL max_connections is sufficient:

max_connections = max(workers + 15, 20) * (number of Aveloxis instances) + connections for other clients

For example, 3 Aveloxis instances running --workers 4 each (a pool of 20 per instance), plus 10 connections reserved for psql and monitoring tools:

max_connections = 20 * 3 + 10 = 70

Adjust in postgresql.conf:

max_connections = 100
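
When instances run with different worker counts, the per-instance pools can be summed. A minimal sketch, assuming the pool formula from the previous section and a hypothetical allowance of 10 connections for other clients:

```python
def required_max_connections(workers_per_instance: list[int],
                             other_clients: int = 10) -> int:
    """Sum each instance's pool (workers + 15, minimum 20) plus other clients."""
    pools = [max(w + 15, 20) for w in workers_per_instance]
    return sum(pools) + other_clients


if __name__ == "__main__":
    # Three instances with 4 workers each, as in the horizontal scaling example.
    print(required_max_connections([4, 4, 4]))
```

Setting max_connections somewhat above the computed value, as in the postgresql.conf line above, leaves headroom for ad hoc clients.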

Shared buffers

For large datasets (millions of rows across tables), increase shared_buffers and the related PostgreSQL memory settings. The example values below assume a server with 16 GB of RAM:

shared_buffers = 4GB          # 25% of available RAM
effective_cache_size = 12GB   # 75% of available RAM
work_mem = 256MB              # for complex queries and matview refreshes
maintenance_work_mem = 1GB    # for VACUUM and index creation
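
The two percentage-based settings scale with the server's RAM. The sketch below applies the 25% and 75% ratios from the comments above; the function name is hypothetical, and work_mem and maintenance_work_mem are left out because this guide gives no ratio for them.

```python
def suggest_memory_settings(total_ram_gb: int) -> dict[str, str]:
    """Derive shared_buffers (25% of RAM) and effective_cache_size (75% of RAM)."""
    total_mb = total_ram_gb * 1024
    return {
        "shared_buffers": f"{total_mb // 4}MB",
        "effective_cache_size": f"{total_mb * 3 // 4}MB",
    }


if __name__ == "__main__":
    for name, value in suggest_memory_settings(16).items():
        print(f"{name} = {value}")
```

For 16 GB this yields 4096MB and 12288MB, the same amounts as the 4GB and 12GB values shown above.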

Clone directory sizing

The collection.repo_clone_dir stores bare git clones that persist across collection cycles.

Sizing estimates

Repos	Estimated Disk Usage
100	5-50 GB
1,000	50-500 GB
10,000	500 GB - 5 TB
100,000	5-50 TB
400,000	20-100+ TB

Sizes vary enormously depending on repo history sizes. Large repos like torvalds/linux can be 5+ GB as a bare clone, while small repos are under 1 MB.

Recommendations

Use SSD or NVMe storage for best performance. The facade phase reads git history heavily.
Use a dedicated mount point so clone storage does not fill up your root filesystem.
NFS works but may slow the facade phase due to latency on small random reads during git log.
Full clones (temporary, used for analysis) are created inside the clone directory and deleted after each repo. While a full clone exists, that repo uses roughly twice the disk space of its bare clone.

{
  "collection": {
    "repo_clone_dir": "/data/aveloxis-repos"
  }
}
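
To budget for the temporary full clones, one rough approach is to add headroom for the largest repos being analyzed at once. The sketch below assumes that each worker holds at most one temporary full clone at a time and that a full clone is about the size of its bare clone; both are assumptions drawn from the descriptions in this guide, and the function name is hypothetical.

```python
def peak_clone_disk_gb(bare_clone_sizes_gb: list[float], workers: int) -> float:
    """Worst-case disk use: all bare clones plus full clones of the largest repos."""
    steady = sum(bare_clone_sizes_gb)
    # Worst case: every worker is analyzing one of the largest repos at once.
    largest = sorted(bare_clone_sizes_gb, reverse=True)[:workers]
    return steady + sum(largest)


if __name__ == "__main__":
    print(peak_clone_disk_gb([5.0, 1.0, 0.5], workers=2))
```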

Queue behavior

Many repos, few workers

When the queue has thousands of repos and only a few workers, repos are collected in priority order. Lower priority numbers are collected first. Repos at the same priority are collected in due-time order (oldest first).
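
The ordering rule amounts to a two-part sort key. A minimal sketch, with a hypothetical QueuedRepo record standing in for a queue row:

```python
from datetime import datetime
from typing import NamedTuple


class QueuedRepo(NamedTuple):
    url: str
    priority: int     # lower numbers are collected first
    due_at: datetime  # ties are broken by oldest due time


def collection_order(queue: list[QueuedRepo]) -> list[QueuedRepo]:
    """Sort by priority, then by due time (oldest first)."""
    return sorted(queue, key=lambda r: (r.priority, r.due_at))
```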

Few repos, many workers

When the queue has fewer repos than workers, excess workers sit idle waiting for repos to become due for recollection (based on days_until_recollect).

Priority override

At any time, you can push a specific repo to the front:

aveloxis prioritize https://github.com/critical/repo

You can also use the dashboard’s Boost button or the REST API:

curl -X POST http://localhost:5555/api/prioritize/42
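
The same request can be sent from a script. The sketch below mirrors the curl command using Python's standard library; it assumes the trailing path segment (42 in the example) identifies the repo, and the helper name is hypothetical.

```python
import urllib.request


def build_prioritize_request(base_url: str, repo_id: int) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    url = f"{base_url.rstrip('/')}/api/prioritize/{repo_id}"
    return urllib.request.Request(url, method="POST")


if __name__ == "__main__":
    req = build_prioritize_request("http://localhost:5555", 42)
    # Sending requires a running instance with the monitor on port 5555.
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
```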

Sizing summary

Component	Small (100 repos)	Medium (10K repos)	Large (400K repos)
Tokens	1-2	10-20	50-74+
Workers	1	4-8	16-24
Clone disk	50 GB	5 TB	50+ TB
DB connections	20	20-23	60 (3 instances)
PostgreSQL RAM	2 GB	8 GB	32+ GB

Next steps

Configuration – set collection parameters
Monitoring – track collection progress
Troubleshooting – diagnose performance issues