Scaling
This guide covers configuring Aveloxis for different workload sizes, from a few dozen repos to hundreds of thousands.
Worker count recommendations
Workers are concurrent collection goroutines that each claim one repo at a time from the queue. The optimal worker count depends on how many API tokens you have.
Rule of thumb
1 worker per 2-3 API tokens.
Each worker makes sustained API calls for its repo. With round-robin key rotation, each worker cycles through the available keys. Too many workers relative to keys means workers frequently hit rate limits and spend time waiting.
| Tokens | Recommended Workers | Throughput |
|---|---|---|
| 1 | 1 | ~4,985 req/hr |
| 2-3 | 1 | ~9,970-14,955 req/hr |
| 4-6 | 2 | ~19,940-29,910 req/hr |
| 8-12 | 4 | ~39,880-59,820 req/hr |
| 20-30 | 8 | ~99,700-149,550 req/hr |
| 50-74 | 16-24 | ~249,250-368,890 req/hr |
# Example: 8 tokens, 4 workers
aveloxis serve --workers 4 --monitor :5555
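The rule of thumb can be turned into a small helper. This is a sketch, not part of Aveloxis: the function name is hypothetical, and it assumes "1 worker per 2-3 tokens" means dividing the token count by 3 for a conservative value and by 2 for an aggressive one, with a floor of one worker.

```python
def recommended_workers(tokens: int) -> tuple[int, int]:
    """Return a (low, high) worker range for a given token count.

    Assumption: "1 worker per 2-3 tokens" is read as tokens // 3 at the
    low end and tokens // 2 at the high end, never fewer than 1 worker.
    """
    if tokens < 1:
        raise ValueError("at least one API token is required")
    low = max(1, tokens // 3)
    high = max(1, tokens // 2)
    return low, high


for tokens in (1, 3, 8, 24):
    low, high = recommended_workers(tokens)
    print(f"{tokens} tokens -> {low}-{high} workers")
```

Treat the result as a starting point and adjust after watching how often workers wait on rate limits.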
Too many workers
If you set workers higher than your token count can support:
Workers will frequently encounter rate-limited keys (remaining < 15)
Keys will be skipped until their reset window
Effective throughput may be lower than with fewer workers
No data loss or errors – just slower than optimal
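To see why extra workers stop helping, consider a minimal model of round-robin key rotation with the 15-request buffer. This is an illustration only, not Aveloxis's implementation: the ApiKey and KeyPool names are hypothetical, and reset windows are left out.

```python
from dataclasses import dataclass

RATE_LIMIT_BUFFER = 15  # from the guide: keys with remaining < 15 are skipped


@dataclass
class ApiKey:
    name: str        # hypothetical label for the token
    remaining: int   # requests left in the current rate-limit window


class KeyPool:
    """Round-robin pool that skips keys below the buffer."""

    def __init__(self, keys):
        self.keys = list(keys)
        self._next = 0

    def acquire(self):
        """Return the next usable key, or None if every key is exhausted."""
        for _ in range(len(self.keys)):
            key = self.keys[self._next]
            self._next = (self._next + 1) % len(self.keys)
            if key.remaining >= RATE_LIMIT_BUFFER:
                key.remaining -= 1  # account for the request about to be made
                return key
        return None  # a caller would wait here for a reset window


pool = KeyPool([ApiKey("a", 16), ApiKey("b", 10)])
picked = [pool.acquire() for _ in range(3)]
print([k.name if k else None for k in picked])  # ['a', 'a', None]
```

Once every key is below the buffer, each additional worker only adds another caller that receives None and waits.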
Too few workers
If you have many tokens but few workers:
Keys go underutilized (their rate limits are not fully consumed)
Collection is slower than it could be
Perfectly safe, just leaving throughput on the table
Rate limit math
GitHub provides 5000 requests per hour per token. Aveloxis uses a buffer of 15 requests per token to avoid hitting the hard limit.
Effective requests per token per hour = 5000 - 15 = 4985
Total throughput = N tokens * 4985 req/hr
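The same arithmetic as a short Python sketch (the function and constant names are illustrative):

```python
GITHUB_HOURLY_LIMIT = 5000  # requests per hour per token
BUFFER = 15                 # requests held back per token


def hourly_throughput(tokens: int) -> int:
    """Total usable requests per hour across all tokens."""
    return tokens * (GITHUB_HOURLY_LIMIT - BUFFER)


print(hourly_throughput(8))  # 39880
```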
Estimating collection time
A typical GitHub repo with moderate activity (~500 issues, ~200 PRs) requires approximately 2000-5000 API requests for full historical collection. Subsequent incremental collections require far fewer requests (only new/updated items).
| Repos | Tokens | Full Collection Time (estimate) |
|---|---|---|
| 100 | 4 | ~2-5 hours |
| 1,000 | 10 | ~1-3 days |
| 10,000 | 20 | ~1-2 weeks |
| 100,000 | 50 | ~2-4 months |
| 400,000 | 74 | ~6-12 months |
These are rough estimates. Actual time depends on repo sizes, API response times, and the facade/analysis phases.
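For a quick estimate of your own, you can divide the expected request count by the hourly throughput. The sketch below covers API requests only and assumes every repo needs the 2000-5000 requests quoted above. The function name is hypothetical, and the result will not match the table exactly because real repo mixes and the other phases differ.

```python
def api_bound_hours(repos: int, tokens: int, requests_per_repo: int) -> float:
    """Hours needed if collection were limited only by API rate limits."""
    effective_per_token = 5000 - 15  # hourly limit minus the buffer
    return repos * requests_per_repo / (tokens * effective_per_token)


low = api_bound_hours(1000, 10, 2000)
high = api_bound_hours(1000, 10, 5000)
print(f"{low / 24:.1f}-{high / 24:.1f} days")  # 1.7-4.2 days
```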
Horizontal scaling
Multiple aveloxis serve instances can share the same queue for horizontal scaling. The Postgres-backed queue uses SELECT ... FOR UPDATE SKIP LOCKED for atomic job claiming, so no two instances will collect the same repo simultaneously.
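The claiming pattern looks roughly like the following sketch. The table and column names (collection_queue, status, priority, due_at, locked_at) are hypothetical and are not taken from the Aveloxis schema. The function accepts any DB-API 2.0 connection to PostgreSQL.

```python
# Hypothetical schema: none of these table or column names come from Aveloxis.
CLAIM_SQL = """
UPDATE collection_queue
SET status = 'collecting', locked_at = now()
WHERE repo_id = (
    SELECT repo_id
    FROM collection_queue
    WHERE status = 'queued' AND due_at <= now()
    ORDER BY priority, due_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING repo_id
"""


def claim_next_repo(conn):
    """Claim one due repo, or return None when nothing is claimable.

    Rows locked by another instance are skipped rather than waited on,
    so two callers never receive the same repo_id.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
    finally:
        cur.close()
    return row[0] if row else None
```

Because SKIP LOCKED passes over rows that another transaction has locked, instances do not block each other while claiming work.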
Setup
All instances must point to the same PostgreSQL database (same aveloxis.json database settings).
Each instance should have its own repo_clone_dir on local storage (bare clones are not shared).
Start each instance normally:
# Instance 1 (on server A)
aveloxis serve --workers 4 --monitor :5555
# Instance 2 (on server B)
aveloxis serve --workers 4 --monitor :5556
Considerations
API tokens are shared: All instances rotate through the same pool of tokens. The total throughput across all instances is still bounded by N tokens * 4985 req/hr.
Stale lock recovery: If an instance crashes, its locked jobs are automatically re-queued after 1 hour by any running instance.
Materialized view rebuild: The Saturday rebuild is triggered by each instance independently. The CONCURRENTLY option ensures this is safe, though the rebuild may run multiple times.
Database connection pool
Aveloxis automatically scales the database connection pool based on the worker count. The formula is workers + 15, with a minimum of 20. For example, --workers 30 uses a pool of 45 connections. Non-scheduler commands (web, api, migrate) use the default pool of 20.
PostgreSQL configuration
For multiple instances or high worker counts, ensure your PostgreSQL max_connections is sufficient:
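max_connections = (pool size per instance) * (number of Aveloxis instances) + connections for other clients
Here the pool size per instance is workers + 15, with a minimum of 20.
For example, 3 Aveloxis instances with 5 or fewer workers each (a pool of 20 per instance), plus 10 connections reserved for psql and monitoring tools:
max_connections = 20 * 3 + 10 = 70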
Adjust in postgresql.conf:
max_connections = 100
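The pool formula and the max_connections calculation can be checked with a few lines of Python. The function names are illustrative, and the default of 10 connections for other clients is an assumption taken from the example above.

```python
def pool_size(workers: int) -> int:
    """Connection pool for one instance: workers + 15, minimum 20."""
    return max(workers + 15, 20)


def required_max_connections(workers_per_instance, other_clients=10):
    """Sum the pools of all instances and add headroom for other clients."""
    return sum(pool_size(w) for w in workers_per_instance) + other_clients


print(pool_size(30))                        # 45
print(required_max_connections([4, 4, 4]))  # 70
```

Passing a list of worker counts lets instances of different sizes share one calculation.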
Clone directory sizing
The collection.repo_clone_dir stores bare git clones that persist across collection cycles.
Sizing estimates
| Repos | Estimated Disk Usage |
|---|---|
| 100 | 5-50 GB |
| 1,000 | 50-500 GB |
| 10,000 | 500 GB - 5 TB |
| 100,000 | 5-50 TB |
| 400,000 | 20-100+ TB |
Sizes vary enormously with the length and content of each repo's history. Large repos like torvalds/linux can be 5+ GB as a bare clone, while small repos are under 1 MB.
Recommendations
Use SSD or NVMe storage for best performance. The facade phase does heavy sequential reads of git history.
Use a dedicated mount point so clone storage does not fill up your root filesystem.
NFS works but may slow the facade phase due to latency on small random reads during git log.
Full clones (temporary, used for analysis) are created inside the clone directory and deleted after each repo. They roughly double the disk usage of a bare clone temporarily.
{
  "collection": {
    "repo_clone_dir": "/data/aveloxis-repos"
  }
}
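Before a large collection run, it can help to check how much space is free on the clone mount. The sketch below is a hypothetical helper, not an Aveloxis command. It assumes the config file is JSON with the collection.repo_clone_dir key shown above, and it uses Python's standard shutil.disk_usage.

```python
import json
import shutil


def clone_dir_free_gb(config_path: str) -> float:
    """Return free space, in GB, on the filesystem holding repo_clone_dir."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    clone_dir = config["collection"]["repo_clone_dir"]
    usage = shutil.disk_usage(clone_dir)  # named tuple: total, used, free (bytes)
    return usage.free / 1e9
```

For example, clone_dir_free_gb("aveloxis.json") can be compared against the sizing table for your repo count.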
Queue behavior
Many repos, few workers
When the queue has thousands of repos and only a few workers, repos are collected in priority order. Lower priority numbers are collected first. Repos at the same priority are collected in due-time order (oldest first).
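The ordering rule amounts to a two-key sort. The sketch below uses a hypothetical QueuedRepo record with illustrative field names. It demonstrates the rule only and is not the scheduler's code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QueuedRepo:
    url: str
    priority: int      # lower numbers are collected first
    due_at: datetime   # ties are broken by the oldest due time


def collection_order(queue):
    """Sort by priority, then by due time (oldest first)."""
    return sorted(queue, key=lambda r: (r.priority, r.due_at))


queue = [
    QueuedRepo("https://github.com/example/b", 100, datetime(2024, 1, 2)),
    QueuedRepo("https://github.com/example/a", 100, datetime(2024, 1, 1)),
    QueuedRepo("https://github.com/example/c", 0, datetime(2024, 1, 3)),
]
print([r.url.rsplit("/", 1)[-1] for r in collection_order(queue)])  # ['c', 'a', 'b']
```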
Few repos, many workers
When the queue has fewer repos than workers, excess workers sit idle waiting for repos to become due for recollection (based on days_until_recollect).
Priority override
At any time, you can push a specific repo to the front:
aveloxis prioritize https://github.com/critical/repo
You can also use the dashboard’s Boost button or the REST API:
curl -X POST http://localhost:5555/api/prioritize/42
Sizing summary
| Component | Small (100 repos) | Medium (10K repos) | Large (400K repos) |
|---|---|---|---|
| Tokens | 1-2 | 10-20 | 50-74+ |
| Workers | 1 | 4-8 | 16-24 |
| Clone disk | 50 GB | 5 TB | 50+ TB |
| DB connections | 20 | 20 | 60 (3 instances) |
| PostgreSQL RAM | 2 GB | 8 GB | 32+ GB |
Next steps
Configuration – set collection parameters
Monitoring – track collection progress
Troubleshooting – diagnose performance issues