Knowledge Sync¶

OSA includes a knowledge discovery system that syncs community-specific content from multiple sources. Each community gets its own SQLite Full-Text Search 5 (FTS5) database at data/knowledge/{community_id}.db, populated by sync commands.

Overview¶

The knowledge system supports seven sync types:

Sync Type	Source	Description
GitHub	GitHub REST API	Issues and Pull Requests (PRs) from community repositories
Papers	OpenALEX, Semantic Scholar, PubMed	Academic papers and citation tracking
Docstrings	GitHub repos	MATLAB/Python function documentation
Mailman	Mailman archives	Mailing list messages
FAQ	Large Language Model (LLM) summarization	Frequently Asked Questions (FAQ) entries generated from mailing list threads
Discourse	Discourse public JSON API	Forum topics with first post and accepted answers
BEPs	bids-website + GitHub PRs	BIDS Extension Proposals (BIDS community only)

Discovery, Not Answers

The knowledge system is for discovery only. The assistant links users to relevant discussions ("There's a related issue: [link]") rather than answering from them.

Quick Start¶

# Initialize database for a community
uv run osa sync init --community hed

# Initialize databases for all communities
uv run osa sync init

# Sync GitHub issues/PRs
uv run osa sync github --community hed

# Sync academic papers (includes citation tracking)
uv run osa sync papers --community bids

# Sync everything for a community
uv run osa sync all --community eeglab

# Sync everything for all communities
uv run osa sync all

# Check sync status
uv run osa sync status

CLI Commands¶

All sync commands accept --community/-c to specify the target community. Most default to hed if omitted.

`osa sync init`¶

Initialize the knowledge database with FTS5 full-text search support.

# Initialize for a specific community
uv run osa sync init --community bids

# Initialize for all communities
uv run osa sync init

`osa sync github`¶

Sync GitHub issues and PRs from community repositories.

# Sync all repos for a community
uv run osa sync github --community hed

# Sync a specific repo
uv run osa sync github --community hed -r hed-standard/hed-specification

# Full sync (not incremental)
uv run osa sync github --community bids --full

Options:

Option	Description
`-c, --community`	Community ID (default: `hed`)
`-r, --repo`	Specific repository to sync
`--full`	Full sync instead of incremental

Repositories are configured per-community in config.yaml under github.repos.

`osa sync papers`¶

Sync academic papers from multiple sources.

# Sync papers for a community (includes citations by default)
uv run osa sync papers --community bids

# Sync from a specific source
uv run osa sync papers --community hed -s openalex

# Custom search query
uv run osa sync papers --community hed -q "event annotation EEG"

# Disable citation tracking
uv run osa sync papers --community hed --no-citations

Options:

Option	Description
`-c, --community`	Community ID (default: `hed`)
`-s, --source`	Paper source: `openalex`, `semanticscholar`, `pubmed`
`-q, --query`	Custom search query (overrides community config)
`-l, --limit`	Max papers per query (default: `100`)
`--citations / --no-citations`	Sync papers citing community DOIs (default: enabled)

Citation Tracking

When --citations is enabled (the default), OSA also syncs papers that cite the DOIs listed in the community's citations.dois config. This automatically discovers new research that builds on the community's core publications.

Automatic Deduplication

Papers are deduplicated using fuzzy title matching. The same paper from different sources is only shown once in search results.

`osa sync docstrings`¶

Sync code docstrings from GitHub repositories. Extracts documentation from MATLAB (.m) or Python (.py) files and indexes them for search.

# Sync all configured repos for a community
uv run osa sync docstrings --community eeglab --language matlab

# Sync Python docstrings
uv run osa sync docstrings --community eeglab --language python

# Sync a specific repo
uv run osa sync docstrings --community eeglab -r sccn/eeglab -b develop

Options:

Option	Description
`-c, --community`	Community ID (default: `hed`)
`-l, --language`	Language: `matlab` or `python` (default: `matlab`)
`-r, --repo`	Single repo to sync (`owner/name` format)
`-b, --branch`	Branch to sync from (default: `main`)

Repositories and their branches are configured in config.yaml under docstrings.repos.

`osa sync mailman`¶

Sync mailing list messages from Mailman archives.

# Sync all configured mailing lists for a community
uv run osa sync mailman --community eeglab

# Sync a specific mailing list
uv run osa sync mailman --community eeglab --list eeglablist

# Sync a specific year range
uv run osa sync mailman --community eeglab --start-year 2020 --end-year 2025

Options:

Option	Description
`-c, --community`	Community ID (default: `eeglab`)
`--list`	Specific mailing list name
`--start-year`	Earliest year to sync
`--end-year`	Latest year to sync

Mailing lists are configured in config.yaml under mailman.

`osa sync faq`¶

Generate FAQ summaries from mailing list threads using a two-stage LLM pipeline.

# Estimate cost first
uv run osa sync faq --community eeglab --estimate

# Generate FAQs with quality threshold
uv run osa sync faq --community eeglab --quality 0.7

# Limit number of threads to process
uv run osa sync faq --community eeglab --max 100

Options:

Option	Description
`-c, --community`	Community ID (default: `eeglab`)
`--estimate`	Estimate cost without processing
`--quality`	Minimum quality score, 0.0-1.0 (default: `0.6`)
`--max`	Maximum threads to process

LLM Costs

FAQ generation uses LLM calls. Always run with --estimate first to understand costs. The two-agent pipeline (evaluation + summarization) reduces costs by filtering low-quality threads early.

`osa sync beps`¶

Sync BIDS Extension Proposals (BEPs) from the BIDS website and specification PRs.

uv run osa sync beps --community bids

Options:

Option	Description
`-c, --community`	Community ID (default: `bids`)

`osa sync discourse`¶

Sync Discourse forum topics from a community's Discourse instance. Uses the public JSON API (no authentication required).

# Sync forum topics for a community
uv run osa sync discourse --community mne

# Full sync (not incremental)
uv run osa sync discourse --community mne --full

# Limit number of topics (for testing)
uv run osa sync discourse --community mne --max 10

Options:

Option	Description
`-c, --community`	Community ID (default: `mne`)
`--full`	Full sync instead of incremental
`--max`	Maximum topics to sync (for testing)

The syncer uses patient rate limiting (1 request per second by default) to respect the Discourse API limits (200 requests per minute). Topics are stored with their first post and either the accepted answer or the highest-voted reply.

Discourse forums are configured in config.yaml under discourse.

`osa sync all`¶

Sync all knowledge sources for one or all communities. Runs GitHub, papers, Discourse, and BEPs (for BIDS) in sequence.

# Sync everything for one community
uv run osa sync all --community hed

# Sync everything for all communities
uv run osa sync all

# Full non-incremental sync
uv run osa sync all --community bids --full

`osa sync status`¶

Show sync status and database statistics. This is a read-only command and does not require API keys.

# Show status for all communities
uv run osa sync status

# Show status for a specific community
uv run osa sync status --community bids

`osa sync search`¶

Search the knowledge database (useful for testing).

# Search a community's knowledge base
uv run osa sync search "validation error" --community hed

# Filter by source
uv run osa sync search "ICA" --community eeglab --source github

Automated Sync¶

When running OSA in production, each community has its own sync schedule configured in its config.yaml under the sync key. Schedules use APScheduler cron expressions (UTC timezone).

Example from EEGLAB config:

sync:
  github:
    cron: "0 2 * * *"       # daily at 2am UTC
  papers:
    cron: "0 3 * * 0"       # weekly Sunday at 3am UTC
  docstrings:
    cron: "0 4 * * 1"       # weekly Monday at 4am UTC
  mailman:
    cron: "0 5 * * 1"       # weekly Monday at 5am UTC
  faq:
    cron: "0 6 1 * *"       # monthly 1st at 6am UTC

Not all sync types are required. A community only needs schedules for the sync types it uses.

Database Location¶

Each community has its own SQLite database:

Environment	Location
Local (macOS)	`~/Library/Application Support/osa/knowledge/{community_id}.db`
Local (Linux)	`~/.local/share/osa/knowledge/{community_id}.db`
Docker	`/app/data/knowledge/{community_id}.db`

The location can be overridden with the DATA_DIR environment variable.

API Keys¶

All API keys are optional but recommended for higher rate limits:

API Key	Purpose	Required For
`GITHUB_TOKEN`	GitHub API (issues/PRs)	`sync github`
`SEMANTIC_SCHOLAR_API_KEY`	Semantic Scholar API	`sync papers -s semanticscholar`
`PUBMED_API_KEY`	PubMed/NCBI API	`sync papers -s pubmed`
`OPENALEX_EMAIL`	OpenALEX polite pool	`sync papers -s openalex`

Note

sync status and sync search are read-only and do not require API keys.

Without keys, rate limits apply:

GitHub: 60 requests/hour
Semantic Scholar: ~100 requests/5 minutes
PubMed: 3 requests/second

Agent Tools¶

Each community's assistant gets knowledge discovery tools based on its configuration:

Tool	Description	Config Requirement
`search_{community}_discussions`	Search GitHub issues and PRs	`github.repos`
`list_{community}_recent`	List recent GitHub activity	`github.repos`
`search_{community}_papers`	Search academic papers	`citations`
`search_{community}_code_docs`	Search code docstrings	`docstrings`
`search_{community}_faq`	Search mailing list FAQ	`mailman` + `faq_generation`
`search_{community}_forum`	Search Discourse forum topics	`discourse`

See the Tools section for detailed tool documentation per community.

Troubleshooting¶

Rate limiting¶

If you hit rate limits, configure API keys in your .env file. See the API Keys section above.

Database corruption¶

If the database becomes corrupted, delete and reinitialize:

# Find the database location
uv run osa sync status --community hed

# Delete and reinitialize
rm ~/.local/share/osa/knowledge/hed.db
uv run osa sync init --community hed
uv run osa sync all --community hed

Sync shows 0 items¶

Check that the community's config.yaml has the relevant configuration (e.g., github.repos for GitHub sync)
Ensure API keys are set if required
Run with --full to bypass incremental sync

Knowledge Sync¶

Overview¶

Quick Start¶

CLI Commands¶

osa sync init¶

osa sync github¶

osa sync papers¶

osa sync docstrings¶

osa sync mailman¶

osa sync faq¶

osa sync beps¶

osa sync discourse¶

osa sync all¶

osa sync status¶

osa sync search¶

Automated Sync¶

Database Location¶

API Keys¶

Agent Tools¶

Troubleshooting¶

Rate limiting¶

Database corruption¶

Sync shows 0 items¶

`osa sync init`¶

`osa sync github`¶

`osa sync papers`¶

`osa sync docstrings`¶

`osa sync mailman`¶

`osa sync faq`¶

`osa sync beps`¶

`osa sync discourse`¶

`osa sync all`¶

`osa sync status`¶

`osa sync search`¶