🔬 ScholarEnv
The first RL environment for AI-assisted peer review and scholarly integrity verification
An AI agent that investigates papers — not one that produces them.
API Reference · Quick Start · Research
Nensi Pansuriya · Krushna Parmar · Ishita Bhojani
Meta × PyTorch OpenEnv Hackathon · Round 1 · April 2026
Why This Exists
~10,000 papers are retracted every year. Every major journal — Nature, Science, IEEE, ACM — has a manual integrity screening bottleneck at scale. StatCheck found errors in ~50% of psychology papers in top journals.
The key insight: LLMs are already good at formatting. They fail at auditing.
Ask GPT-4o to format a manuscript → scores ~0.92 with no training. Ask GPT-4o to find numerical claim mismatches in a paper → scores 0.20–0.45.
That gap is exactly where RL adds value. The agent must discover a document traversal strategy — which sections to read first, which tables to cross-reference — that varies by paper structure and cannot be reduced to a fixed prompt. RL finds this strategy. Prompting cannot.
Four Tasks
Formatting → Consistency → Claim Audit → Citation Check
Easy Medium Hard Medium
| Task | What the agent does | Frontier baseline | RL target |
|---|---|---|---|
formatting_compliance |
Fix IEEE formatting violations | 0.80–0.95 | 0.95+ |
internal_consistency |
Find where paper contradicts itself | 0.40–0.65 | 0.65–0.80 |
claim_evidence_audit |
Find where text claims ≠ table values | 0.20–0.45 | 0.55–0.75 |
citation_verification |
Identify ghost and misattributed references | 0.35–0.60 | 0.65–0.80 |
Task 3's low baseline is the core RL contribution — it proves genuine training headroom exists.
Reward Design
Task 1 — Progressive Reward Shaping (PRS)
Three stages unlock sequentially. Stage N only contributes when Stage N-1 ≥ threshold. Prevents GRPO gradient collapse.
Stage 1 │ weight 0.40 │ threshold 0.00 │ Title, abstract, section headings
Stage 2 │ weight 0.35 │ threshold 0.60 │ Section order, word limits, captions
Stage 3 │ weight 0.25 │ threshold 0.70 │ IEEE citations, author block, keywords
Based on: arXiv 2512.07478 — PRS for Agentic RL
Tasks 2 & 3 — F-beta + Potential-Based Reward Shaping
F-beta (β=0.5) weights precision 4× over recall — prevents hallucination gaming:
F_β(precision=1.0, recall=0.5) = 0.833 ✓ correct and precise
F_β(precision=0.2, recall=1.0) = 0.227 ✗ spamming guesses
PBRS (Ng et al., ICML 1999) gives dense intermediate rewards on every navigation step:
Φ(s) = 0.30 × sections_read/total + 0.30 × tables_checked/total + 0.40 × claims_extracted/est
F(s,s') = γ·Φ(s') − Φ(s) ← policy-invariant, theoretically guaranteed
Curriculum — AdaRFT + UCB1
Keeps agent in productive zone (avg score 0.40–0.70). UCB1 maximises learning gradient (reward variance), not mean reward.
avg > 0.70 → select harder papers
avg < 0.40 → select easier papers
Based on: arXiv 2504.05520 — AdaRFT Adaptive Data Selection
Quick Start
Install
git clone https://github.com/Nensi1311/research-paper-formatter-agent
cd research-paper-formatter-agent
pip install -r requirements.txt
Generate corpus
python scripts/generate_corpus.py
Run tests
python tests/test_all.py
# → ALL TESTS PASSED (63/63)
Start server
uvicorn server.app:app --host 0.0.0.0 --port 7860
Test all 4 tasks — Linux/macOS
for task in formatting_compliance internal_consistency claim_evidence_audit citation_verification; do
curl -s -X POST localhost:7860/reset \
-H "Content-Type: application/json" \
-d "{\"task_id\":\"$task\"}" | python3 -c \
"import sys,json; d=json.load(sys.stdin); print('$task: OK' if 'observation' in d else '$task: FAIL')"
done
Test all 4 tasks — Windows PowerShell
foreach ($task in @("formatting_compliance","internal_consistency","claim_evidence_audit","citation_verification")) {
$body = '{"task_id":"' + $task + '"}'
$r = Invoke-RestMethod -Uri "http://localhost:7860/reset" -Method POST -ContentType "application/json" -Body $body
Write-Host "$task : OK"
}
Docker
docker build -t scholar-env .
docker run -p 7860:7860 scholar-env
curl http://localhost:7860/health
Run baseline agent
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token"
export HF_SPACE_URL="https://flyingmaverick-scholar-env.hf.space"
python inference.py
# Writes: baseline_scores.json
API Reference
POST /reset
{"task_id": "formatting_compliance"}
Returns observation with manuscript_text, style_guide, step_count, max_steps, hint.
POST /step
Task 1 — submit formatted manuscript:
{"task": "formatting_compliance", "formatted_text": "...full reformatted manuscript..."}
Tasks 2/3 — navigate:
{"task": "claim_evidence_audit", "action_type": "query_section", "section_name": "results"}
{"task": "claim_evidence_audit", "action_type": "check_table", "table_id": "Table 1"}
{"task": "claim_evidence_audit", "action_type": "extract_claims", "section_name": "results"}
Tasks 2/3 — submit findings:
{
"task": "claim_evidence_audit",
"action_type": "submit_findings",
"findings": [
{
"type": "table_text_mismatch",
"location": "abstract",
"claim": "Table 2 shows 87% accuracy",
"contradicts": "Table 2 value is 79%",
"table_id": "Table 2",
"table_value": "79%"
}
]
}
Task 4 — check citation:
{"task": "citation_verification", "action_type": "check_citation", "citation_id": "ref_3"}
Task 4 — submit verdicts:
{
"task": "citation_verification",
"action_type": "submit_verdicts",
"verdicts": [
{"citation_id": "ref_3", "status": "ghost", "issue": "Implausible title claim", "confidence": 0.9}
]
}
Step response:
{
"observation": {...},
"reward": 0.7341,
"done": false,
"info": {"f_beta": 0.73, "precision": 0.8, "recall": 0.67}
}
Other endpoints
| Endpoint | Method | Description |
|---|---|---|
/health |
GET | {"status":"ok","version":"0.4.0"} |
/state |
GET | Episode state, curriculum summary |
/tasks |
GET | All 4 task descriptions |
/action_space |
GET | Full action schema |
Project Structure
├── inference.py ← Baseline agent (root — required by spec)
├── models.py ← FormattingAction, ScholarAction, CitationAction
├── corpus.py ← PaperCorpus loader
├── openenv.yaml ← 4 tasks, endpoints, authors, baseline_script
├── Dockerfile
├── requirements.txt
│
├── data/
│ ├── papers/
│ │ ├── paper_001.json ← NLP benchmark (easy)
│ │ ├── paper_002.json ← CV survey (medium)
│ │ └── paper_003.json ← MTL paper (hard)
│ └── styles/ieee.yaml
│
├── server/
│ ├── app.py ← FastAPI endpoints
│ ├── environment.py ← 4-task state machine
│ ├── reward_shaper.py ← PBRS (Ng et al. 1999)
│ ├── curriculum.py ← AdaRFT + UCB1
│ ├── bandit.py ← Learning-gradient UCB1
│ ├── citation_verifier.py ← Citation parser + SQLite cache
│ └── graders/
│ ├── formatting_grader.py ← PRS 3-stage (Task 1)
│ ├── consistency_grader.py← F-beta (Task 2)
│ └── audit_grader.py ← F-beta + PBRS (Task 3)
│
├── scripts/generate_corpus.py
└── tests/test_all.py ← 63 assertions
Testing
[Corpus] 8/8 ✓
[FormattingGrader] 8/8 ✓ PRS stage locking
[ConsistencyGrader] 9/9 ✓ F-beta, hallucination penalty
[AuditGrader] 6/6 ✓ Evidence specificity, coverage bonus
[PBRS] 6/6 ✓ Potential monotonicity, bonus bounds
[UCB1 Bandit] 3/3 ✓ Learning gradient maximisation
[Curriculum] 4/4 ✓ AdaRFT productive-zone targeting
[ScholarEnvironment] 19/19 ✓ Full episode loops, all 4 tasks
Results: 63/63 passed — ALL TESTS PASSED
Research Foundation
| Paper | What it justifies |
|---|---|
| PRS · arXiv 2512.07478 | Task 1 progressive staging prevents GRPO gradient collapse |
| PBRS · Ng, Harada & Russell, ICML 1999 | Policy-invariant dense intermediate rewards |
| AdaRFT · arXiv 2504.05520 | Curriculum targeting [0.40, 0.70] productive zone |
| RLVE · arXiv 2511.07317 | Adaptive difficulty, UCB1 maximises variance |
| Veri-R1 · arXiv 2510.01932 | Online RL for claim verification is current SOTA |
| LaMer · arXiv 2512.16848 | Structured feedback improves agent 11–19% |
| StatCheck · Epskamp 2016 | ~50% of papers have errors — scale motivation |
| GROBID · Lopez 2008–2025 | Prior art; CitationVerifier is our RL-native alternative |
Authors
Nensi Pansuriya · Krushna Parmar · Ishita Bhojani
Meta × PyTorch OpenEnv Hackathon · Round 1 · April 2026