Kuan‐lin Huang

Research Statement

Kuan-Lin Huang investigates the genetic architecture of human diseases, with a primary focus on how germline and somatic variations drive cancer development and progression. In a landmark study published in Cell, Huang analyzed over 10,000 adult cancers to identify pathogenic germline variants across 33 cancer types, pinpointing specific predispositions and their molecular consequences. This work established a foundational landscape for understanding hereditary cancer risks. Huang also led research into the spatial relationships between protein phosphorylation sites and somatic mutations, demonstrating how these interactions alter cellular signaling and contribute to oncogenesis. As a senior and corresponding author, Huang has expanded these investigations into diverse clinical areas, including the genomic features that distinguish young-onset from later-onset cancers and the ancestry-specific germline variants that predispose certain populations to malignancy. His lab has also developed computational approaches to study the immune response in DNA damage repair-deficient tumors and identified proteogenomic targets in hepatocellular carcinoma. Beyond oncology, Huang has contributed to large-scale collaborative efforts characterizing the genetic risk factors for late-onset Alzheimer’s disease and the molecular correlates of genetic ancestry in cancer. His recent work utilizes machine learning and multi-omic integration to predict disease mortality and identify master regulators of protein abundance across various human tissues.

Featured Works

Kuan‐lin Huang · Icahn School of Medicine at Mount Sinai · Jul 18, 2026

Should an NIH-funded paper still be allowed to say: “Data available upon request”?

Our finalist solution for the NIH S-Index Challenge asks a different question: which datasets actually enable downstream science, and how should we reward the researchers who share them?

https://theSindex.org/ Tell us: How should data sharing be measured?


Kuan‐lin Huang · Icahn School of Medicine at Mount Sinai · May 28, 2026

Essay

I Re-implemented Google's AI Co-Scientist (Nature 2026) & Made it Open Source

Google's Co-Scientist paper (Gottweis et al., Nature, 2026) describes a multi-agent hypothesis-generation system: six specialized Gemini agents (Generation, Reflection, Ranking, Evolution, Proximity, Meta-review) plus a Supervisor. The team validated it in the wet lab in three biomedical settings (AML drug repurposing, liver fibrosis, and antimicrobial resistance) and reported per-paper benchmarks against the SOTA reasoning models of early 2025.

This repo is an independent reimplementation. The paper's source code isn't public (?!), but the supplement contains pseudocode for each agent (reference/8 Pseudocode…) and the full prompts (reference/9 Prompts…); together they were enough to rebuild the agent roster faithfully (?). What follows is a side-by-side: the design choices, the gaps relative to the paper, and what we learned from running the included co-scientist bench harness against the paper's preference-ranking comparison.

GitHub: https://github.com/Kaimen-Inc/Co-Scientist

Full benchmark results are showcased on AI Scientist Arena.

TL;DR

The agents are a reimplementation of the paper's agents. Roster, prompts, debate/tournament/evolution logic all follow the supplement. The same six worker types + Supervisor; Elo-1200 init; the same meta-review→prompt-append "learning without backprop" loop.
The infrastructure is not the paper's infrastructure. The paper runs on Google internal scheduling; we built a durable SQLite-backed task queue with bounded concurrency, FAISS-backed proximity, and a provider-agnostic LLM layer so we can test more LLMs (Anthropic / OpenAI / Gemini / OpenRouter / Groq / Together / Mistral / Ollama / OpenAI-compat).
What we can't reproduce: the paper's wet-lab validation (we have no lab) and the paper's compute scale (they scale test-time compute hard; we cap per-session at a few USD by default).
What we can reproduce: the preference-ranking style comparison against the paper's frontier baselines. We added a wrinkle the paper doesn't: each model runs both through the full multi-agent pipeline and as a single raw LM call, in the same Elo pool, so the harness's contribution gets a number.
What the benchmarks actually showed (20 benches, 48 AML hypotheses total): the pipeline reliably finishes and produces a hypothesis. Whether it helps the underlying model is not reproducible at n=1 — the direct→pipeline Elo delta swings sign across reruns for at least one model. Models converge on mechanisms (LSC targeting, OXPHOS, BCL-2) but diverge wildly on which drug to propose, and none of the 48 hypotheses ever hit the paper's original picks (Nanvuranlat, KIRA6, Leflunomide).

1. The architecture we kept

The agent roster, prompts, and the control flow come straight from the paper's supplement.

                   co-scientist run "<goal>"
                              │
                              ▼
        ┌──────────────────────────────────────┐
        │            Supervisor                │  durable task queue (SQLite)
        │  • parse_goal → ResearchPlan         │  bounded concurrency
        │  • enqueue initial Generation tasks  │  lease + dead-letter + resume
        │  • main loop: claim → run → follow-up│
        │  • decide_next_steps when idle       │  termination:
        │  • finalize: meta-review overview    │   BUDGET / WALL_CLOCK / ELO_STABLE / IDLE / EXTERNAL
        └──────────────────────────────────────┘
                              │  tasks
        ┌─────────────────────┼─────────────────────────────┐
        ▼                     ▼                             ▼
   Generation  ──hyp──►  Reflection  ──review──►  Ranking ──► Elo
   literature              full +                pairwise vs
   + debate                verification          debate
        ▲                     ▲                             │
        │                     │ informative pairings        ▼
   Evolution  ◄──── Meta-review  ─────────────►  Proximity
   combine/                feedback                FAISS embed
   simplify/               + final                 + cluster /
   feasibility/            overview                dedup
   out_of_box

The agent mapping

| Paper agent | Our implementation | Notes |
| --- | --- | --- |
| Generation | co_scientist/agents/generation.py | Two modes per the paper: literature (web search → synthesize → propose) and debate (simulated multi-turn scientific debate). Forces a record_hypothesis tool call to make output parseable across providers. |
| Reflection | co_scientist/agents/reflection.py | Three review modes: review (novelty + correctness + testability + safety, with web search), verification (deep decomposition into sub-assumptions), observation (does this hypothesis explain long-tail observations?). Skips the paper's "initial review": the paper itself notes it's mostly a cost-saving filter, and with our budgets it's not a bottleneck. |
| Ranking | co_scientist/agents/ranking.py | Elo-1200 init; multi-turn debate for top-ranked pairs, single-turn pairwise for the rest. Structured verdict via a forced record_verdict tool call (the paper parses text, which is fragile across providers; we picked the cleaner route). |
| Evolution | co_scientist/agents/evolution.py | Four strategies (the paper lists six; we collapsed "inspiration from existing" + "enhancement through grounding" into our four since they share the same prompt scaffolding): combine, simplify, feasibility, out_of_box. Generates new hypotheses, never mutates existing ones (same as the paper, to protect tournament-tested entries). |
| Proximity | co_scientist/agents/proximity.py | FAISS IndexFlatIP per session; Voyage → OpenAI → hash-fallback embedder chain. Drives dedup and informative tournament pairings (closer pairs get scheduled more often). |
| Meta-review | co_scientist/agents/metareview.py | Two modes: system (synthesize patterns from reviews + match transcripts, feed back as appended prompt to all agents; this is the paper's "learning without backprop"), final (research overview at end of run). |
| Supervisor | co_scientist/agents/supervisor.py | Parses goal → research plan, schedules worker tasks, computes summary stats periodically (Elo distribution, queue depth, agent effectiveness), decides what to schedule next, decides when to stop. |

The prompts

The 14 Jinja templates in config/prompts/ are direct ports of the paper's supplementary prompts. We kept the structure verbatim (modulo Jinja interpolation for goal, hypothesis, prior context). Where a template and the paper's published prompts in reference/9 Prompts for the specialized agents in .md disagree, that file is the canonical reference.

Termination

The paper terminates when summary statistics suggest a "terminal state." We made that explicit — five reasons:

BUDGET — token/USD cap hit (default $2 per run)
WALL_CLOCK — deadline crossed
ELO_STABLE — top-K Elo hasn't moved more than ε over the last N matches
IDLE — empty queue + no follow-ups to schedule
EXTERNAL — user invoked pause/abort

Live in co_scientist/orchestrator/termination.py.
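A minimal sketch of how the five reasons above could be checked in one place. This is not the code in termination.py: the `RunState` fields, the check order, and the default `window` and `epsilon` values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class StopReason(Enum):
    BUDGET = "budget"
    WALL_CLOCK = "wall_clock"
    ELO_STABLE = "elo_stable"
    IDLE = "idle"
    EXTERNAL = "external"


@dataclass
class RunState:
    # Hypothetical snapshot of a session; field names are illustrative.
    spent_usd: float
    budget_usd: float
    now: float
    deadline: float
    top_k_elo_history: Sequence[Sequence[float]]  # one top-K snapshot per match, oldest first
    queue_depth: int
    pending_follow_ups: int
    abort_requested: bool


def elo_stable(history, window, epsilon):
    """True if no top-K rating moved more than epsilon over the last `window` matches."""
    if len(history) < window + 1:
        return False  # not enough matches played yet
    recent = history[-(window + 1):]
    baseline = recent[0]
    return all(
        abs(rating - base) <= epsilon
        for snapshot in recent[1:]
        for rating, base in zip(snapshot, baseline)
    )


def check_termination(state, window=20, epsilon=5.0) -> Optional[StopReason]:
    """Return the first stop reason that applies, or None to keep running."""
    if state.abort_requested:
        return StopReason.EXTERNAL
    if state.spent_usd >= state.budget_usd:
        return StopReason.BUDGET
    if state.now >= state.deadline:
        return StopReason.WALL_CLOCK
    if elo_stable(state.top_k_elo_history, window, epsilon):
        return StopReason.ELO_STABLE
    if state.queue_depth == 0 and state.pending_follow_ups == 0:
        return StopReason.IDLE
    return None
```

Checking the user abort first and the idle condition last is a design choice in this sketch, so that an explicit stop always wins over a computed one.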

2. What's not the paper

Three categories.

2a. Infrastructure: we couldn't reuse theirs

Their async task framework runs on Google internal infra. Ours runs on:

SQLite + WAL with busy_timeout and an idempotent migration runner. 15 tables, including sessions, hypotheses, reviews, tournament_matches, elo_journal, tasks (durable queue), transcripts, system_feedback, embeddings_meta, spans, events, bench_runs, bench_candidates, bench_matches.
A lease/dead-letter task queue with resume after crash. The paper's framework restarts cleanly from context memory; ours restarts cleanly from tasks.status = 'leased' AND lease_expires_at < now().
An event bus that fans out to SSE for the live web UI (co_scientist/web/).
A provider-agnostic LLM layer (co_scientist/llm/) — the paper is Gemini-only; we support 9 provider backends and let you pin a different model per agent role (generation, reflection, ranking_pairwise, metareview_final, etc.) via TOML. Prompt-cache breakpoints are wired up for Anthropic; reasoning effort knobs for OpenAI o-series and Anthropic thinking.
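To make the lease/dead-letter idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the repo's schema or code: the column names beyond `status` and `lease_expires_at`, the `'dead'` status label, and the `max_attempts` policy are assumptions.

```python
import sqlite3

# Hypothetical schema: only status and lease_expires_at are named in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    lease_expires_at REAL
)
"""


def open_queue(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are opened and closed explicitly below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, payload):
    return conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,)).lastrowid


def claim(conn, now, lease_seconds=60.0, max_attempts=3):
    """Lease the oldest runnable task, reclaiming expired leases after a crash."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        # Dead-letter tasks whose lease expired after too many attempts.
        conn.execute(
            "UPDATE tasks SET status = 'dead' "
            "WHERE status = 'leased' AND lease_expires_at < ? AND attempts >= ?",
            (now, max_attempts),
        )
        row = conn.execute(
            "SELECT id, payload FROM tasks "
            "WHERE status = 'pending' OR (status = 'leased' AND lease_expires_at < ?) "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE tasks SET status = 'leased', attempts = attempts + 1, "
                "lease_expires_at = ? WHERE id = ?",
                (now + lease_seconds, row[0]),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def complete(conn, task_id):
    conn.execute(
        "UPDATE tasks SET status = 'done', lease_expires_at = NULL WHERE id = ?",
        (task_id,),
    )
```

A worker that crashes mid-task simply never calls `complete`; the next `claim` after the lease expires picks the task up again, which is the resume behavior described above.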

2b. Validation: we couldn't run wet lab

The paper's three real-world validations (AML drug repurposing in vitro, liver fibrosis epigenetic targets in hepatic organoids, recapitulating the cf-PICI mechanism) require a lab. We have none.

The closest in-silico proxy we built is recall against curated answer keys from the paper:

| label | size | what it is |
| --- | --- | --- |
| aml-repurposing-paper-top3 | 3 | The paper's strict methodology picks: no prior preclinical evidence, no external inputs (no DepMap, no expert curation). Nanvuranlat, KIRA6, Leflunomide. |
| aml-repurposing-paper-5 | 5 | The broader 5-drug list from the main text. Binimetinib, Pacritinib, Cerivastatin, Pravastatin, Dimethyl fumarate. |

The matcher checks whole-token, case-insensitive against every searched field (title, summary, full text, entities, citation excerpts) of every hypothesis a candidate produces. Drug class mentions (e.g. "DHODH inhibitor") don't count — the name has to be there.
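A whole-token, case-insensitive match of this kind can be sketched with a regular expression. The function names here are hypothetical and the repo's scorer may define token boundaries differently; this version treats any non-word character (space, comma, hyphen) as a boundary.

```python
import re


def build_matcher(names):
    """Compile one case-insensitive pattern that matches any name as a whole token."""
    # Longest names first so a longer name wins over a shorter one at the same position.
    alternation = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def answer_key_hits(names, fields):
    """Return the answer-key names found in any of the searched text fields."""
    pattern = build_matcher(names)
    canonical = {n.lower(): n for n in names}
    found = set()
    for text in fields:
        for match in pattern.finditer(text):
            found.add(canonical[match.group(0).lower()])
    return found


TOP3 = ["Nanvuranlat", "KIRA6", "Leflunomide"]
fields = [
    "We propose leflunomide in combination with venetoclax.",  # name present: counts
    "A DHODH inhibitor could exploit this dependency.",        # class mention: does not count
]
print(answer_key_hits(TOP3, fields))
```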

2c. Scale: we couldn't run their compute

Their Fig. 2a is over 203 research goals with the system left to scale test-time compute. Their Fig. 2b is 15 expert-curated goals each run head-to-head against five other LLMs. Our default budget is $2/session and --n 1; the full picture is in BENCH_RESULTS.md.

3. The `bench` harness — a knob the paper doesn't have

The paper compares Co-Scientist against frontier LLMs (Gemini 2 Pro/Flash, o1, o3-mini-high, DeepSeek R1) via Elo. We do the same thing — --preset paper reproduces those baselines via OpenRouter. But the paper's comparison answers "is the system better than the underlying model?" by comparing Co-Scientist (on Gemini 2.0) against plain Gemini 2.0, plain o1, etc. The harness's contribution is conflated with everything else.

Our *-vs-raw presets cleave it cleanly: same model, two modes, same Elo pool.

pipeline — model runs through the full Generation agent (literature tools + tool-use loop + dedup + forced record_hypothesis)
direct — same model, same prompt, single LM call with the same forced record_hypothesis, no tools

Round-robin pairings, structured verdict from one fixed judge model (so no candidate scores its own work), all matches recorded with rationale in bench_matches.
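The pooled-Elo idea can be sketched as follows. Only the 1200 starting rating comes from the text above; the K-factor of 32, the `judge` callable, and the helper names are assumptions for illustration, not the harness's actual code.

```python
from itertools import combinations

INITIAL_ELO = 1200.0
K_FACTOR = 32.0  # assumption: the K value used by the harness is not stated


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings, winner, loser, k=K_FACTOR):
    """Apply one match result in place; the total rating in the pool is conserved."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def run_round_robin(candidates, judge, matches_per_pair=2):
    """Every pair meets matches_per_pair times; judge(a, b) returns the winner."""
    ratings = {c: INITIAL_ELO for c in candidates}
    for a, b in combinations(candidates, 2):
        for _ in range(matches_per_pair):
            winner = judge(a, b)
            loser = b if winner == a else a
            update(ratings, winner, loser)
    return ratings


def harness_delta(ratings, model):
    """Pipeline minus direct Elo for one model, both drawn from the same pool."""
    return ratings[(model, "pipeline")] - ratings[(model, "direct")]
```

Each candidate is a `(model, mode)` pair, so the same model appears twice in one pool and the difference between its two ratings is the number attributed to the harness.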

This is the only knob in the paper's space that we're set up to measure cleanly. So we did, twice, and the result is — instructively — noisier than I expected.

4. What 20 benches actually showed

Full table is in BENCH_RESULTS.md. The summary:

4a. The harness completes; whether it helps doesn't replicate at n=1

After the d207233 pipeline-reliability fixes, every candidate finishes pipeline mode in the recent runs. The misses are external — a transient HTTP 429 and gemini-2.5-pro intermittently returning an empty completion on the forced final call (2 of 3 attempts). So "the harness ships a hypothesis" is reliable.

The Δ-Elo from direct to pipeline is not reliable across reruns. Same preset, identical settings, two runs:

| model | run 1 Δ Elo | run 2 Δ Elo |
| --- | --- | --- |
| claude-haiku-4.5 | +180 (raw 1-9, pipe 10-0) | −28 (raw 10-2, pipe 8-4) |
| openai-o1 | +43 | +29 |

Haiku's raw win-loss alone flipped 1-9 → 10-2 across the two runs. With one hypothesis per candidate and ~2 matches per pair, the tournament gets dominated by which single hypothesis happened to get sampled, not by the mode. The earlier-draft headline "pipeline beats raw" was reading sampling noise.

For single-run deltas, the spread is wide and doesn't line up by provider:

| model | Δ Elo (single run) |
| --- | --- |
| claude-opus-4.7 | +97 |
| gemini-2.5-flash | +172 |
| gemini-2.5-pro | +47 |
| gpt-5 | +26 |
| gemini-2.0-flash | −48 |
| gemini-3-flash | −36 |
| gemini-3-pro | −89 |

Within Google alone the 2.5 models gain (+172, +47) and the 3.x models lose (−36, −89). The only signal we'd call repeatable is openai-o1, modestly ahead in pipeline mode across both reruns.

Practical implication: at this --n, don't read a single bench's Elo as a model verdict. Putting a number on the harness contribution per model needs many more seeds (higher --n, more --matches) so the per-hypothesis variance washes out. The paper sidesteps this entirely by running on 203 goals and 15 goals respectively; our $-budgeted runs can't.

4b. Models converge on mechanisms, diverge on drugs

Across all 48 AML hypotheses recorded on this codebase (every bench × every candidate × every mode):

| recurring theme | hypotheses (of 48) |
| --- | --- |
| leukemic-stem-cell (LSC) targeting | 28 |
| OXPHOS / mitochondrial complex I | 8 |
| BCL-2 / MCL-1 (Venetoclax axis) | 7 |
| FLT3-ITD | 6 |
| fatty-acid oxidation | 5 |
| ferroptosis | 3 |

These are the well-known AML vulnerability classes. The agreement is not surprising.

At the drug level it's a long tail of one-offs. Compounds proposed more than once:

Itraconazole ×5 (as an OXPHOS inhibitor)
Auranofin ×2 (thioredoxin-reductase)
Venetoclax ×6 — but always as the resistance/combo context, not the novel candidate

Every one of those three already has prior published AML evidence. The strict prompt explicitly forbids that. The models default to the familiar; "recurrence across models" is a weak novelty signal at best.

4c. None of 48 hypotheses hit the paper's original picks

0/3 on aml-repurposing-paper-top3 (Nanvuranlat, KIRA6, Leflunomide). 0/5 on aml-repurposing-paper-5 (Binimetinib, Pacritinib, Cerivastatin, Pravastatin, DMF).

The drugs proposed by our co-scientist include: bazedoxifene (GP130/IL-6), pirfenidone (TGF-β/p38), meldonium (carnitine depletion), pitavastatin (isoprenylation), brensocatib (DPP1), sitagliptin (DPP4), niclosamide (STAT3), denifanstat (FASN). Several may be pre-clinically interesting. They're just not the Nature paper's picks. With --n 1, the match is mostly a draw from a wide distribution; the paper achieves hits by running massive tournaments with iterative refinement until top-ranked hypotheses converge, and that's the part our $2-per-session budget can't afford.

5. What this exercise tells us about the paper

A few things the reimplementation makes more visible:

The architecture is portable. Stripping the Gemini-only assumption out wasn't hard — the agent prompts are mostly about thinking style (debate, deep verification, out-of-box evolution) and the LLM is a substrate. Any tool-using LM with reasonable context length plugs in.

The Elo tournament is doing some work for some models. It's the substrate that turns single sampled hypotheses (high variance) into a stable ranking (lower variance). At our scale the tournament is too thin to stabilize the ranking for every model, and it may not improve every model (the Gemini 3 series, for example), pending more testing.

What is the gold standard? Even strong models hit 0/3 on the strict top-3 list proposed by the original paper, because "find the same novel drug a paper found" is a sample-of-one match against a high-cardinality distribution. The paper avoided this by generating thousands of candidates and ranking, then letting expert oncologists pick the top 30 → 5. Mechanism-class recall (LSC, OXPHOS, fatty-acid oxidation) seems to be a better fit, and it may have held up. Cancer has been cured in vitro many times; whether these are actually good drugs remains to be tested...

6. Where to look in this repo

README.md — install, run, configure
docs/BENCH_RESULTS.md — auto-generated, every bench's per-candidate Elo, every hypothesis, original paper hits, file pointers
co_scientist/agents/ — the seven agents
config/prompts/ — the 14 Jinja prompts derived from the paper's supplement
reference/ — paper source materials (pseudocode, prompts, diagrams), gitignored
co_scientist/bench/ — the head-to-head harness, original paper hit scorer, presets
scripts/build_bench_report.py — regenerate BENCH_RESULTS.md from SQLite after a new bench

Apache-2.0. Independent reimplementation. I'm not affiliated with Google or the paper's authors & thank them for the initial contribution. GitHub: https://github.com/Kaimen-Inc/Co-Scientist


Kuan‐lin Huang · Icahn School of Medicine at Mount Sinai · May 20, 2026

Once you claim your scholarly account with a matched name here, JRNLClub AI will help you generate, in a minute, an accurate CV that sources all your papers! https://www.youtube.com/watch?v=89fq9jeLr3I


Kuan‐lin Huang · Icahn School of Medicine at Mount Sinai · May 19, 2026

Essay

How I won the NIH replication prize by using AI to validate drug targets at scale

About 90% of cancer drug candidates that enter clinical trials never make it to approval. A big chunk of that failure is upstream: the target was wrong. Two industry audits made this concrete years ago. Bayer reported in 2011 that only 20–25% of published cancer targets held up when their own scientists tried to reproduce them; Amgen in 2012 said just 6 out of 53 "landmark" oncology studies survived rigorous replication. We've known this for a long time. We just haven't had a way to do something about it at scale (at least in the published literature).

Manually re-validating every published target is tedious. You'd need to harmonize lots of CRISPR, omics, and other data, work out the right disease subgroupings, write the code, run the stats, and look at the output. Each target takes days to validate. Nobody's funded to do it (in academia). So most candidates sit there, cited, repeated, occasionally bankrolled into a screen.

So I tried something else because it's 2025 (when this was done). I gave the job to an AI agent (Biomni) and ran 31 published oncology targets through it in an afternoon. The compute cost $68 in Claude API credits. About two-thirds of the retracted-paper targets failed to replicate. Roughly two-thirds of the recent, non-retracted targets did replicate. Compared with retracted targets, the non-retracted ones had an odds ratio of 17 for showing bona fide, context-specific dependency in agent analyses that I validated as correct.

The interesting part isn't the headline number. It's how to get an agent to do this kind of work without it making things up.

1. Find out what the agent can do reliably

Most of the hype around "AI scientists" frames the agent as a generalist that does everything. That's a trap. LLMs hallucinate, especially when asked to use tools or data that they either don't have access to or don't know how to use. But they will almost always write you a beautiful, plausible, partly-wrong narrative.

The move is to find a task class where the agent is reliable, say, above 95% success rate on something you can score. For me that task is: given a gene target, a disease context, and a public dataset like DepMap or TCGA, test whether the gene shows context-specific cancer dependency. Narrow enough that the agent's job is mostly translating a hypothesis into code and stats. Reliable enough that I can trust the agent's executions.

2. Apply it across many use cases

Once you know the agent does one type of thing well, throw a lot of that thing at it. I built a table of 31 targets: 17 from retracted papers, 14 recent candidates with real-looking evidence. Each verbal target claim got translated into a structured natural language prompt with the same template. Gene, context, datasets to use, statistical contrasts to run.

When I first started playing with the agent, the biggest failure mode wasn't bad reasoning. It was the agent failing to access or download the right data files. Then it would start hallucinating or simulating fake data for analyses. To stop this, I wrote a separate cancer-omics data know-how document that spelled out how to pull DepMap through the Bioconductor depmap package and how to grab TCGA Pan-Cancer Atlas data from the NCI Genomic Data Commons. This was before Anthropic released the Skills feature; today you'd just package it as a skill. Once the agent stopped fighting the data layer, the rest of the work got dramatically easier.

Two more constraints made the difference:

Forbid the agent from reading literature. I appended a non-overridable instruction: "You are a data-only replication agent. Do not use any literature search, papers, or external textual knowledge." Without that, the agent fills in gaps from training data, which means it tells you the consensus view of whatever paper it dimly remembers. You want what the data says.
Force everything into executable code. No prose conclusions. Every claim has to come from a notebook cell that loaded real data and ran a real test for me to review.
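The template idea can be sketched like this. Only the quoted data-only instruction comes from the text above; the field labels, the `TargetClaim` class, and the example contrast wording are hypothetical and are not the prompts in the repository.

```python
from dataclasses import dataclass
from typing import List

# Quoted from the essay; appended last so that it is the final instruction the agent reads.
DATA_ONLY_RULE = (
    "You are a data-only replication agent. Do not use any literature search, "
    "papers, or external textual knowledge."
)


@dataclass(frozen=True)
class TargetClaim:
    gene: str
    context: str
    datasets: List[str]
    contrasts: List[str]


def render_prompt(claim: TargetClaim) -> str:
    """Fill the same template for every target so runs are comparable."""
    lines = [
        f"Target gene: {claim.gene}",
        f"Disease context: {claim.context}",
        "Datasets to use: " + ", ".join(claim.datasets),
        "Statistical contrasts to run:",
    ]
    lines += [f"  {i}. {c}" for i, c in enumerate(claim.contrasts, 1)]
    lines.append("Every conclusion must come from an executed code cell that loaded real data.")
    lines.append(DATA_ONLY_RULE)
    return "\n".join(lines)


example = TargetClaim(
    gene="PRMT5",
    context="MTAP-deleted cancers",
    datasets=["DepMap"],
    contrasts=["dependency in MTAP-deleted vs MTAP-intact cell lines"],  # illustrative wording
)
print(render_prompt(example))
```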

3. Validate the process before you trust the results

Before I believed anything the agent said about retracted targets, I needed proof it could find the real ones. So I seeded the panel with well-established synthetic lethal relationships: WRN in microsatellite-unstable tumors, PRMT5 in MTAP-deleted cancers.

The agent successfully re-derived the MTAP–PRMT5 relationship in detail. It stratified cell lines by copy number using a sensible 15% threshold it picked itself, compared dependency between groups, ran the dose-response across copy-number quartiles, and landed on effect sizes consistent with the literature and p-values from 10⁻⁹ to 10⁻¹¹. Once those controls worked, the rest of the panel became interpretable.
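The "compare dependency between groups" step is, at its core, a two-group comparison. Below is a minimal standard-library sketch using a permutation test on the difference in means. The statistical test the agent actually chose is not stated here, and the numbers are invented placeholders, not DepMap values.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Placeholder gene-effect scores (more negative = stronger dependency). Invented for illustration.
deleted = [-1.2, -1.0, -1.4, -0.9, -1.1, -1.3]
intact = [-0.2, -0.1, -0.3, 0.0, -0.25, -0.15]

effect, p_value = permutation_test(deleted, intact)
print(f"mean difference = {effect:.2f}, permutation p = {p_value:.4f}")
```

In a real run the two lists would come from cell lines stratified by the context variable (here, copy-number status), which is exactly the part a reviewer needs to check in each notebook.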

4. Look at every output myself

This is the unglamorous part nobody talks about. The agent produces 31 Python notebooks. A human has to read them to validate and learn what happened. Did the data actually load? Did the statistical test make sense for the question? Did the agent silently swap in a different dataset when the first one failed? Did it interpret "wild type" the same way you meant?

I scored every one of the 31 notebooks manually. Few components turned out to be false after the steps above. The rest I coded as supported, refuted, or inconclusive on two axes: context-specific dependency, and other supporting evidence.

Expert review isn't optional. The good news: it's faster than doing the analysis yourself. Maybe 15 minutes per notebook, against the several days it would take from scratch.

The most interesting result wasn't the big retracted-versus-non-retracted split. It was ALKBH5. The original paper was retracted, and the specific mechanistic claim (that miR-193a-3p regulates AKT2 through ALKBH5) didn't hold up. But the agent independently found that ALKBH5 itself is a real, glioma-selective dependency, with consistent CRISPR and RNAi signals, a strong correlation with stemness scores, a very strong negative correlation with the m6A gene signature, and a significant survival hazard ratio across gliomas.

You get insights like this because the agent decomposed the target claim into testable pieces and ran each one independently. That's the part I didn't expect, and it's the part that's made me think this approach generalizes well beyond target replication.

On AI Scientist Arena (aiscientistarena.com), I've benchmarked LLMs, and even without any sophisticated tool use or harness they could predict clinical trial success beyond noise. If AI agents continue to improve across all tasks in the drug discovery and development cycle, the best constructor of an entire clinical program might end up being an AI.

All of this (the prompts, the data and replication know-how documents, the 31 notebooks, the expert scoring) is at github.com/Huang-lab/AgentReplication. The bioRxiv preprint is titled "Agent-Driven Validation of Oncology Therapeutic Targets." This is part of the work that initiated the Accelerated Discovery with Agents (ADA) Consortium.

There's a version of this work that sounds bigger than it is. "AI agent validates 31 cancer drug targets in one hour" is technically true and somewhat misleading. The hour is the agent's compute time. Building the prompts, curating the targets, writing the know-how documents, and reviewing every notebook took weeks. The agent isn't doing the science. It's doing the implementation.

The science is still in deciding what to ask and whether the answer means anything to benefit humans.

Postscript, May 2026: This was my Track 2 submission to the NIH Replication Prize, completed in Nov 2025, and the entry I thought was the better one. My other entry, proposing mandatory release of participant-level clinical trial data, won Track 1.


Kuan‐lin Huang · Icahn School of Medicine at Mount Sinai · May 15, 2026

Essay

How I rebuilt Variant Effect Predictor to be 100x faster (fastVEP!)


If you work with genomic variants, you know VEP. Ensembl's Variant Effect Predictor is the standard tool — the thing your pipeline calls to figure out whether a given mutation breaks a protein, hits a splice site, or sits harmlessly in some intron. It's been around forever and it works. It's also written in Perl, ships with a Perl 5.22+ requirement, ten-plus CPAN modules, a DBI dependency, and a small graveyard of installation issues anyone who's set up VEP from scratch will recognize.

The annotation itself is fine. The speed is not. Annotating 50,000 variants with VEP takes about 206 seconds. Point it at a full human WGS (~4 million variants) and it doesn't finish on the newest MacBook Pro. People work around this by splitting their VCFs, running parallel processes, and stitching the outputs back together. That works, but it's a huge time tax. A lab running thousands of samples pays that tax every day.

So I rebuilt it in Rust.

The numbers

fastVEP runs the same 50,000-variant file in 1.59 seconds. That's a 130x speedup. The full WGS that VEP can't finish? fastVEP does it in 86 seconds.

Peak memory drops from ~500 MB to 2.8 MB. The installed binary is 3.3 MB instead of ~200 MB of Perl plus dependencies. There are no CPAN modules to chase. You cargo install, you run a binary, that's it.

That's the headline. The interesting part is what actually made it fast. It wasn't one thing. It was the dumb stuff Perl couldn't do well, layered on top of a few good ideas.

What Rust gets you for free

A lot of the speedup is just what you get when you stop paying for an interpreter and a garbage-collected dynamic language. Tight loops over variant records compile to real machine code. Strings don't allocate when they don't need to. Parallelism is rayon and works; you don't fork ten Perl processes and reconstitute their output.

Thanks to agentic coding, this was manageable as one person's effort over a full month. It requires knowing exactly how the algorithm works in order to instruct the coding agents, and verifying extensively with tests and outputs. The algorithm is mostly this: the Sequence Ontology has 49 consequence terms; you map a variant's coordinates against a transcript and figure out which ones apply. The bottleneck in the Perl version is the Perl, not the algorithm.
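To give a flavor of "map a variant's coordinates against a transcript", here is a toy sketch for a single-base variant on a plus-strand transcript. It is not fastVEP's engine (which is written in Rust) and covers only a handful of Sequence Ontology terms; the 5,000 bp flank and the 2-base splice window are assumptions mirroring common annotator defaults, and real tools split coding variants further (missense, synonymous, stop gained, and so on).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    # Hypothetical minimal model: plus strand only, 1-based inclusive coordinates.
    exons: List[Tuple[int, int]]  # sorted by start
    cds_start: int
    cds_end: int


def consequence(tx: Transcript, pos: int, flank: int = 5000) -> str:
    """Return one coarse Sequence Ontology term for a single-base variant."""
    tx_start, tx_end = tx.exons[0][0], tx.exons[-1][1]
    if pos < tx_start:
        return "upstream_gene_variant" if tx_start - pos <= flank else "intergenic_variant"
    if pos > tx_end:
        return "downstream_gene_variant" if pos - tx_end <= flank else "intergenic_variant"
    for i, (start, end) in enumerate(tx.exons):
        if start <= pos <= end:
            if pos < tx.cds_start:
                return "5_prime_UTR_variant"
            if pos > tx.cds_end:
                return "3_prime_UTR_variant"
            return "coding_sequence_variant"
        if i + 1 < len(tx.exons) and end < pos < tx.exons[i + 1][0]:
            if pos - end <= 2:
                return "splice_donor_variant"     # first two intronic bases
            if tx.exons[i + 1][0] - pos <= 2:
                return "splice_acceptor_variant"  # last two intronic bases
            return "intron_variant"
    raise AssertionError("unreachable: position is inside the transcript span")


tx = Transcript(exons=[(100, 200), (300, 400)], cds_start=150, cds_end=350)
for p in (50, 120, 160, 201, 250, 299, 380, 450):
    print(p, consequence(tx, p))
```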

If you stop there, you get maybe 10–20x. The rest came from somewhere else.

The next real win: rebuilding the annotation lookup

VEP's slowest path is annotation lookup: pulling in ClinVar, gnomAD, dbSNP, COSMIC, all the supplementary databases that turn raw consequence into something a clinician can act on. The default workflow round-trips through SQLite or remote APIs. For a million variants, that's a million lookups, and every one of them costs more than the consequence prediction itself.

The fix is to put the annotations in a format designed for the access pattern. fastVEP has its own binary format called fastSA, and the v2 design is shamelessly inspired by echtvar (thanks to Brent Pedersen's work; credit where it's due). The key improvements, as I understand them:

Chunked ZIP layout with Var32 encoding for variant keys.
Parallel u32 value arrays per annotation field.
Delta encoding on sorted positions.
An LRU chunk cache, because variant lookups in a real VCF are clustered.
A Bloom filter in front of the index for negative lookups.

Putting ClinVar, gnomAD, and dbSNP into this format and querying them as a single in-process call is most of what closes the gap on the heaviest workloads. You're not asking a database anymore. You're doing memory-mapped byte arithmetic.
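Two of the ideas in the list above, delta encoding on sorted positions and a Bloom filter for negative lookups, are easy to show in isolation. This is a generic Python sketch of the techniques, not the fastSA format: the filter size, hash count, and key layout are assumptions, and the real implementation is in Rust.

```python
import hashlib
from itertools import accumulate


def delta_encode(sorted_positions):
    """Store each position as the gap from the previous one (small numbers compress well)."""
    previous = 0
    deltas = []
    for position in sorted_positions:
        deltas.append(position - previous)
        previous = position
    return deltas


def delta_decode(deltas):
    """Running sum restores the original positions."""
    return list(accumulate(deltas))


class BloomFilter:
    """Answers 'definitely absent' or 'possibly present'; never a false negative."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _indexes(self, key: bytes):
        # One salted hash per probe; blake2b accepts a salt of up to 16 bytes.
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(key, digest_size=8, salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, key: bytes):
        for idx in self._indexes(key):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[idx >> 3] & (1 << (idx & 7)) for idx in self._indexes(key))


positions = [100, 105, 105, 230, 10_000]
encoded = delta_encode(positions)
bloom = BloomFilter()
for pos in positions:
    bloom.add(f"chr1:{pos}".encode())  # hypothetical key layout
```

When `might_contain` returns False the annotation chunk never has to be touched, which is the point for the many variants that have no entry in a given database.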

What surprised me

A few things I didn't expect going in.

The FASTA handling matters more than I thought. You need the reference sequence for HGVS notation, and a naïve read of the GRCh38 primary assembly is enough to wreck your memory budget on its own. Memory-mapping the indexed FASTA and pulling spans on demand was the difference between "fastVEP runs on a laptop" and "fastVEP needs a server." Apparent simplicity hides this kind of thing; samtools faidx is doing a lot of work for you.
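The offset arithmetic behind an indexed FASTA is short enough to sketch. The version below builds a minimal faidx-style index (length, byte offset, bases per line, bytes per line) and pulls a span through `mmap` without reading the whole file. It is a Python illustration of the technique rather than fastVEP's Rust code, and it assumes every sequence line except the last has the same width, as faidx itself requires.

```python
import mmap


def build_index(fasta_path):
    """Map name -> (length, offset, line_bases, line_width), like a .fai file."""
    index = {}
    with open(fasta_path, "rb") as fh:
        name, offset = None, 0
        for raw in fh:
            if raw.startswith(b">"):
                name = raw[1:].split()[0].decode()
                index[name] = [0, offset + len(raw), None, None]
            elif name is not None:
                entry = index[name]
                bases = len(raw.rstrip(b"\r\n"))
                if entry[2] is None:  # first sequence line fixes the line geometry
                    entry[2], entry[3] = bases, len(raw)
                entry[0] += bases
            offset += len(raw)
    return {k: tuple(v) for k, v in index.items()}


def fetch(mm, index, name, start, end):
    """Return bases in the 0-based half-open interval [start, end)."""
    length, offset, line_bases, line_width = index[name]
    if not 0 <= start <= end <= length:
        raise ValueError("span outside sequence")
    if start == end:
        return ""

    def byte_at(pos):
        # Whole lines before pos, plus the position within its line.
        return offset + (pos // line_bases) * line_width + pos % line_bases

    raw = mm[byte_at(start):byte_at(end)]
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


def open_mapped(fasta_path):
    """Memory-map the file read-only; the OS pages in only the bytes that are touched."""
    with open(fasta_path, "rb") as fh:
        return mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
```

Usage is `index = build_index(path)`, `mm = open_mapped(path)`, then `fetch(mm, index, "chr1", start, end)` for each span needed for HGVS.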

Structural variants are genuinely separate code. SNVs and short indels share a clean abstraction. <DEL>, <DUP>, <INV>, <BND> and the rest don't slot into it cleanly. I tried for a while to unify them, eventually gave up, and wrote a separate SV consequence predictor.

HGVS was the worst part. Generating correct HGVSc and HGVSp notation with 3' normalization across all the edge cases — overlapping CDS, mitochondrial circular coordinates, start-loss variants in non-Met-starting transcripts — required more test cases than the consequence engine itself. There's a reason VEP has been worked on for a decade. The annoying details are plenty and real.

Correctness

A faster but wrongly annotated VCF isn't useful. fastVEP is validated against VEP's output on shared test sets and matches on the consequences that matter. The repo has 233 tests across the workspace, not because that number is magic, but because every annoying HGVS edge case eventually became one. If you find a case where fastVEP disagrees with VEP and you think VEP is right, open an issue. Let me know here!

Try it

It's on GitHub at Huang-lab/fastVEP, Apache 2.0. There's a hosted web version at fastVEP.org if you want to paste in some VCF and see what it does. If you have Rust installed, it's a single cargo install away.

It works on yeast, fly, arabidopsis, mouse, human, anything with a GFF3. The web server can switch between organisms if you point it at a directory of them. The preprint is on bioRxiv. If it saves your group some compute time, that's the point and I'm glad :)


Kuan‐lin Huang · Icahn School of Medicine at Mount Sinai · May 15, 2026

Check out the JRNLClub demo to see what you can do here: https://youtu.be/tc_tdoC9LpI?si=1qtEiZ5pRpUEIL2t

Kuan‐lin Huang · Icahn School of Medicine at Mount Sinai · May 14, 2026

Hello World, JRNLClub!



Theme 1

Identifying Pathogenic Variants in Cancer and Complex Diseases

Pathogenic Germline Variants in 10,389 Adult Cancers
901 citations
Kuan-Lin Huang, R Jay Mashl, Yige Wu, Deborah I Ritter, Jiayin Wang, Clara Oh, Marta Paczkowska, Sheila Reynolds, Matthew A Wyczalkowski, Ninad Oak, Adam D Scott, Michal Krassowski, Andrew D Cherniack, Kathleen E Houlahan, Reyka Jayasinghe, Liang-Bo Wang, Daniel Cui Zhou, Di Liu, Song Cao, Young Won Kim, Amanda Koire, Joshua F McMichael, Vishwanathan Hucthagowder, Tae-Beom Kim, Abigail Hahn, Chen Wang, Michael D McLellan, Fahd Al-Mulla, Kimberly J Johnson, Cancer Genome Atlas Research Network, Olivier Lichtarge, Paul C Boutros, Benjamin Raphael, Alexander J Lazar, Wei Zhang, Michael C Wendl, Ramaswamy Govindan, Sanjay Jain, David Wheeler, Shashikant Kulkarni, John F Dipersio, Jüri Reimand, Funda Meric-Bernstam, Ken Chen, Ilya Shmulevich, Sharon E Plon, Feng Chen, Li Ding
Cell, 2018

Ancestry-specific predisposing germline variants in cancer
60 citations
Ninad Oak, Andrew D Cherniack, R Jay Mashl, TCGA Analysis Network, Fred R Hirsch, Li Ding, Rameen Beroukhim, Zeynep H Gümüş, Sharon E Plon, Kuan-Lin Huang
Genome Medicine, 2020

Comprehensive Analysis of Genetic Ancestry and Its Molecular Correlates in Cancer
335 citations
Jian Carrot-Zhang, Nyasha Chambwe, Jeffrey S Damrauer, Theo A Knijnenburg, A Gordon Robertson, Christina Yau, Wanding Zhou, Ashton C Berger, Kuan-Lin Huang, Justin Y Newberg, R Jay Mashl, Alessandro Romanel, Rosalyn W Sayaman, Francesca Demichelis, Ina Felau, Garrett M Frampton, Seunghun Han, Katherine A Hoadley, Anab Kemal, Peter W Laird, Alexander J Lazar, Xiuning Le, Ninad Oak, Hui Shen, Christopher K Wong, Jean C Zenklusen, Elad Ziv, Cancer Genome Atlas Analysis Network, Andrew D Cherniack, Rameen Beroukhim
Cancer Cell, 2020

Non-cancer-related pathogenic germline variants and expression consequences in ten-thousand cancer genomes
9 citations
Zishan Wang, Xiao Fan, Yufeng Shen, Meghana S Pagadala, Rebecca Signer, Kamil J Cygan, William G Fairbrother, Hannah Carter, Wendy K Chung, Kuan-Lin Huang
Genome Medicine, 2021

Machine learning–based penetrance of genetic variants
21 citations
Iain S Forrest, Ha My T Vy, Ghislain Rocheleau, Daniel M Jordan, Ben O Petrazzini, Girish N Nadkarni, Judy H Cho, Mythily Ganapathi, Kuan-Lin Huang, Wendy K Chung, Ron Do
Science, 2025

Theme 2

Proteogenomics to Link Genome Alterations to Treatments

Precision proteogenomics reveals pan-cancer impact of germline variants
8 citations
Fernanda Martins Rodrigues, Nadezhda V Terekhanova, Kathleen J Imbach, Karl R Clauser, Myvizhi Esai Selvan, Isabel Mendizabal, Yifat Geffen, Yo Akiyama, Myranda Maynard, Tomer M Yaron, Yize Li, Song Cao, Erik P Storrs, Olivia S Gonda, Adrian Gaite-Reguero, Akshay Govindan, Emily A Kawaler, Matthew A Wyczalkowski, Robert J Klein, Berk Turhan, Karsten Krug, D R Mani, Felipe da Veiga Leprevost, Alexey I Nesvizhskii, Steven A Carr, David Fenyö, Michael A Gillette, Antonio Colaprico, Antonio Iavarone, Ana I Robles, Kuan-Lin Huang

Theme 3

Developing Publicly Available Bioinformatic Tools

CharGer: clinical Characterization of Germline variants
60 citations
Adam D Scott, Kuan-Lin Huang, Amila Weerasinghe, R Jay Mashl, Qingsong Gao, Fernanda Martins Rodrigues, Matthew A Wyczalkowski, Li Ding
Bioinformatics, 2018

fastVEP: A Fast, Comprehensive Variant Effect Predictor Written in Rust
1 citation
Kuan‐lin Huang
bioRxiv (Cold Spring Harbor Laboratory), 2026

RastQC: High-Performance Sequencing Quality Control Written in Rust

Theme 4

AI/ML Prediction of Patient Outcomes

Machine learning models identify predictive features of patient mortality across dementia types
17 citations
Jimmy Zhang, Luo Song, Zachary Miller, Kwun C G Chan, Kuan-Lin Huang
Communications Medicine, 2024

Machine learning–based penetrance of genetic variants
21 citations
Iain S Forrest, Ha My T Vy, Ghislain Rocheleau, Daniel M Jordan, Ben O Petrazzini, Girish N Nadkarni, Judy H Cho, Mythily Ganapathi, Kuan-Lin Huang, Wendy K Chung, Ron Do


3. The `bench` harness — a knob the paper doesn't have