AIxBio Hackathon 2026 - Project Catalog

Fragment Assembly Risk Scorer (FARS): Empirical Characterization of Split-Order Detection Boundaries for Benchtop DNA Synthesizers

Cross-Track Winner Track 4 Winner Track 1 Spotlight Track 4 Spotlight View PDF

Team: Hritika Chaturvedi | Reviews: 2

Tracks: Track 4

Benchtop DNA synthesizers have no mandatory screening today — and existing tools have a structural blind spot: they evaluate DNA fragments individually, missing split-order attacks where multiple innocuous-looking fragments assemble into a sequence of concern. We present the Fragment Assembly Risk Scorer (FARS), a prototype on-device detector that scores orders by collective assembly potential rather than individual fragment identity. Tested against 960 simulated orders across three real 1918 H1N1 genomic segments from NCBI GenBank, FARS achieves 100% detection of high- and medium-coverage split orders with zero false positives. Most importantly, on partial split orders, FARS detects 80% compared to 40% for an individual-sequence baseline modeled on IBBIS commec's methodology, doubling detection while eliminating false positives entirely. The 20% that evade local detection define a precise empirical boundary motivating shared cross-device infrastructure. Our head-to-head comparison is the first empirical quantification of what assembly-awareness buys in split-order detection on real genomic data — and directly characterizes the gap left by S.3741's mandated screening approach.

Show Reviews (2)

Review 1

I liked the well-defined scope and clear result of this submission. While a discussion of the limitations is present, I would have appreciated a more in-depth discussion on the implications of these. For example, the limitation of AI-enabled design is flagged but not really grappled with. For example, the results section could have discussed what the detection boundary would look like against codon-optimized or AI-designed sequences. Additionally, while a great technical paper, the submission could have benefited from a summary in simple terms so that educated people in the field with no specific knowledge of synthesis screening could better grasp the problem and the corresponding solution.

Review 2

Clean writeup, single question asked and answered, impressive for a solo project. 80/40 table lands the point fast, and the policy hook (33%/11% as the line where local detection breaks) is a smart frame. Weakness: It's all simulated orders + only point mutations as the evasion model. You flag both honestly.

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

Cross-Track Winner Track 1 Winner View PDF

Team: Nathan Khosla, Aleksandra Wosztyl | Reviews: 3

Tracks: Track 1 Track 4

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Show Reviews (3)

Review 1

The most interesting finding here (AI-designed Munc13 binders accidentally resembling neurotoxins with no toxin input) is buried at the end. Should probably lead with it. 24/24 catch rate via DALI is great but presentation is a weak spot: figure wrapping in related work breaks the read, and the 20-page appendix has no summary. This is honest about false positives.

Review 2

Functional/structural screening is very important and it seems you made real progress toward that, kudos! I think using ricin as a central example makes sense given that it's a favourite among would-be-terrorists. However, I wonder whether there is some cherry-picking where certain proteins perform unusually well in your pipeline, maybe because they are well-represented in the dataset or similar - whereas your pipeline would struggle to detect functional/structural analogues in other proteins (?). This is just a guess, I am not an expert in this specific area. It would also be interesting how this pipeline performs on viral proteins such as coronavirus family spike proteins or influenza H/N proteins. More infohazardous of course! "a risk acknowledged by the October 2026 revision to the OSTP Nucleic Acid Synthesis Screening Framework, among other policy memos" --> either a typo for the year or LLM hallucination. You mention the pipeline is bottlenecked by compute resources for use at scale. I'd be curious to see cost and speed comparisons with sequence homology screening. This is asking too much for a hackathon submission but if this was a full article I would like an explanation of why exactly these values were chosen and what they actually represent: "DALI similarity scoring was determined if a protein matched a toxin with a z score greater than 10, an identity score greater than 30%, and/or an RMSD less than 2. MMseqs2 similarity was determined if the identity score was greater than 90% than a known toxin sequence. FoldSeek similarity was determined by an identity score greater than 30%, RMSD below 2, e-value below 1e-4, and/or TM score greater than 0.9." The Limitations and Future Work section makes lots of good points. I would have appreciated a short discussion on next steps of how false positives and false negatives in your pipeline could be reduced. IMHO your submission is a bit too biased toward arxiv-style academic writing. That's great for a very particular researcher audience but as a judge who is more in policy-world, it's a bit hard to follow. I think you could take inspiration from the writing of research outputs at places like METR or GovAI who do a great job at writing with rigorous clarity that is still accessible to non-experts. Hence slightly lower points for presentation/clarity but your technical contribution is fantastic.

Review 3

An impressive amount of work covering a known gap in biosecurity. While this is similar work going on in this area, this is a nice piece of work that strings together several public tools into a robust demonstration and proof of principle. The paper is clearly written with a strong ToC and the table is nice and very information dense. A small legend on the second table in the appendix indicating what the colors meant would be nice to have, but minor. For a weekend hackathon, the breadth of the testing dataset is very impressive. Testing against a panel that includes natural toxins, benign structural mimics, amyloid proteins, de novo ricin variants, and AI-designed Munc13 binders demonstrates a deep understanding of the problem space and a very rigorous examination of the screening pipeline. The truncated de novo ricin sequences to simulate a partial-synthesis evasion attack was a strong inclusion. Proving that sequence alignment caught 0/24 of these while your DALI pipeline caught 24/24 is a great way to highlight the importance of moving beyond homology methods. You do well in discussing the false positives and current limitations, flagging that host proteins co-crystalized with toxins get called, and you lay out future directions clearly and with a very strong understanding of the problem space and ongoing work. The tiered design is clever. Using MMseqs2 for fast, low compute initial gating before moving to computationally expensive tools like AlphaFold2, FoldSeek, and DALI is practical and logical system design. But, and you touch on this, running AlphaFold2 and DALI on every single sequence that bypasses MMseqs2 is currently too computationally expensive for commercial scale adoption. But that may just be a problem that resolves as compute gets cheaper. "Finally, trypsin (14), concanavalin A (19), and thrombin (22) were confidently flagged as toxic across all methods, raising a broader question that this and any biosecurity screening tool must explicitly address: what threshold of danger justifies synthesis restriction? The distinction between "toxic in some biological context" and "dangerous enough to warrant screening" is not currently defined in any of the databases used here." - Excellent point. These aren't really biosecurity threats, and a smart synthesis screening system should not flag them, but I agree that is broader work and a bigger question than within this project and for a hackathon. Future work could focus on a few areas. To solve the compute bottleneck, potentially inserting an intermediate machine learning step between MMseqs2 and AlphaFold2. As you mentioned in the report, using a lightweight classifier built on embeddings from a protein language model (like evo2 or ESM-2) could quickly filter out non-homologous benign proteins, reserving AlphaFold2 strictly for highly suspicious sequences. A good path toe xplore further. Moving away from binary "toxic/non-toxic" outputs. Work developing a preliminary heuristic or risk score that considers the exact structural hit and flagging an active site match for botulinum neurotoxin as a 'critical flag/block' while flagging an overexpressed protease active site as 'requires manual/human review' mimicking some current screening approaches. As LLMs improve one could also think of sending it for a review step to one to then send their summary along to a human reviewer. In future presentations, include a 3D structural overlay (e.g., in PyMOL) that shows how your de novo ricin aligns with the active site of natural ricin despite zero sequence homology. Visuals make structural bioinformatics much more accessible to policymakers, the public, and generalist judges etc. Overall, really great work!

Track Winners

OmnyraCloud

Track 3 Winner View PDF

Team: Emilin Mathew | Reviews: 3

Tracks: Track 3

OmnyraCloud is protocol level biosecurity screening for cloud lab workflows. Today's tools screen DNA sequences at order time — but cloud labs run workflows, and a chain of individually benign steps (serial passage, split orders, surface obfuscation) can pursue a dangerous objective without a single flagged sequence. That's the gap we close. OmnyraCloud ingests any lab protocol (Autoprotocol, Opentrons, JSON, or free text) and runs a 5-stage pipeline: decompose the workflow → score 5 risk dimensions → ground every flag in retrieved biosecurity literature → audit with LLM-as-judge → cross-check sequences via IBBIS commec. Output: an auditable risk report with citations, not a black box. IBBIS flagged 1/3 dangerous protocols. Two sequences were screened but returned no HMM matches, but protocol level reasoning caught all three. Protocol level screening isn't just complementary to sequence screening. It's essential. Live at https://omnyra-cloud.vercel.app/

Show Reviews (3)

Review 1

Fantastic idea. Best of luck with Omnyra. As you said, the data size was sufficient for a proof-of-concept. This could have actual legs once a statistically significant number of protocols are evaluated.

Review 2

I'm not convinced that cloud labs are (as yet) a significant area of risk, but this tool might also be useful to CROs. The way that the tool is constructed seems robust and comprehensive, though. Also, just a small note, the table and graph says 'H7N9' but it should be 'H5N1'.

Review 3

The threat model is formally defined, the problem is real and underaddressed, and the retrieval-grounded reasoning chain is exactly the right design choice for a tool that needs to be auditable by human biosafety reviewers. The evaluation is the ceiling. Five protocols; three of the most canonical DURC examples in the literature, two of the most obviously benign controls, is the easiest possible test set. Perfect F1=1.0 on that basis proves the system works in principle, not that it works in practice. The harder questions are false positive rate on ambiguous legitimate research and false negative rate on novel techniques outside the current threat taxonomy. Neither is tested, and both matter more for deployment than catching H5N1 and SARS-CoV-2. The multi-protocol correlation gap is worth flagging beyond future work. An adversary who knows protocols are evaluated in isolation will simply split a dangerous workflow across multiple low-risk submissions. That's not a limitation to address later, it's a fundamental constraint on the current system's deployment value that should be stated more prominently. Solo project, weekend build, live deployment, formally grounded threat model. This is the kind of work that should be developed further.

#11

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Track 3 Winner View PDF

Team: Victor Wong, Avigya Paudel, Syed Mahir Ahamed, Viet Minh Hieu Nguyen | Reviews: 2

Tracks: Track 3

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Show Reviews (2)

Review 1

- Explain why you think we need a bio-specific jailbreaking benchmark? Can't we assume it mostly matches the ASRs on general jailbreak benchmarks? Is there anything that makes us assume it might be different for bio than for other topics? - You could keep the introduction more crisp. The first half-page on general AIxBio advances and policy doesn't add a lot. - A comparison from your results to the results from the Biothreat Benchmark Generation framework and other jailbreaking results would have been interesting to highlight the marginal value add of your work. - Overall cool project, clean implementation and good writing, good job guys :)

Review 2

Tests whether bio-misuse safeguards in frontier LLMs actually hold under adversarial pressure, using four attack methods against four models across 40 bio-specific prompts. Responses are scored on three dimensions: whether the model refused, how technically specific the response was, and whether it was structured as something someone could follow. Results show a stark divide: Claude and GPT maintain meaningful safeguards; DeepSeek and Kimi barely refuse even direct bio-misuse requests. Why it matters: We've had benchmarks measuring what models know about biology and generic jailbreak benchmarks that treat bio as one category among many. What we haven't had is a systematic test of whether bio-specific safeguards hold when someone actively tries to get around them. The finding that Crescendo (gradual conversational escalation) is the most effective attack (and produces the most actionable responses, not just the most responses) has direct implications for how we think about multi-turn risk. What's strong: The specificity/actionability scoring split is the right design for biosecurity (i.e., knowing what reagents to use is a different failure than knowing how to combine them). The prompt construction methodology is transparent and thoughtful. The rubric could serve as a standalone reference for future work. The Western/Chinese model comparison is policy-relevant. What's missing: The automated judge hasn't been validated against human experts. DeepSeek is both a target model and the judge, which introduces self-scoring bias. Kimi is both a target and the attacker in iterative attacks. The prompt set is small and deliberately selected to elicit failures — reported refusal rates shouldn't be read as base rates.

#27

Hydra Watch: Federated wastewater pathogen surveillance with foundation-model embeddings

Track 2 Winner View PDF

Team: Divya Sitani, Mohammed ElSayed, Frida Arrey, Hanna Schutz, Sascha Held | Reviews: 3

Tracks: Track 2

HydraWatch: Embedding-based wastewater pathogen surveillance for federated hospital networks A reference-free, privacy-preserving wastewater pathogen surveillance pipeline for federated hospital networks. Each hospital sequences its own sewershed, embeds reads with DNABERT-2 (768-dim), and trains a local TE-VAE (Transformer-encoder VAE) on the classified pool to define "site-normal." A hybrid score (reconstruction error plus latent Mahalanobis) flags anomalous reads in the unclassified pool, the blind spot where novel pathogens hide because reference-based tools like Kraken2 can't see them. Anomalies are clustered with HDBSCAN and tracked across timepoints. Trajectory analysis flags four patterns: emerging (rising over time, including signals that appear only at the latest timepoint), persistent, transient, and declining. The early-warning signal is anything new or accelerating. Cross-site detection happens by query, not data: a hospital sends a single 768-dim cluster centroid (around 3 KB) to peer sites, who match locally and reply. Raw reads and read-level embeddings never leave the site, sidestepping the data-sharing agreements that typically slow multi-site surveillance. Pilot: three timepoints from one NY hospital sewershed (CASPER PRJNA1247874). One dominant emerging cluster (cluster 6) grew from 285 reads at T1 to 3,506 reads at T3 (×12.3 growth), driving the early-warning signal. BLAST anchoring of representative reads is queued. Scaling: 5 NY hospitals, then Northeast region, then CDC. Same cluster signature at multiple sites equals outbreak signal, ideally surfaced before clinical case counts rise. Stack: DNABERT-2(Hugging Face + PyTorch), TE-VAE (TensorFlow), HDBSCAN, BLAST. ESM-2 multi-view piloted on one timepoint; METAGENE-1 is a clean upgrade path.

Show Reviews (3)

Review 1

Impressive hackathon submission with high potential impact when fully developed and deployed. Submission includes a design, prototype, and proof-of-concept for federated detection of pathogen incidence dynamics via foundation model embeddings of hospital wastewater sequences. Only post-embedding centroid clusters are shared between hospitals, allowing for quick data sharing and tracking of pathogen spread, avoiding delays stemming from data privacy regulations. Additionally, foundation model embedding clusters can identify incidence dynamics of "unclassified" potential pathogens for which reference sequences do not yet exist in current database-based pathogen monitoring systems. The included code repository with submission slide deck help clearly explain the project and results. The gap that is filled by this technology is clear.

Review 2

Reference-free anomaly detection on the Kraken2-unclassified pool (the actual surveillance blind spot) combined with a federated-by-query architecture that never moves raw reads is the most technically sophisticated work in the batch. The ×12.3 growth in cluster 6 is a compelling signal, but the load-bearing next step is BLAST anchoring: an embedding-space trajectory without biological identity is a watch-list item, not a confirmed finding. Swap in METAGENE-1, get the BLAST results, and run one real two-site centroid-query exchange to activate the architecture you've designed.

Review 3

I thought the authors could benefit from demonstrating more clearly why the gap of unclassified samples matters from a biosecurity stand-point. The authors note that approaches like Kraken-2 "works very well for organisms that already have well-sequenced close relatives in the reference database — known pathogens, well-studied commensals, common viruses. It works poorly for everything else." It's not obvious to me what that "everything else" is that presents a significant concern, particularly when the authors acknowledge the approach doesn't identify any specific organism. Even with the comparison against baseline rates, seems like that would potentially introduce a variety of false positives, since it may simply be picking up some new, non-pathogenic bacteria. I also was a bit confused by the pilot. My understanding is the hackathon is effectively for only a weekend; so how did the authors have time to conduct a pilot study over a couple months? I did appreciate the author's discussion and inclusion of implementation details on scaling up multiple levels of surveiliance.

#40

PandemicWatch - Earlier Than The News

Track 2 Winner View PDF

Tracks: Track 1 Track 4

Benchtop DNA synthesizers have democratized sequence generation, bypassing the traditional biosecurity chokepoint of centralized synthesis facilities; however, manufacturing synthesis hardware remains prohibitively difficult for isolated bad actors, so regulating the software on such machines, made by a few companies, remains a promising avenue towards security. We present SynthShield, a pre-screening and logging software that leverages ESM-2 protein language model embeddings to prevent bad actors from synthesizing such malicious sequences. Presently, screening solutions (BLAST, etc.) only compare sequence similarity, missing functionally similar but physically distinct dangerous sequences, a growing concern in the new AI-assisted research regime that allows small teams of bad actors to iterate faster and with greater available attack surface. Furthermore, current software has little public or even governmental observability for synthesized sequences, as protecting researchers' IP remains an important challenge. To tackle this, we pair our screening software with a tamper-evident black box and a public blockchain (Ethereum L2) to record sequence-aware hashes of synthesized DNA, which allows post-hoc identification of bad actors by law enforcement, even for CRISPR-altered bio-weapons that were indirectly created with the help of these synthesizers, without revealing critical information about the precise sequence. In this paper, we evaluate our integrated pipeline and show that the ESM-2 screener achieves AUC 0.977 on a remote homology test set where the BLAST baseline achieves only AUC 0.711, an improvement of 0.266 that directly addresses the AI evasion scenario.

Show Reviews (2)

Review 1

This felt like 3 papers stitched together. the ESM-2 vs BLAST piece is solid and would stand on its own. the blockchain + audit log + 5-attack-class stuff is described but not really tested, which made me trust the rest less. would've been stronger at half the length focused on ESM.

Review 2

The authors developed an ESM 2 screening protocol to detect AI-generated synthetic homologs and added a logging mechanism to their pipeline, as well as an assembly method for split orders. The end-to-end pipeline is multilayered and well executed, even if some version of each step in the process has been previously implemented. However, the study does have limitations, many of which the authors address at the end of the paper. In addition to what was mentioned in the paper, more limitations or suggestions would be: 1) Generating synthetic homolog test sets using other biodesign tools. Currently, ESM2 is mainly designing the training and test sets. I would be curious if the test set was generated with other tools and how that would affect the metrics. 2) Low amount of sequences was acknowledged, and moreover, different functional classes need to be tested with more complex mechanisms and proteins. Using the authors' method, this would require training with a very large dataset, but it may be possible to find efficiencies in a pipeline that would not require as much compute. 3) Using BLAST as a metric. I would have used the free screening tool as a basis of comparison rather than just BLAST

#24

BioGuard: Screening Biological Risk Across Multi-Turn AI Conversations

Track 3 Spotlight View PDF

Team: Jason Tang | Reviews: 2

Tracks: Track 1 Track 3

Most AI biosecurity filters evaluate isolated prompts. This misses the real threat vector: dual-use biological capability is accumulated incrementally across long conversations. BioGuard shifts the screening boundary from the isolated prompt to the continuous conversational state. Tested against a live frontier model (GPT-5.4), we identified a severe safety-utility tradeoff: current frontier models achieve safety via broad refusals that actively disrupt legitimate bioscience workflows (triggering a ~4.5% false-positive rate). In contrast, BioGuard traces Biological Knowledge Transfer (BKT) across entire sessions. Our prototype demonstrates that by isolating multi-turn capability accumulation, we can maintain necessary safety visibility while preserving operational utility for benign scientific research.

Show Reviews (2)

Review 1

This project is a thoughtful and creative approach to expand the biosecurity review to an overall conversation, rather than prompt- or end-product-level screening. The author also makes impressive efforts to make the BioGuard method interoperable and reproducible. I also appreciated the clarity with which contributions and methods are presented. There are several aspects that would improve the project. First, the "Depth" axis of Biological Knowledge Transfer (BKT) encompasses both procedural and tacit knowledge, and it would be useful to understand how each form of knowledge contributes to the scoring procedure. Second, the nature and overall layout of the benchmark used in this study is not yet clear, and would benefit from description in the main text. Finally, while the GPT 5-based filter does experience slightly elevated false positive rate (0.045), the maintenance of excellent recall (1.000) and precision (0.965), versus BioGuard, with recall of 0.289 and precision of 1.000, could be more beneficial in everyday applications. In other words, while the text states that there is a stark safety-utility tradeoff in favor of BioGuard due to elevated GPT 5-based false-positive rate, one could make the case that BioGuard's false positive rate of 0.000, at the expense of lower recall, is a more substantial tradeoff.

Review 2

Proposes monitoring entire AI conversations for incremental biological capability accumulation (what the paper calls Biological Knowledge Transfer) rather than screening individual prompts or final outputs. Each conversation gets scored on misuse relevance, procedural depth, and capability uplift, producing an auditable decision record. Why it matters: This is pointing at exactly the right problem. The capability uplift literature makes clear that dangerous knowledge accumulates across multi-turn interactions, not in single prompts. Current safeguards mostly evaluate messages in isolation. The conversational window in between is largely unmonitored, and that's where tacit knowledge transfer happens. If this worked, it would fill a critical gap in defense-in-depth. What's strong: Excellent problem identification. The decision envelope design (with request IDs, thresholds, anomaly records, and audit logs) is governance-ready infrastructure that would be useful regardless of which detector sits behind it. The paper is clear, concise, and honest about what works and what doesn't. What's missing: The detector catches only 29% of positive cases. That's too low for safety screening. More importantly, the ablation studies show that individual scoring components sometimes outperform the integrated multi-turn system (meaning the aggregation logic, which is the core contribution, is actually making things worse in some cases). The entire evaluation is on synthetic data, which can't test the indirect, contextual knowledge accumulation that the system is designed to catch. The keyword baseline detecting literally nothing raises questions about whether the benchmark is well-constructed.

All Projects

HGT Leaves a Linear Fingerprint in Codon Space

View PDF

Team: Yatharth Maheshwari, Arka Dash | Reviews: 3

Tracks: Track 1

Nucleotide virulence factor benchmarks are inflated by ~0.30 AUROC from organism confounds and gene-family leakage. Under same-strain negatives and gene-family-disjoint evaluation, 64-dimensional codon frequency with logistic regression generalises perfectly to novel genera (gap = 0.006, p = 0.097 NS). The signal is HGT-derived codon usage deviation - linear, genus-invariant, and pretrain-free.

Show Reviews (3)

Review 1

Thanks for your submission! Your thorough quantitative approach here is commendable, and the clearly spelled out limitations and future direction are great. The write-up itself is quite jargon-heavy and a bit on the long side, and I would like to see more discussion of the big picture relevance - what are the consequences of the overestimation, and what should we do about it?

Review 2

This paper is rigorous and the inflation decomposition finding is important for anyone building or evaluating nucleotide-level biosecurity classifiers. The same-strain design and gene-family-disjoint evaluation is an actionable practice for design. The main thing working against it is presentation density. The statistical analyses that makes the science trustworthy also makes the paper hard to absorb quickly. A shorter, punchier framing of the core result up front, with the full statistical components in supporting sections, would let the strength of the conclusion come upfront. The HGT mechanism is compelling but acknowledged as unvalidated and the amelioration-score correlation they describe as future work would substantially strengthen this. The team has demonstrated deep knowledge & thinking.

Review 3

This paper shows that published performance numbers for DNA-level virulence factor classifiers — the kind that could screen raw synthesis orders — are significantly inflated due to two testing mistakes that compound on each other. Once you fix the test design, a simple model that counts codon frequencies is the only approach that actually generalizes to organisms it hasn't seen. The proposed explanation is that dangerous genes acquired through horizontal transfer still carry a subtle "accent" from their donor organism's codon preferences. Why it matters: If you're evaluating nucleotide-level screening tools and relying on published benchmarks, those benchmarks are probably overstating performance by a wide margin. This paper quantifies exactly how much and why. The proposed fast pre-filter for synthesis orders (no GPU, no pretrained model, runs in linear time) is a practical contribution to screening infrastructure. What's strong: Best methodology in the batch. Same-strain controls, family-disjoint evaluation, 20 random seeds, pre-registered analysis, careful statistical reporting. The finding that more complex models consistently overfit while the simplest one holds up is clean and actionable. What's missing: The HGT mechanism is a hypothesis, not a validated result — the title overstates this. They haven't computed performance at the false-positive rates that synthesis screening actually operates at (below 1%). No testing on engineered or codon-optimized sequences, which is what screening actually needs to catch.

Quantifying the Reconstruction Gap: A Dataset Bottleneck Analysis Framework for AI-Era Biosecurity Screening

View PDF

Team: Finomo Awajiogak Orom | Reviews: 2

Tracks: Track 3 Track 1

Dataset Bottleneck Analysis (DBA) — Project Summary Biosecurity screening removes dangerous biological sequences from public databases, but a critical question remains unanswered: does removing those sequences actually prevent an AI-equipped adversary from reconstructing them using what remains? DBA is an open-source framework that answers this question empirically. We introduce a Redundancy Score (R ∈ [0, 1]) that measures how much of a restricted sequence set can be reconstructed from the public corpus. Applied to 4,844 real UniProt Swiss-Prot proteins with a cluster-aware split, DBA reveals a striking result: while BLAST-style k-mer screening achieves R = 0.064 (0% of sequences recoverable at ≥ 0.90 similarity), ESM-2 protein language model embeddings achieve R = 0.847 — 13.2× higher — with 95.5% of restricted sequences recoverable at the same threshold. This is the AI threat multiplier: the factor by which language-model-aided adversaries exceed the reconstruction potential assumed by sequence-identity policy. The most alarming finding is the toxin experiment. K-mer screening makes toxin proteins appear 64% safer than average (R = 0.023), creating a false sense of security. ESM-2 reveals the opposite: toxin ESM-2 R = 0.873 (98.6% coverage), exceeding random proteins (0.847) and exposing a 32× gap between what sequence-identity screening assumes and what a language model adversary can actually recover. DBA runs end-to-end in under 22 minutes on a laptop CPU with no GPU required. It is designed as a pre-deployment audit tool for screening programme designers: run it on your proposed screening category before setting thresholds, or you may be calibrating against the wrong adversary.

Show Reviews (2)

Review 1

Very interesting approach. The intersection of new protein language models and other biodesign tools with existing screening controls has not received much prior attention, to my knowledge. As the researchers reveal, this is an oversight because existing screening selection and calibration tools might give a misleading threat picture when considered in the context of new BD tools like ESM-2. This is an important vulnerability and the recommendation to use ESM-2 over k-mer is novel and valuable. Perhaps even more valuable is the R measure, which can be reapplied as new AI tools are released, in order to update screening selection. The project is well-thought out and well-executed. One thing that would make it a little stronger in terms of presentation would be to make the connection between the R measure and the recommendations clearer (especially for non-bioinformatics people like this reviewer). They might also provide general guidelines for applying this measure in future. Overall, am excellent project and valuable contribution to AI security.

Review 2

The core finding is genuinely striking and easy to grasp which is that current screening doesn’t just underperform against AI-equipped adversaries, it could actively mislead. The framework is lightweight enough to actually get used. The weakness is that the central claim, that sequences scoring above 0.90 similarity in embedding space are recoverable, is asserted rather than demonstrated. That equivalence is doing a lot of work and it’s not obvious it holds for the properties that actually matter in a biosecurity context. The experiments also run on generic protein databases rather than the sequences that real screening programmes actually restrict, so the jump to a policy recommendation is a bigger leap than the paper acknowledges. One concrete fix would be to show that high ESM-2 similarity actually predicts functional equivalence for at least one relevant property, whether that’s toxicity, receptor binding, whatever is available. Without that the policy recommendation sits on an assumption.

Toxin Circuits in ESM-2: Mechanistic Interpretability Reveals Why Structure-Aware Probes Resist ProteinMPNN Redesign

View PDF

Team: Manan Wadhwa, Shivam Dubey | Reviews: 2

Tracks: Track 1 Track 3

Background & Problem Standard biosecurity screening uses sequence identity (BLAST). ProteinMPNN redesigns toxin sequences below every BLAST threshold, achieving 0% detection across 723 redesigns. Proposed Solution & Mechanism A linear ESM-2 probe maintains 93.9% detection with no retraining. Using interPLM Sparse Autoencoders (SAEs), 50 features are identified at 205× compression that explain probe performance. These features are amplified by redesign (mean transfer ratio 1.28) because ProteinMPNN preserves structural fold topology—precisely what the circuit encodes. Security Analysis & Evaluation A four-tier attack taxonomy reveals the security boundary lies at gradient access: ProteinMPNN (6.1% evasion) vs. white-box attacks (100%). Direct Probe Attribution identifies layer 32 as the bottleneck (r = 0.992 redesign–toxin circuit correlation). SAE-based probes recover 38% of “Double-Evaders” that fool both BLAST and dense linear probes, demonstrating direction-sensitive detection beyond Euclidean boundaries. Discoveries & Conclusion Zero-shot scanning discovers 248 UniRef50 candidates enriched 4.75× for secreted signal peptides, including cross-kingdom fungal effectors (54% are currently annotated as “Uncharacterized” in UniProt). The probe’s security guarantee equals the privacy of its weights.

Show Reviews (2)

Review 1

Very interesting results and good progress for a hackathon weekend! I think people already suspected that pLMs could be quite helpful for that, but nice to see these numbers. I mostly wonder how much that changes with shorter sequences, though. This seems to be the crux, at least for me.

Review 2

An overall very strong effort. The comparison to BLAST screening is compelling and well elucidated, the demonstration of utility of a simple linear probe on a frozen model is motivating, and the variety of experiments probing the nature of this detector are mostly compelling. However, the work would be improved via further consideration of what it means for the detector to be vulnerable to a "white box" gradient attack, and the writeup suffers from some internal inconsistencies (e.g. caption vs content in figure 1, different assertions about the number of double-evaders).

View PDF

Team: Kailer Laino | Reviews: 2

Tracks: Track 1 Track 3

Screening tool that uses LLM embeddings to determine if a protein sequence is functionally similar. Performs much better than BLAST and shows that the embedding space for models trained on proteins seems to be very related to the space in which proteins are functionally similar.

Show Reviews (2)

Review 1

The project correctly identifies that screening based on sequence homology has large blindspots, and that embedding-based screening could solve some of these. I think it's a useful proof-of-concept. I thought there were three main things that don't quite work. First, the variants were generated by conservative amino acid substitutions within biochemically similar groups. ESM-2 was trained on such sequences because this is what happens during evolution. So, ESM-2 giving high cosine similarity to these variants is largely a consequence of experimental design, not evidence that it detects function. So, I don't think this is a great proxy for function, given how you designed sequences. You have shown that ESM-2 detects biochemical similarity, which is almost guaranteed given how the variants were made. It would be cool to see how a generating variants with ProteinMPNN affects ESM-2-based detection. Second, you don't have specificity as a metric. Sensitivity alone doesn't mean much for a detector. Third, your baseline is a bit too low, a more realistic one would have been to run the sequences through commec.

Review 2

Using protein language models to generate measures of similarity between sequences seems like a natural and sensible approach to me. (I am not an expert in DNA synthesis screening). I can imagine it fitting into a wider set of algorithms run as a part of the screening pipeline, and that it would improve accuracy. I found the write-up quite clear, and appreciated the effort put into validation and empirical results.

#23

Activation Probes for Synthetic Toxin Variant Detection

View PDF

Team: Maxwell DeFanti, Ishaan Panigraphi, Kevin Zhang | Reviews: 3

Tracks: Track 1

DNA synthesis screening prevents bad actors from obtaining the physical sequences needed to produce dangerous toxins and pathogens. However, current screening tools like BLAST and SecureDNA rely on sequence similarity to known threats, and recent work has shown that AI protein design tools such as ProteinMPNN can generate functional toxin variants that evade these screens at rates approaching 100%. We introduce a screening approach that trains an activation probe on ESM-2 embeddings to recognize toxic function across diverged sequences; on held-out synthetic variants that are ~40% identity to their parents, our classifier maintains 86.7% recall while BLAST recall collapses to 46.7%. This provides initial evidence that protein language model embeddings can be a robust second layer of defense for DNA synthesis screening, complementing current similarity-based methods.

Show Reviews (3)

Review 1

You have generated important new knowledge and contributed to the challenge of moving from sequence-based to function-based screening. I loved the discussion on the importance of lab experimental confirmation and the challenges it brings. I would like to have understood a little but more about how the time and compute factors limited your work. I think understanding the limits of what such approaches might achieve, and connecting that to resource availability adds another dimension to this interesting and important challenge.

Review 2

Toxins/toxicity are a really important focus area. I fully agree with the importance of more advanced synthesis screening approaches, and testing utility of ESM-2 embedding may help drive toward function-based approaches that we need. Results seem reasonable and clearly reported. If authors pursue this further, I would suggest continuing the thread on toxins as equally important to the task of applying the approach to bacteria and viruses (as they are publicly noted in terms of potential bioweapons activities by certain countries in the latest State Department treaty compliance reports). Appreciate the focus on ESM-2 rather than other open models for which potential risks/info hazards are less well explored to date.

Review 3

This addresses the right problem at the right time and the pipeline design is genuinely thoughtful, particularly the cluster-aware evaluation and the ablation showing ESM-2 embeddings carry enough functional signal to generalise without synthetic training data. The central weakness is that the headline results rest on 15 sequences per divergence level, which means the recall numbers could shift substantially with a handful of different outcomes. Acknowledging variance across runs is honest, but it also undercuts confidence in the specific numbers the paper leads with. The false positive rate tripling relative to BLAST also needs more engagement, because in a real screening deployment that’s the number that determines whether providers actually adopt the tool owing to operational costs. Scaling up the synthetic evaluation set and stress-testing the false positive rate at operationally realistic thresholds would turn this from a promising proof of concept into something deployable.

#25

Towards Hardware-Governed Benchtop DNA Synthesizers

View PDF

Team: Oraya Srimokla, James Petrie | Reviews: 3

Tracks: Track 4

Benchtop DNA synthesizers may soon enable bioweapon synthesis in individual labs without hardware-enforced controls. We propose a hardware design with three layers of defense: sequence screening, a regulator signature the device refuses to run without, and physical monitoring of the synthesis process. The first two reuse hardware primitives from AI chip governance. The third is novel, and addresses an attacker who submits a benign sequence and physically tampers with the device to produce a hazardous one instead.

Show Reviews (3)

Review 1

With the increasing accessibility of benchtop synthesizers, knowing how to mitigate the synthesis of sequences of concern is vital. This paper gives concrete recommendations for screening and authorization in a benchtop synthesizer and how to prevent tampering of the synthesizer. The recommendations against tampering were interesting and well worth consideration. However, I found the recommendations for screening and the requirement for the regulator to pre-approve each sequence to be impractical. The vast majority of sequences are benign, and having them require approval would be a huge burden. Rather, sequences with high homology with sequences of concern should be pre-authorized only. There was also little data to validate the approaches. Recommendations on how the screening tool can be updated without tampering and how it would handle AI-generated oligos would have been useful as well.

Review 2

I found the problem and framing to be quite good in presentation. There was some jargon used from the hardware/AI/cyber-security side, but I think most could follow. I thought the connection to printer ink cartridge authentication was a clear example, and I think more analogues or exemplars could have been used in other proposal areas. Since this was an evaluation and under a time-limit, the designs are understandably conceptual, and Figure 1 provides useful context. However, an additional diagram(s)/table(s) illustrating the broader threat landscape and potential failure points would have been nice. One area that was neglected is how calibration, maintenance protocols, and service contracts would fall into this design (ex. how might calibration drift affect pipetting or volume-detection and then be adjusted). This would likely be an area where security could be firmed up, but it would be useful to identify where it may provide failure-points or access to bad-actors.

View PDF

Team: Jayani Srinivasan | Reviews: 2

Locus automates researcher credential verification for DNA synthesis screening by mapping ORCID publication profiles to NCBI taxonomy identifiers. A novel trajectory feature detects deliberate biological capability acquisition that static credential checks cannot capture.

Show Reviews (2)

Review 1

The Chrome extension functions well and behaves as intended. Both the report and the demo video are clearly and professionally assembled. The concept of using ORCID and PubMed to automate KYC credential verification for DNA synthesis customers is reasonable, but it comes with notable limitations, many of which the report already acknowledges. In practice, this approach verifies only a narrow subset of customers (primarily academic researchers), who are also among the least likely groups to misuse synthetic DNA. In academic environments, orders are often placed by lab managers or core facility staff who may have no publication record at all. For such cases, incorporating institutional email‑domain verification could strengthen the workflow. Additionally, individuals with only review, commentary, or perspective articles on the listed organisms, without any hands‑on research experience, would likely pass the current screening, which weakens its effectiveness. The implementation also appears to misunderstand the split‑order problem. The issue is not when someone orders DNA fragments from multiple listed organisms; rather, it arises when a customer orders different segments of a gene or genome from the same listed organism in separate transactions. Overall, this is a good and creative attempt with some novelty, but its impact is limited by the constraints of the proposed KYC solution relative to the broader challenge.

Review 2

Locus presents a browser-based tool for automating researcher credential verification in DNA synthesis screening by linking ORCID profiles to biological taxonomy . The core idea is relevant for AI-biosecurity but the current system is somewhat simplified. In practice, robust KYC-style verification would likely require richer and more diverse data sources beyond taxonomy alone. Also, it may struggle with edge cases such as generalist researchers, sparse publication records, or non-academic actors. That said, execution is good for a hackathon project, including the video demonstration.

#36

OSINT BioGovernance Tool

View PDF

Team: Zhamilia Klycheva, Edidiong James, Erik Leklem, Shubhankar Dharmadhikari | Reviews: 3

Tracks: Track 3

This project proposes tools for faster identification, classification, and response to AI biosecurity “warning shots”—near-miss events that reveal catastrophic risk—so as to translate those events into governance action by relevant policy, intelligence, and other actors. We use a formal definition of warning shots and associated criterion to analyze 21 global biosecurity events, finding that warning shots are typically recognized but rarely converted into binding governance. We propose analysts and governance actors use a Governance Conversion Framework to improve future AI biosecurity warning shot responses. We develop an AI biosecurity event dashboard and set of analytic tools to help biosecurity stakeholders monitor world events, categorize emergent biosecurity risks, and trigger faster response with relevant government and industry actors.

Show Reviews (3)

Review 1

Innovation There is a genuine gap here: no public-facing warning-shot dashboard exists specifically for AIxBio, and analysts and policymakers do need a structured way to triage emerging biosecurity signals against historical patterns. You correctly identify this gap and build something concrete to address it. The combination of a warning-shot classification + risk score + governance-conversion lens is a reasonable contribution package, and the headline insight (that the bottleneck is conversion, not detection) is a useful framing. However, you do not engage with relevant prior literature. The Institute for Security and Technology released an AI Loss of Control Risk: Indications & Warning framework in February 2026 that uses the same intelligence-community I&W methodology you draw on, with a five-level severity scheme. It touches a different set of threat scenarios (AI loss of control, not bio), but it is the closest direct analog to what you're building and should be cited. More fundamentally, your Governance Conversion Framework is a domain-specific application of focusing-event theory in political science (Birkland's After Disaster, 1997, and 30+ years of follow-up work; Kingdon's multiple-streams framework). Your "Stage 3 stall" finding is a special case of what focusing-event scholars have been documenting for decades. Near-miss management literature in industrial safety, aviation, and mining is a third unacknowledged precedent (e.g., MSHA's quarterly near-miss reporting mandate is a working example of the governance infrastructure you say is missing). Adding a paragraph that situates GCF within these traditions would substantially strengthen the contribution. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2300441 for a starting point. Execution Your Risk Score is overcomplicated. Score = (S − 5) / 20 × 100 simplifies to (S − 5) × 5. Either justify the form or simplify. Relatedly, if the post-hoc step is to subtract 5, why not score each pillar 0–4 to begin with? Justify those choices, or don't introduce them. You then describe the methodology as if the score were more granular than it actually is. With five integer pillars in [1, 5], the composite has at most 21 distinct possible values, all multiples of 5 after normalization. Your priority band "Low (0–29)" therefore contains only 6 reachable values, not a continuous range. Reporting the score as a 0–100 number implies a level of resolution that the underlying scale does not support, justify the choice. A bigger issue with the scoring is that it is ordinal and you do not justify how the ordinal scale corresponds to actual quantitative risk. Adding ordinal categories is mathematically problematic. Expert validation is out of scope for a hackathon weekend, but the paper would benefit from at least acknowledging the measurement-theory limitations and discussing methods of addressing them in future work. See e.g. https://arxiv.org/pdf/2103.05440 for the standard reference. The C1–C5 criteria are vague in places. What "epistemically accessible at the time" means in a coding rubric is not operationalized, what counts as accessible to whom, on what evidence base, judged at what time horizon? Standard SOTA for qualitative classification frameworks is to report inter-rater reliability (Cohen's κ or Krippendorff's α) across at least two independent coders. You don't do this, for understandable reasons given the time budget, but this should be explicitly listed as a limitation rather than left implicit. Your case selection methodology is not described in enough detail. The Data Collection appendix lists source types (peer-reviewed literature, institutional reports, investigative journalism, government records) but not the inclusion criteria, search strategy, time bounds, exclusion rules, or selection process. You acknowledge "potential selection bias" generically, but don’t describe how borderline cases were handled. Were you systematically searching or working from familiar examples? Without this, readers cannot assess whether the headline pattern reflects the world or your sampling. You should state explicitly whether the dataset is intended as exhaustive, illustrative, or convenience-sampled. Your headline finding that most cases stall at Stage 3 rests on counts in a hand-curated sample of n=21 (and the GCF stages themselves were derived from n=13). There is no statistical treatment of this claim, no confidence interval, no comparison against a base rate, no engagement with the focusing-event literature where this same pattern has been studied at much larger scale. The conclusion states "as AIxBio risks continue to converge and accelerate, the cost of stalling at Stage 3 will only increase." Cost is never assessed anywhere in your framework, your score measures risk, not expected cost of governance failure. Either add a cost-of-inaction component (or even a placeholder for one), or rephrase the conclusion to match what your tool actually measures. You write that "Figure 3 illustrates the distribution of cases across time, tier classification, and risk level, showing increased clustering in recent years." Recent-years clustering is exactly what you'd expect from any non-exhaustive OSINT-curated dataset (recent events are better-documented and more salient to curators). If you want to claim a real temporal trend, you need either an explicit exhaustiveness argument or an analysis that controls for differential discovery rates across decades. Presentation The paper itself is clear, well-organized, and easy to follow. Section structure is logical, the three-stage pipeline (OSINT scanning → risk scoring → GCF assessment) is communicated cleanly in Figure 1, and the writing is at a level appropriate for a policy audience. Figure 1 writes the composite formula as S = H · E · C · V · DR, which parses as multiplication, while Section 3 of the text gives S = H + E + C + V + DR Figure 1 is also not very readable, especially being placed before the part of the paper that explains it. Accessibility of the dashboard is poor. The low-luminance green-on-black text appears to fail WCAG AA contrast minimums in several places, and color carries a substantial semantic load (yellow = high, green = pass, red = critical, dim green = inactive) without redundant text encoding, which loses information for the ~8% of men with red-green colorblindness. See https://www.w3.org/WAI/WCAG21/Understanding/contrast-minimum for the standard. Adding a high-contrast mode and redundant text labels alongside color codes would address both issues.

Review 2

Covers an important topic in AI x Bio risk, which is recognizing early warning signals and then acting upon them in a way that mitigates future risk. Dashboard works very well for what it sets out to do and is useful as a means of getting an overview of historical early warning. As an educational (or even eventually a research) tool it is a great addition. However, it is in the governance aspect where - as revealed by the project - the key deficits lie. I am not convinced overall that a dashboard, even one that logs governance failures in responding to early warning, gets at the crux of the problem. This is because I do not believe that the problem is one of insufficient awareness on the part of authorities, but rather insufficient prioritization and over-politicization of biosecurity risks. I do not see a dashboard - even a very attractive and user-friendly one - doing much to mitigate these obstacles. Maybe around the margins, because they allow a clear and concise story to be told, but I do not think this is going to do much to move the needle on actual government response to risk. Overall, a well-executed project with strong informational and educational value, but it is unlikely to directly contribute much to decreasing biosecurity risk.

Review 3

I will mostly focus on the dashboard and some of the classifications I looked at in detail. In principle, I believe warning shots can be a really powerful motivator for policy changes and I do fear that AIxBio will only be taken fully seriously once an actual misuse event happens. That being said, I am a little doubtful of some of the scores, e.g. why various flu transitions and spillovers are high-risk AIxBio warning shots. I agree that those events are high risk, but they also don't particularly demonstrate a vulnerability that was previously unknown, or are AIxBio related. The website can be very nice as a resource for biosecurity researchers searching for more context and rough assessments of different historical and ongoing events, but I am a little doubtful that it can serve as a policy-oriented platform for warning shots.

#37

Semantic Naturalness Predicts Monitor Evasion in Biosecurity LLM Gatekeepers

View PDF

Team: Arielle Berthe | Reviews: 3

Tracks: Track 3

This work examines the extent to which a safety monitor for large language models (LLMs), configured for biosecurity, can be evaded using ordinary natural language, without adversarial prompting or technical expertise. We evaluate a monitor deployed via API across five bio-relevant topics under three controlled conditions: direct requests, single-turn contextualized queries, and multi-turn conversational sequences. The results show that detection behavior is strongly associated with the semantic orientation of the evaluated request. Operationally framed queries are consistently flagged, whereas descriptively framed queries, particularly in academic contexts, frequently evade detection. In multi-turn settings, sequences composed exclusively of descriptive turns achieve systematic evasion within the evaluated environment, even when they produce technically detailed outputs in the full assistant setting. Based on these findings, we introduce the concept of semantically mediated distributed extraction, in which no individual turn triggers detection, but the overall interaction produces sensitive information. We also document a divergence between API-based monitoring and full-model behavior, suggesting that architectural factors may influence robustness against context-based evasion strategies. We propose incorporating semantic orientation, distributed extraction, and cross-environment divergence into evaluation frameworks for AI safety systems. These results highlight the relevance of semantic and interaction-level dynamics in the analysis of monitoring mechanisms.

Show Reviews (3)

Review 1

Great job and interesting project choice! Your writing is verbose and some sections are repeated. That being said, most sections are necessary and very clear & understandable. Your work also formally identified the effect of something many biosecurity researchers probably felt was important, paving the way for better safeguards down the line. Send this to biosecurity researchers you think might benefit from knowing about it!

Review 2

Clean A/B/C design. The monitor claim is custom prompted Claude API binary classifier, so I would suggest either reframing the claims of biosecurity LLM gatekeepers or add a second monitor (e.g. Llama Guard). I also suggest strengthening the description of browser evaluation by providing some kind of structured rubric or even a coarse one.

Review 3

The A/B/C framework is more disciplined than typical jailbreak demonstrations and the complete-separation result in the logistic regression is a cleaner empirical finding than most evasion papers offer, but the work's framing travels further than the evidence supports because the monitor evaluated is a strawman. A one-line system prompt asking for FLAG or PASS is not representative of production biosecurity monitoring, which uses constitutional classifiers, multi-stage pipelines, or fine-tuned detection models, and showing that a minimal binary classifier fails on descriptive academic framing tells the reader something about that specific configuration rather than about LLM monitoring generally. The architectural divergence observation compounds this, because the comparison is between the author's custom classifier and the full Claude assistant in the browser, which are not two monitoring architectures but a strawman classifier and a production system doing different things, so the inference that model sophistication moderates evasion is not actually supported by the comparison. The most valuable single revision would be evaluating the same framework against at least one more sophisticated monitor (the Sharma et al. constitutional classifier the author already cites is an obvious candidate), which would either generalize the finding or sharpen it into the architectural result the paper currently gestures at without testing.

#38

Blindspots - Atlas for Testing Visibility and Local Conditions

View PDF

Team: Mushraf Ali Anver | Reviews: 3

Tracks: Track 2 Track 3

Pandemic response depends on accurate detection and measurement. However, testing data has blind spots or poor visibility in many areas. During the critical early pandemic stage, predicting which communities and neighborhoods have blindspots can help control pandemic spread. Missing tests also matter in determining population antibodies, group vulnerability and several other pandemic measures. In contrast to extensive work on pandemic mortality, the data of who takes tests and who does not is just as relevant but neglected in Pandemic management. This project builds a prototype pandemic response atlas using Covid-19 France and the greater Paris area as a case study. The prototype is an R Shiny web application that provides three map layers: COVID testing visibility across Île-de-France, IRIS-level socioeconomic conditions, and a high-resolution 200m Paris socioeconomic layer. Its main contribution is predicting better public-health surveillance interpreted with local socioeconomic context. Based on socio-economic conditions, policy makers should then better preventively protect the blindspot areas.

Show Reviews (3)

Review 1

The submission primarily deals with tracking testing and potential areas of testing blind spots in conjunction with socioeconomic indicators. This is of secondary importance for pandemic early warning, as it related to the detection and tracking of an outbreak already spreading and identified (hence the availability of targeted tests). While the methods seem robust, and it's true that low testing can mask the presence of a pathogen, particularly when better-off areas are testing more, the proposal is not very novel, as undertesting is a well understood phenomenon. The development of a dashboard to track this is potentially useful, although very much a proof of concept as develop here.

Review 2

- Including a concrete example from a takeaway or learning you had from your particular Île-de-France data would have added some value to show an example for the practical takeaways this approach can generate. - Some quantitative estimate of the degree of the discrepancy in testing and socioeconomic environment and the results on morbidity and mortality would be interesting to prove the scale of the problem.

Review 3

I think that the project is right to point out that there is lots of variation in testing and that matters for ID surveillance. I don't personally know the extent to which considerations like this were properly accounted for e.g. during COVID-19. As the author notes, this could be important for pandemic management (more-so than early warning, in my opinion). The project doesn't make claims about being validated etc. (and notes this as an opporunity for future work). To go from these maps to making decisions I agree that decision makers would want to see some kind of validation. For now this is more of an exploratory tool. I think in terms of general applicability for pandemic preparedness I wonder about the data streams (testing uptake) and how available they would be in a novel outbreak.

#39

PROJECT MOSAIC: Defensive Protein-Aware Screening for Benchtop Synthesizers

View PDF

Team: Abubakar Abdulfatah | Reviews: 2

Tracks: Track 1 Track 4

Project MOSAIC provides an open-source, Protein-Aware Mock Screener tool designed to defend against "context-scrubbed" multi-agent LLM workflows. We demonstrate that while simple DNA-level screening (Hamming distance) can be trivially evaded by generating synonymous codon substitutions using commercial LLMs (yielding 3 unique evasion payloads across 9 tested trials), our Layer-2 Protein Translation Screener catches 100% of these adversarial payloads by examining the translated amino acid homology, providing a robust defense against synonymous-only substitutions. This project explicitly models how malicious users might split a dangerous request into benign subtasks across different models, and provides the defensive code necessary to catch the resulting obfuscated sequences before they reach physical synthesis.

Show Reviews (2)

Review 1

I appreciated an attempt at a red-teaming pipeline and found the paper relatively clear to read. The use of multiple models in its pipeline was also appreciated. However, the obfuscation techniques and screening strategies proposed are not novel compared to state-of-the-art screening tools. In addition, while the paper proposes defensive screening for benchtop synthesizers, the bulk of the paper focuses on its red-teaming and screening strategies and not enough on implementation methods specifically into benchtop synthesizers. I would either hone in on that implementation or on novel red-teaming and stress testing strategies that can be done, building on methods used by current screening tools.

Review 2

It’s an interesting project that encompasses red teaming, tool development, and policy recommendations. The “multi-agent context-scrubbing” attack seems to be clearly described. The end-to-end approach here is very good, in that it starts with a vulnerability, exploits it, defends against it, and ties it back to policy. Empirically, the project has good design that includes the control group and uses cross-model testing on several frontier models. Defensively, it gives an applied screening argument which is further improved by using an open-source protein-level screen. Policy-wise, it talks about the risks of benchtop synthesizers which are highly relevant and one of the main bottlenecks of DNA synthesis screening, and the project presents somewhat useful policy guidance on this topic. Where it could be stronger: Novelty varies, where the biological principle that screening at the protein level works better than screening at the DNA level is common knowledge in industry and the use of HMMs in screening (commec uses it). While the project sometimes overstates the novelty of the work, claims such as "absolute mathematical guarantee" are quite hyperbolic, considering the authors' admission of non-synonymous evasions and structure-preserving mutations. The experiments are constrained by the choice of a benign sample (GFP), lack a test on more recent synthesis companies, modern screening pipelines, or even more sophisticated attacks.

#40

Function Over Sequence: Empirical Evaluation of Protein Language Models for Biosecurity Screening

View PDF

Team: Saahir Dhanani | Reviews: 1

Tracks: Track 1

DNA synthesis screening is a critical biosecurity chokepoint, but current tools detect dangerous sequences by similarity to known threats — a paradigm that collapses against AI-assisted protein design. Using ProteinMPNN and ESMFold, we generated 12,000 evasion variants of 50 known toxin proteins across 12 sampling temperatures, producing sequences with as little as 11% mean identity to known toxins. We evaluated ESM-C 600M protein language model embeddings as a function-aware alternative against BLASTp and commec baselines. At T=1.5, BLASTp detects 3.5% of variants and commec detects 2.9%, while ESM-C kNN maintains 79.5%. Among variants structurally predicted to retain wild-type function, commec detects 0% and BLASTp detects 3.8%, while ESM-C kNN detects 100%. We additionally identify organism-matched negative construction as a necessary methodological requirement for honest evaluation in this space, showing that naive dataset construction inflates AUC by up to 0.016 and FPR by 58 percentage points.

Show Reviews (1)

Review 1

I love that you have taken the next step in answering this question. Your approach builds on past knowledge but produces important additional insights. I was also glad to see that you map out clearly how your work impact our knowledge based and how it impacts necessary future research. I think you answered an important question and have directly contributed to efforts to strengthen synthesis screening against AI-base circumvention. It is a shame that you did not / were unable to have tested this against the full commec installation. It would have further increased the saliency of your work and increased the knowledge generated

#41

Biorisk-gym: A new direction for “rule-in” AI model evaluations for biosecurity

View PDF

Team: Aditya Kasarla, Aanika Dalaal, Kailer Laino | Reviews: 3

Tracks: Track 3

Accurately measuring the biosecurity risks of rapidly advancing AI models is a critical security challenge. Current benchmarks rely on static, multiple-choice tests that act as "rule-out" evaluations, failing to capture realistic, multi-turn adversarial interactions. Conversely, high-fidelity human uplift studies are accurate but take far too long to match modern AI deployment cycles. To bridge this gap, we introduce Biorisk-gym, an automated, dynamic framework prototyping scalable "rule-in" evaluations. Our approach utilizes a three-agent architecture—an adversarial Gamemaster, a Target model under evaluation, and a Judge—to simulate multi-turn escalation across six stages of biological threat creation. We evaluated Claude Haiku 4.5, Claude Sonnet 4.6, and Llama 3.3 70B across 12 distinct attack scenarios. Results reveal stark, model-dependent vulnerabilities: Llama 3.3 70B demonstrated the highest mean peak biorisk uplift, whereas Sonnet 4.6 exhibited robust safety mechanisms. Crucially, nearly 40% of scenarios reached their highest threat scores on later conversation turns, demonstrating that single-turn evaluations systematically underestimate an AI's vulnerability to sustained adversarial pressure. We recommend next iterations of Biorisk-gym that aim to provide a powerful pre-release methodology for identifying dangerous latent capabilities. Finally, because these automated evaluations generate highly sensitive attack protocols, we propose a secure, tiered deployment roadmap to responsibly mitigate the resulting infohazards.

Show Reviews (3)

Review 1

The authors present an important perspective on the need for rule-in evaluations that are more efficient than human uplift studies, and that incorporate multi-turn dynamics. While the three-agent architecture (gamemaster, target, and judge) is in accordance with canonical approaches from literature, we are appreciative of the work. Two minor comments are that the Virology Capabilities Test includes the grading of open-ended responses with a rubric, and that limitations regarding compute credits are understood. More substantial adjustments that could benefit this work include precise descriptions of the experimental design. For example, how were specific personas selected for testing? Why were the Claude Haiku 4.5, Claude Sonnet 4.6, and Llama 3.3 70B models selected for evaluation? How is language model engagement measured? Also, the metrics used for evaluation require revision - for instance, the sum of specificity and accuracy should be justified or alternatives should be considered.

Review 2

This project produces a framework for multi-turn model assessment, elegantly quantifying a known phenomenon by showing that a significant fraction of scenarios escalate beyond the initial response. The extensive limitations and future work sections demonstrate real thoughtfulness on the part of the authors and should be commended. I would like to congratulate the team for this piece of work, well done! Accurate scoring is the hardest part of this pipeline, and the three criteria used (engagement, specificity, accuracy) only partially capture how much uplift a response actually provides. Additional criteria such as detail, actionability, and potential for harm would better reflect real-world threat models. The reasoning behind the score formula could also be made clearer. For instance, how would scores differ between a full virus production protocol with no numerical parameters versus one with all the details? Calibration against human scoring would help reveal where model judgments diverge. As acknowledged in the limitations section, scoring is bounded by the judge model's capability ceiling, self-preference bias, and refusals. These represent key challenges of the framework and are worth exploring further. The finding that biorisk assistance capabilities differ between models is well expected, so characterising the qualitative differences would add significant value to the findings. While limited time and infohazardous content are acknowledged, additional details such as a product demo, a benign example, the distribution of scoring, or high-level information on parameters like persona and scenario would strengthen the proof-of-concept.

Review 3

The rule-in vs rule-out framing is exactly right, and the finding that nearly 40% of scenarios peaked on later turns rather than turn 0 is the most policy-relevant empirical result in this cohort. Single-turn evals systematically miss real risk, and you've demonstrated that with real models. The ceiling right now is judge reliability: Haiku 4.5 evaluating Haiku 4.5 conflates escalation with same-family self-agreement, and the expert-grounded rubric path you outline is the right approach.

#42

Automated Causal Graph Extraction and Value-of-Information Prioritization for AI Biorisk Modelling

View PDF

Team: Douw Marx | Reviews: 2

Tracks: Track 3

Quantifying risk at scale requires defining and prioritizing hundreds of causal paths to harm. I present an automated pipeline that (i) extracts causal chains from a single source document with an LLM, (ii) collapses near-duplicate nodes using embeddings and paired merge-proposer / merge-validator LLMs, and (iii) elicits Beta and PERT priors per node. Nodes in the risk model are then ranked by betweenness centrality, Birnbaum importance, and Expected Value of Partial Perfect Information (EVPPI), after Monte Carlo sampling. The pipeline is demonstrated on biorisk and serves as a proof-of-concept that LLMs can prioritize risk and indicate where new evaluations would most change downstream decisions.

Show Reviews (2)

Review 1

The problem area is neglected - we need more rigorous tooling for intervention prioritization in AIxBio, and this work addresses that gap. However, the intended audience and theory of change are unclear from the paper; there are some mentions of it in the limitations and future work section, but it should be clear for the beginning who this tool is directed at, how it should be used, what problem it solves and what the downstream implications are. “Granularity” is undefined (although the author notes this explicitly, which is appreciated). The paper would benefit from an operational definition and a sensitivity analysis. Using LLMs to elicit priors is risky, since LLMs are often overconfident and can produce numbers detached from reality, although I did not read the AutoEllicit paper and I can see how this is a necessary choice given the scope of the project. However, Gemini 2.5 Flash Lite is a weak choice for this, since it’s a step that is genuinely difficult for LLMs, which is why a frontier model (or several) should be used - I understand the budget constraints, but the paper would benefit from an acknowledgement here. Connected to that, I have a problem with how P(X | any parent active) is a single Beta but the actual conditional probabilities could vary substantially across the parents.. Either model parent-specific conditionals, keep the structure as a forest, or argue why the merged single-parameter approximation is acceptable. As written, the merge step invalidates the elicitation model. This is the most serious methodological issue I have with the paper. Standard betweenness sums over all s ≠ t, but in a risk DAG, only source-to-outcome paths are semantically meaningful. A source-to-outcome restricted betweenness (or flow betweenness) would be more appropriate. Notation is not properly introduced in the paper, which makes it hard to read and sometimes confusing, especially with some indexes colliding. Figure 2 is unreadable - the caption says "zoom for node labels", but those labels are unreadable at any zoom level. This is the only chance the reader has to see what the graph contains, especially with the pipeline being built in a way that does not allow easy replication.

Review 2

- A limitation for discussion is that biorisk modeling can be significantly based on classified information or infohazards, which can reduce the overall applicability. - Expanding towards taking in multiple sources. In the real world, multiple sources would inform overall evaluation priorities. I was not quite sure from the methodology if that'd be possible. - Explain why evaluation prioritization is a current bottleneck in biorisk reduction. - A paragraph on how plausible your results were would have been helpful. Do the recommendations broadly align with overall evaluation priorities? You could discuss how human graders would judge the outputs of the automated scoring and the plausibility. - A control condition where you compare this method to asking an AI Agent directly to extract the information and differences in outcome would have been interesting.

View PDF

Team: Taggart Tufte | Reviews: 2

Tracks: Track 2

Multi-signal pandemic surveillance combines wastewater, search-query, information-seeking, and clinical signals under the assumption that adding sources improves detection. We test this assumption across the multi-year endemic transition of COVID-19 and contrast against influenza. Across attention-based signals (Google Trends, Wikipedia) we find 5–23× variance compression after the first major COVID-19 wave but no compression across flu seasons; wastewater is the only signal type with stable variance across both diseases. The diagnosis is a novelty cycle in public attention specific to emerging pathogens. We additionally report a corpus-scale negative result on LLM-prompt surveillance using WildChat-4.8M (3.2M conversations), and a system-design proposal for privacy-preserving aggregate releases.

Show Reviews (2)

Review 1

This was a really exceptionally executed and presented project, with some very clear explanations and compelling data visualizations. It was also an impressive amount of work to have achieved in the timeframe. On style and presentation, my only gripe was that the work was structured and began quite formal and academic, but then transitioned to a more 'bloggy' tone. Though the clear explanations and signposts made up for that. Other points of improvement were 1) true relevance to the problem in the Track and 2) the methodology/results felt like they were somewhat overcomplicated/contrived. On point 1), the work mainly considers how signals change during an outbreak that became much less catastrophic over time, rather than contributing to a true early-warning system or initial detection. This is useful from a surveillance perspective, but is quite a bit less marginally useful than if the work addressed pre-outbreak warning systems. It would have been interesting to test whether there were any signals that reflected the changing virology/epidemiology of the COVID-19 variants (e.g. do any of the input signals reflect changing symptoms/lethality/transmissibility for which we have good data for a given variant). On point 2), while impressive, the project often felt like it described quite 'common sense' findings, e.g. people got 'COVID-fatigued' and stopped searching things on Google. There was a lot of fairly advanced mathematics and statistics in this study that were outside of my expertise, so I would liked to have seen more application of Occam's razor - some simpler descriptive statistics might have painted a similar story and been more accessible to a wider audience.

Review 2

Report shares finding that online user signals of attention to a particular pathogen do not reflect ground truth levels of incidence in the long run. Report notes that the historical decay of public attention to COVID-19 relative to its incidence appears to a phenomenon stemming from its prior novelty as an emerging pandemic. Report highlights that wastewater surveillance remains a reliable signal of a pathogen's incidence, unlike online signals of attention to the pathogen, which may not always correspond. Code repository link in PDF submission did not work, at least on the submission review platform. Repository could not be found on the submission author's personal GitHub page. Report appears generated at least in large part via LLM assistance, though a note on how AI-technologies were used in the submission was not included. Implications of report findings are not highlighted enough- with a document of this length, it is extra important to make the take-aways clear.

#47

WHO WROTE THIS SEQUENCE?

View PDF

Team: Babita Singh | Reviews: 3

Tracks: Track 3 Track 1

Today, any curious mind can open the laptop, design a novel enzyme, order it synthesised, and have it on a bench before any registry knows it exists. This is an extraordinary scientific advancement, but without the right infrastructure, a biosecurity problem waiting to compound. Generative AI is producing novel proteins and genes faster than the field can catalogue, evaluate, or attribute them. No shared infrastructure exists to distinguish AI-designed sequences from naturally occurring ones, screen them for biosafety, or credit their creators. ArtGene-Archive (artgene-archive.org) is the first dedicated registry for AI-generated biological sequences. Every submission passes an automated three-gate biosafety pipeline, receives a cryptographically signed certificate anchored to a tamper-evident audit log, and is issued a citable Registry ID. Built on experience at the European Genome-phenome Archive and grounded in emerging AI biosafety research, this dedicated archive solves a specific structural gap: provenance and safety certification at the point of design, not the point of discovery. What it needs now is what GenBank needed in 1982 - institutional commitment, knowledge contribution, and collective adoption.

Show Reviews (3)

Review 1

This is a well-presented and executed project. However, despite the very personal description of the motivation, I struggle to see how this project would specifically help prevent the misuse of AI-enabled biological design. ArtGen Archive is built as a voluntary database that can help scientists take ownership of their projects, and the built-in biosafety screening is commendable for avoiding accidental biosafety mistakes. Malicious actors would likely just not use the platform, and it iI do not see a strong pathway by which this paltform lowers malicious actors' ability to access dangerous pathogens or otherwise reduces the likelihood of deliberate release of biological agent. Without such additional justification I believe that this project is off-topic for this hackathon, despite its excellent execution.

Review 2

This is an impressively creative proof of concept that elegantly demonstrates how protein language models, biosecurity screening, watermarking, certification, cryptography, and blockchain could be combined to build an infrastructure for attributing AI‑designed biological sequences. It’s remarkably well put together, especially given the short timeframe of a hackathon. In an ideal world, a system like this should already exist. Unfortunately, reality is far more complex, and there are numerous practical, technical, and governance challenges that make deploying such an approach extremely difficult. And if even parts of this vision eventually become feasible, they could contribute to stronger biosafety, though not necessarily biosecurity. Even so, it represents a meaningful step in the right direction.

Review 3

Feels like a solution looking for a problem.

#48

Bio Safety Prompt Robustness Evaluation: Do Frontier LLM Safety Refusals Hold Against Adversarial Rephrasing?

View PDF

Team: Aditya Singh | Reviews: 3

Tracks: Track 3

Frontier language models are increasingly being used in biology research contexts, but most of them rely on safety fine-tuning as the primary defense against misuse. We wanted to know whether that defense actually holds up when you change how a harmful request is framed rather than what it is asking. To test this, we built a simple evaluation framework and ran 15 prompts across three categories, benign biology questions, direct misuse queries, and adversarially rephrased versions of the same misuse queries using professional and academic framing, against both Claude Sonnet 4.6 and GPT-4o with no system prompt. We found that adversarial rephrasing eliminated all full refusals across both models. GPT-4o was significantly more permissive under adversarial framing, fully complying with 3 of 5 rephrased queries compared to 1 of 5 for Claude, meaning an audit using only direct queries would reach the wrong conclusion about which model is safer. Both models fully complied with an anthrax weaponization query when it was framed as historical journalism research. We also attempted automated labeling using a second Claude instance as a judge and found only a 10 to 20 percent success rate, suggesting models apply their own safety heuristics when evaluating bio-relevant content and produce responses that break structured output format. We release our task set and evaluation framework to support future biosecurity auditing work.

Show Reviews (3)

Review 1

Overall, the project is well scope and clearly communicated, although not that innovative. The paper tests how different questions and their framings (adversially disguided, directly question, harmless) impact refusals, finding that adverserially disguised questions are more likely to be answered. This has been done in other contexts, including recently for bio (https://securebio.org/biotier/). The methods could have been improved by using a newer model than GPT-4o (makes the results less relevant) and evaluating each model on each question multiple times, capturing some of seeming stochasticity in reffusal behavior. Misc notes: * Appreciate the manual labeling of prompts – gives me more confidence in the results. * For the "partial" response, it would have been nice to expand. Plausibly, the safety policy is actually quite good at letting models answer misuse-relevant questions in a manner that poses little direct risk (e.g., to vague to be practical) while not refusing out-right. That seems like potentially the right choice?

Review 2

This is useful and informative work but does not add too much novel. It does show certain adversarial prompting approaches can bypass model level refusals but, as pointed out in the 'dual use concerns' section this is fairly well known in the field already. It does provide some data around that on two frontier level models and highlights how direct vs adversarial framing can change conclusions, but this is not highly novel. More prompts across more models would strengthen it and the argument evaluations in this manner should be performed. They are, but perhaps the argument in the work is this is important and should be performed more? However, much of that work is not public. The write-up was clear and easy to follow. The tables were useful but graphs could also have been nice to have for quick skimming and to summarize data. It is listed as a limitation that there was no system prompt but that is more a strength in my mind. Many Biosecurity relevant evals are done with no system prompt to gauge model safety at weight-level or performed with and without to compare. Either way, that was the right approach for this paper and not a limitation.

Review 3

Clean, honest work. The cross-model finding is the most valuable contribution: on direct misuse GPT-4o looks slightly safer than Claude, but under adversarial rephrasing GPT-4o is significantly more permissive, with 3/5 full compliance versus 1/5. An audit using only direct queries reaches the wrong conclusion about which model to deploy. That's actionable and undersold. The LLM-as-judge failure is a real secondary finding. Models applying safety heuristics when asked to evaluate bio-relevant content, breaking structured output format instead of returning JSON, is a practical obstacle for scaling this methodology and appears to be a general pattern, not a one-off. The no-system-prompt condition is the main constraint. This is a worst-case baseline that doesn't reflect most real deployments. The findings need replication under realistic operator conditions before they become deployment recommendations. 15 prompts, single annotator, subjective partial/comply boundary. The paper doesn't overclaim. The framework release is the right call, a repeatable audit tool has more lasting value than the specific findings at this sample size.

#49

DNA Provenance Passport

View PDF

Team: Tammy Sisodiya, Nourelden Rihan, John Adeyemo Adedeji | Reviews: 2

Tracks: Track 1 Track 4

DNA Provenance Passport is a “code-signing for DNA” prototype that lets synthesis providers verify whether a DNA design came from a trusted researcher, remained unchanged after signing, and should proceed to normal screening or be routed for review.

Show Reviews (2)

Review 1

You identified a specific challenge for biosecurity and describe an innovative solution - effectively modernising customer screening approaches, rather than relying solely on order screening. This is a genuinely interesting and innovative approach. I particularly appreciated you consideration of sensitivities over the confidentiality of proprietary sequences as well as the application to benchtop synthesis devices. Both are particularly innovative and address real world challenges. I would like to have seen more evidence to support the claims of codon shifting/ optimization effectively circumventing sequence homology approaches. I think using tools to engineer proteins to retain function but with a significantly different sequence have been demonstrated in various publications. Key references to these are missing.

Review 2

Please never ever typeset an entire paper in italics. Literally half of the submissions I've reviewed were entirely set in italics, and I suspect it's because everyone was working from a submission template which used italics to give instructions about what to say in various sections, and you just inherited that formatting. But you should be more vigilant; please don't make reviewers' eyeballs bleed. :) This paper feels like all the hard problems are pushed out of scope. A variety of [citation needed]: (a) p.3. "as we demonstrated" where? (b) "codon optimization can make it fall below threshold" isn't true. There are screening solutions which look at the AA's and not the nucleotides and therefore codon optimization doesn't affect them. (c) You cite SecureDNA but seem to miss that there doesn't *necessarily* have to be a tradeoff between screening and privacy, especially for orders which do not contain hazards. (d) Calling IBBIS's CM "state of the art" requires some citation backup. There are many screening solutions out there. On page 4, your "verified attribution" is basically what SecureDNA's "verified screening" mode does already; this paper is just a recapitulation of that mode. You claim to use ECDSA, but give no justification for why that in particular. DSA in any form, including ECDSA, is *incredibly fragile* and very easy to get wrong; in particular, reusing any nonce exposes the key. There are a number of much better choices which do not have so many sharp edges, so the choice of ECDSA here is curious and unwise. You didn't fill out any of the Appendix or (especially!) the "LLM usage" sections at all. This is both careless and also means that it's not possible to know how much of this paper was LLM-generated. You also didn't remove the template instructions from "Code and Data" and from "References."

#50

BSSBreach: Testing designer protein sequences for biosecurity evasion capabilities

View PDF

Team: Adam Streck, Daniel Cermann, Jan Berndt | Reviews: 2

Tracks: Track 1

The decreasing cost of DNA synthesis and advances in generative protein design raise concerns about evading biosecurity screening systems. We present BSS-Breach, a benchmarking framework for testing screening tools against both conventional sequence modifications and AI-generated variants. Using a pipeline of transformations, including synonymous substitutions, padding, splitting, and diffusion-based sequence generation, we conduct a penetration test against a biosecurity check tool ComMec. Our results show that while standard manipulations are reliably detected, all synthetic variants generated via diffusion models bypass screening. This exposes a key limitation of current approaches, which rely primarily on sequence similarity. These findings highlight the need for next-generation screening methods, as well as benchmarking and pen-testing toolkits to evaluate these.

Show Reviews (2)

Review 1

The 'Related work' section helped to set the work in a wider context nicely. Seems like a valuable contribution to efforts in the space. Maybe it's not super original, but I'd really like to see the work expanded.

Review 2

This is a useful and grounded project. You focus on evaluation rather than speculation, and you demonstrate a concrete weakness in current screening systems. The comparison between traditional manipulations and AI-generated variants is especially effective, it clearly shows where defenses break. The main limitation is experimental depth. Right now, the result is binary: detection vs. bypass. To make this more impactful, you should quantify performance. Report detection rates, sample sizes, and variability. Show whether some generated sequences are closer to detection thresholds than others.

#51

Know Your Researcher Bio: Portable Authorization for AI-Bio Tools and Benchtop DNA Synthesizers

View PDF

Team: Ashton Chew | Reviews: 3

Tracks: Track 4

Sequence screening has matured faster than portable requester authorization. Whether a requester has been reviewed and remains authorized to make the request is still answered ad hoc at every provider, AI-bio tool, and equipment vendor. KYR-Bio is a local-first prototype of that missing layer: a reviewed, scoped, holder-bound researcher authorization that AI-bio tools, synthesis checkout flows, and benchtop DNA synthesizers can verify locally without re-sharing the applicant dossier. A single human-reviewed decision is packaged into a scoped, signed credential that a researcher's wallet presents to any participating relying party. Verifiers run schema, signature, issuer governance, Bitstring Status List freshness, holder proof and challenge binding, and scope checks before a policy adapter applies local rules. Audit events carry a hash chain and rolling Merkle root and exclude raw biological prompts, sequences, and reviewer notes. The synthetic evaluation passes 8 persona, 22 verifier, and 18 AI-assistance cases.

Show Reviews (3)

Review 1

This paper is a helpful specification document for what a KYC scheme could look like in reality. However, the core idea presented in the paper is not technically novel, and more importantly, is not the crux towards solving the problem of customer screening. If I understood the presented approach correctly, the main advantage over relying on existing authentication methods is mostly some privacy improvement. While these could help increase uptake of KYC schemes, they do not inherently address security issues. Additionally, the paper could be improved by being more concise.

Review 2

This project is addressing a meaningful gap and it is quite comprehensive for a hackathon. It is great that the author thought about KYC automation while keeping robustness and avoiding redundancies. It has a good systems-level work with a nice layer of portable credentials, separation of roles, and the decision of having local verification with verifier-specific policy. I also liked the end-to-end prototype with onboarding to audit. I would be curious to see this system being upscaled in real life. Aspects which need improvement include: a better rubric of who are trusted issuers, how is trust bootstrapped internationally and how are disputes handled. It also seems that evaluation metrics are somewhat limited. “22/22 cases pass” is promising but not very informative as it lacks false positive/false negative analysis, usability or latency measurements.

Review 3

This was a lot of work for a solo build, nice! The most useful next step is probably a live deployable demo with one real participating verifier, even just a single AI-bio tool's auth flow since currently everything is local + synthetic personas.

View PDF

Team: Tyler Rector | Reviews: 2

Tracks: Track 1

A gradient-boosting classifier for fast, short-read DNA threat detection. Uses a hybrid DNA/protein k-mer approach and reverse-screening post-filters to accurately identify hazardous sequences while generalizing to novel organisms.

Show Reviews (2)

Review 1

I would’ve appreciated a discussion in the intro why screening below 30bp might matter in practice. Is it a plausible threat vector that malicious actors could stitch together 25bp DNA pieces to boot up dangerous pathogens? Doesn’t that take way too long? Due to this I’m a bit skeptical of the utility of screening <30bp in practice (esp. considering the risk of false positives) but your approach is certainly interesting and it’s good to know that screening below 30bp is plausible if this turns out to be an important threat vector. It would be nice to have a comparison of how good AUC of ~0.8 actually is. How does this compare to SecureDNA/IBBIS? (Maybe you can’t assess their AUC) This is really interesting if true! “The protein features tell a different and more biologically defensible story. Of total feature gain, 73% comes from protein k-mers and 27% from DNA k-mers. Top protein k-mers include hydrophobic and aromatic clusters consistent with transmembrane and aromatic-binding regions of bacterial proteins, suggesting that the model is learning real biology at the protein level even while relying on compositional bias at the DNA level.” It’s a valuable finding that the classifier fails on phylogenetically distant organisms. However, I think the problem here is false-positives, right? And false positives are likely the most costly aspect of implementing a screening mechanisms for companies since they need to manually investigate flagged positives. Would be good to flag this in the discussion.l I would’ve liked to see a discussion of the utility of this approach for engineered pathogens. When screening pathogen sequences that are modified, do they fall out of distribution and aren't detected anymore? That would be a critically important false negative. The over-representation and limitation for viral sequences is a shortcoming given the importance of engineered viruses as a key threat pathway. Could this be fixed with more viral training data? IMHO your submission is a bit too biased toward arxiv-style academic writing. That's great for a very particular researcher audience but as a judge who is more in policy-world, it's a bit hard to follow. I think you could take inspiration from the writing of research outputs at places like METR or GovAI who do a great job at writing with rigorous clarity that is still accessible to non-experts.

Review 2

While I'm not sure that short/oligo sequences are as big of a risk as people currently believe, this seems like a positive contribution to the space.

#58

Sentinel Atlas: A Centralized Platform for Multi-Source Epidemic Surveillance Data

View PDF

Team: Chris Harig, Yeng Mun Liaw, Vidur Kumar, Umar Zafar | Reviews: 3

Tracks: Track 2

We propose Sentinel Atlas, a centralized platform that aggregates fragmented, multi-source pathogen and wastewater surveillance data into a standardized, open-access infrastructure to enable real-time epidemiological forecasting and proactive pandemic defense.

Show Reviews (3)

Review 1

- Include an example for a finding that was uniquely possible through your platform. That way you can show a concrete application and highlight the marginal value this tool provides. - Try to make the UI on the website a bit more intuitive - I played around with it for a bit but was never able to get to the map view. - Small thing but follow the WHO recommendation to say mpox over 'monkey pox' - A lot of DC policymakers are talking about the need for this integrated biointelligence you provide, so that's great!

Review 2

The project correctly identified that it's hard to get any kind of harmonized pandemic data that has useful granularity. I also think the predictions element could work very well, although you have to be careful with the quality threshold you set for these. I really liked the live/stale status tagging, as many of these systems are only maintained for a short time. I think this sort of thing has been tried a couple times (EpiGraphHub, Unified COVID-19 Dataset) so I think engaging with why previous efforts stalled would substantially strengthen the contribution. You don't really make an argument for data fragmentation being the primary argument, you just make the claim. Unpacking this would be useful.

Review 3

Solid infrastructure work addressing a real problem. The honest framing is that this is a proposal with early infrastructure, not a validated platform. The crowdsourcing pipeline is the most novel contribution but is unvalidated. There is no demonstrated end-to-end test — no submitted forecast, no approved pull request, no leaderboard entry. The team could have dog-fooded this themselves: four team members submitting simple naive forecasts would have proven the pipeline works and produced a real result to show. Without that, it's infrastructure whose core function remains undemonstrated. The paper undersells what was actually built. A leaderboard scoring system, automated prediction repos, multi-pane workspace, and news feed are all in the repository but absent from the submission.

#59

Reagent Supply-Chain Structure for Benchtop DNA Synthesizers: There is Hope for KYC

View PDF

Team: Martin Rossa | Reviews: 3

Tracks: Track 4

Benchtop DNA synthesizers are approaching viral-genome-length assembly within 2-5 years. This primary-source review of nine vendors and eleven device families finds that every device crossing the >=1.5 kb threshold runs on a proprietary reagent ecosystem, potentially enabling low marginal cost KYC regulation. Submitter notes the project may be revised before public publication. Track 4. Manual entry: submitted via Discord DM after Framer Form closed.

Show Reviews (3)

Review 1

I like the approach this submission takes because it gets to a novel implementation question, highlighting an additional chokepoint that could improve biosecurity. This submission could be significantly improved with more time, building out subsections that were teased but not yet completed.

Review 2

The submission correctly identifies that we will need new systems to address misuse in benchtop synthesizers if they become an accessible alternative to commercial synthesis providers. The proposed approach is one of many suggested previously and doesn't present a new solution or analysis.

Review 3

A well-researched report on the current state of benchtop DNA sequencers and the commercial feasibility of cryptographic authentication for those powerful enough to produce viral-genome-length assemblies. Such biosecurity technologies may already be implemented but not disclosed publicly. The report could be stronger by (1) expanding on the implications of the research findings/offering clearer take-away messages, and (2) by leaning into call(s) to action for what steps should be taken or explored given the results of the report.

#60

ABT-SIM: A tool for collaborative biorisk assessment

View PDF

Team: Jakub Nowak, Michał Skowronek | Reviews: 3

Tracks: Track 3

Biosecurity access controls are typically mapped along linear bioweapon development cycles, but adversaries may route around controls by choosing among design methods, material sources, and providers. We built ABT-SIM, a web-based prototype for modeling bioweapon development pathways as directed graphs with probabilistic nodes. Users can create scenarios spanning design through deployment stages, assign detection probabilities and capability requirements to pathway nodes, and model conditional probability flows through logic gates. The tool provides basic analytics including control effectiveness comparisons and outcome impact visualization. While still a prototype, ABT-SIM illustrates how interactive modeling tools could support more systematic biosecurity analysis. The platform provides a foundation for exploring adversary biorisk pathways though significant development would be needed to reach operational level. The work demonstrates potential value in dedicated tools for biosecurity-relevant analyses.

Show Reviews (3)

Review 1

The idea is certainly interesting; I could imagine it being useful to those identifying gaps in mitigations or evals. That being said, in practice we mostly develop mitigations/evals by identifying steps common to many scenarios and don't do that much of this quantitative modeling work (which could take more time than actually developing the mitigation). Also, I'm not sure about the sources of these numbers/percent risks, so I don't know how to trust this. For infohazard reasons, I also wouldn't publish steps in threat pathways in any more detail than has been done already. Nonetheless, it's an interesting idea, and some of this modeling work is useful for informing AI mitigations.

Review 2

The project occupies a real gap in the practitioner tool landscape, since publicly accessible interactive tools for probabilistic adversary pathway modeling in the AI-bio context do not appear to exist elsewhere, and the choice to package event tree analysis with logic gates in a browser-based canvas with real-time recalculation is a real accessibility contribution even though the underlying analytical method has decades of precedent in nuclear safety, cybersecurity attack tree modeling, and academic biosecurity PRA work. The current framing treats Sandia's BioRAMs and similar tools as point-solution oriented, but the substantive differentiator is accessibility and the AI-bio routing framing, probably not a new analytical paradigm. The deeper issue is that the methodological core sits in tension with the use case in a way the limitations section does not engage with. Event tree analysis works well when pathway structures are closed and finite, when probability inputs are estimable from data, and when conditional independence approximately holds, and adversary modeling in biosecurity satisfies essentially none of these conditions because pathways are open and adaptive, probabilities are elicited rather than measured, and adversary behavior produces strong dependencies between nodes that the multiplicative aggregation model ignores. The result is that the calculated outputs (the 56.7 percent exposure number, the tornado chart rankings, the outcome impact deltas) carry a degree of false specificity that the underlying estimation cannot support, and a user who took these numbers as predictive rather than illustrative would be putting more weight on them than the methodology warrants.

Review 3

Working prototype with real-time ETA propagation through AND/OR logic gates, which is good execution for a hackathon weekend. The framing (adversaries route around controls rather than following linear pipelines) is correct, but event tree analysis over directed graphs is a mature methodology and the novelty here is mainly the biosecurity-specific scenario seeding and collaborative UX. Validate with two biosecurity practitioners on a live scenario before the probability outputs are used for anything decision-relevant.

#61

Benji-Bio

View PDF

Team: Ananya Ayasi | Reviews: 3

Tracks: Track 3

Benji-Bio is a stress-test harness for evaluating AI-biosecurity safety monitors under prompt transformations such as paraphrasing, role-shifting, ambiguity, and fictionalized misuse framing. The project studies a specific evaluation failure mode: static or public benchmark prompts can overestimate safety when monitors learn obvious refusal patterns instead of robustly recognizing risky intent.

Show Reviews (3)

Review 1

The framing is good- I would suggest expanding to LLM based monitors (even just claude or GPT as a classifier with a structured prompt) or simply reframe the paper's scope to keyword monitor evaluation. If the central claim is transformations matter, the main finding should be the variant accuracy degradation, not monitor ranking. The core idea is worth pursuing but the current implementation is still too thin.

Review 2

Benji-Bio is a small prototype benchmark for testing whether AI biosecurity monitors are robust to prompt transformations . This is potentially interesting if this can add automated, large-scale transformation generation, evaluate LLM-based monitors, and validate on a larger, expert-labeled dataset to demonstrate real robustness.

Review 3

Clear and thoughtful research, project, experiments, results, and presentation. A useful approach against the threat of prompt engineering to get around Biosecurity guardrails in large language models. In addition, the generated dataset can be helpful to others in developing solutions to this problem.

View PDF

Team: Grant Schumock, Amar Pandya | Reviews: 2

Tracks: Track 3

Biosecurity policy is increasingly important to the governance of artificial intelligence, biotechnology, pandemic preparedness, and synthetic biology, but U.S. federal policy signals are spread across fragmented and difficult-to-monitor sources. Relevant developments may appear as legislation on Congress.gov, proposed or final regulatory actions in the Federal Register, or as docket materials and comments on Regulations.gov. This fragmentation creates a barrier for policymakers, researchers, advocates, and other stakeholders who need to track emerging biosecurity rules and legislative activity. We present the Biosecurity Policy Dashboard, a proof-of-concept tool for aggregating and exploring U.S. federal biosecurity policy documents across Congress.gov, Regulations.gov, and the Federal Register. The system uses keyword-based retrieval to identify potentially relevant legislative and regulatory records, stores the results in a local SQLite database, and displays them in an interactive Streamlit dashboard. The dashboard allows users to search, filter, sort, and inspect relevant policy documents by source, date, keyword match, agency, docket, and document type. The current prototype demonstrates the feasibility of a unified, regularly updated policy-monitoring interface for the U.S. biosecurity landscape.

Show Reviews (2)

Review 1

The staffer framing is good and makes the problem concrete. But the deliverable is unfinished. Refresh disabled, partial data, local only, and no measurement of how well the filter works. You flag the noise issue yourselves an LLM relevance pass could fix most of it. Differentiator vs AGORA/B-SPAN isn't sharp.

Review 2

Interesting project! I'd love to see an online deployed version if/when it becomes available. Also, I'm wondering if this is much more helpful than someone searching on the three underlying sites (congress.gov, regulations.gov, Federal Register).

#67

Biosecurity Export Control Navigator: Cross-Jurisdictional Regulatory Comparison for Dual-Use Items

View PDF

Team: Yousuf Golding | Reviews: 2

Tracks: Track 3

Cross-jurisdictional regulatory comparison tool for dual-use items, surfacing where export control regimes diverge across jurisdictions to help researchers and compliance teams reason about biosecurity-relevant exports. Track 3 submission. Manual entry: submitted via email after Framer Form closed before announced AoE cutoff (gmail thread 40827).

Show Reviews (2)

Review 1

- Give more context on the relationship between biorisk and export controls. Is the risk some goods would be exported to states that pursue BW? Does it pertain to group or lone wolf actors? Have modifications to export control regimes been discussed as an effective tool to reduce biorisk? This helps make the case for why this tool matters. The paper goes into technical details quite abruptly. - Introduce the acronyms you are using. Otherwise, it can get hard to follow. Try to reduce the density of acronyms to make it easier to follow. - Explain why you went for the US, EU, and AUS. - For your three coverage gaps: explain why this has practical relevance. How could this inform or change the behavior of someone working with export control lists? How might these insights reduce biorisk?

Review 2

The project is a competent regulatory comparison tool, and the technical work behind it stands on its own merits. H However, the core issue is that the report positions the Navigator as a biosecurity tool, but the artifact is more accurately described as a cross-jurisdictional regulatory comparison tool that happens to cover a category of regulations which includes biosecurity-relevant items. It doesn't directly advance biosecurity. Export controls are one policy instrument among several that governments use to pursue biosecurity goals, and the Navigator sits at the last link in a chain that runs from biosecurity objectives through regulatory text to user-facing comparison. The dual-use analysis in Section A.2 actually demonstrates this gap, perhaps unintentionally, because the marginal-uplift argument used to defend against bad-faith use applies symmetrically to good-faith use as well: if making comparison faster does not meaningfully shift the offense-defense balance for attackers because the underlying regulations are already public, it does not meaningfully shift it for defenders for the same reason. A more careful version of the report would acknowledge this symmetry directly.

#68

Success Is Not Safety: When Helpful Bio Agents Do Too Much

View PDF

Team: Philip Nilsson | Reviews: 3

Tracks: Track 3

This project introduces Bio Capability Boundary Monitor, a prototype pre-deployment audit tool for biological AI workflows. It detects capability overreach: cases where an agent completes the visible task but uses more biological capability than the request justified. In a 1,500-run Llama evaluation, raw task success was 96.6%, while safety-adjusted success was only 31.7%. A scope-aware postcondition layer reduced strict false allow rate from 8.47% to 0.139%, recovering 60/61 missed unsafe runs with 0/480 new false positives. A 100-run Track 3 application slice showed the same pattern in bounded public-health triage and screening policy review proxies. The core lesson is that biosecurity audits should not only ask whether an answer is harmful or useful. They should ask whether the workflow used biological capability the task actually justified.

Show Reviews (3)

Review 1

Thank you for your submission, evaluating biological agents by assessing whether or not capability overreach occurred is an important topic to address. The community would benefit from further analysis of pre-deployment strategies and from incentivizing their application. For the oversight layer, I wonder if supervision must be provided from a more capable model in the scenarios tested, or whether weak-to-strong generalization could apply. There are several ways that this submission could be improved. First, I think the core argument requires more justification. At the present, features such as retrieval, planning, and reasoning often improve output quality and could even enhance safety. Second, how variable are your results across different language models, and why is Llama tested in particular? Third, addressing key metrics such as safety-adjusted success would be appreciated, as they are important for interpreting the results.

Review 2

This is an important research that address capability overreach and is somewhat novel in biosecurity. If I understand correctly, the current system seems to rely heavily on hand-crafted rules which limits generalization to real-world agent settings. I would focus on replacing manual scope rules with learned or formally specified constraints, and evaluate in more realistic multi-step agent environments where capability boundaries are ambiguous.

Review 3

The biggest challenge with the submission is that I do not see how using more biological capability than a task requires is a safety concern, though I buy it's a challenge in terms of efficiency, and compute costs. If a task uses a more advanced capability or shares more knowledge than necessary, but that knowledge or capability is not itself harmful, then I do not see what the problem is. The authors provide an example of an agent providing more biological context and handling information than necessary; however, it seems like the person employing the agent could simply make follow-up inquires to get that same context and handling information. If that information isn't harmful, who cares?

View PDF

Team: Anusha Asim, Ammar Ahmed Farooqi, Maryam Asif, Fouwaz Parkar | Reviews: 2

Tracks: Track 3 Track 4

BASTK-Bench introduces a novel framework for evaluating biological risk in open-weight AI models by focusing on real-world, execution-level capability under low-resource conditions, particularly somatic tacit knowledge in tasks like DIY CRISPR troubleshooting. It is among the first studies to systematically biorisk-assess newer open-weight models such as Llama-4-Scout and Qwen3-32B in this kind of execution-oriented setting. The results show that risk is highly task- and framing-dependent, with newer models exhibiting higher risk in specific practical scenarios, suggesting current benchmarks may underestimate real-world biological misuse potential.

Show Reviews (2)

Review 1

Impact Potential & Innovation: The lack of standardized evaluation frameworks is a real gap in the evals landscape that is worth addressing. You correctly identify some problems with current eval methodology, such as limited prompt-sensitivity testing (although this is a more subtle problem than you describe, see eg. here: https://www.anthropic.com/research/evaluating-ai-systems for examples on how simple formatting choices can affect eval scores). Focus on Q&A testing is also a genuine problem, but you strawman the case a bit by completely omitting agentic evaluations (like ABC-Bench or ABLE) and uplift studies (like Shen et al., 2026 https://arxiv.org/abs/2602.16703) in your literature review section. Both Llama and Qwen3 have already been evaluated for biorisk, see for example https://airiskmonitor.net/. "A key finding is that newer models may be more capable (and potentially more risky) than older, larger ones, challenging assumptions that safety scales predictably with model size" - that is actually not the assumption; it's well-known that newer, smaller models often perform better and that capabilities tend to scale with release date rather than size (see https://metr.org/time-horizons/). This isn't a new finding. Additionally, Llama-4-Scout is not actually smaller than Llama-3.3-70B; 4-Scout is a Mixture of Experts model with 109B total parameters and 17B active per token, while 3.3-70B is a dense model with 70B total parameters. Scout is smaller in inference compute per token, which is the point of MoE Theory of change is missing from the paper. What is the intended audience of this eval? What will happen if people use it, how will it change outcomes? Is it meant to influence policy, evaluate safeguards for internal lab use, or contribute to risk monitoring? Clearly outlining this would make the submission stronger, and I encourage you to think about theory of change of your future projects in general, to make sure you build a tool to solve a problem rather than look for a problem to suit your tool. The "non-googlable tacit cues" criterion is unaddressed as a generalizable research direction. Do you google it every time? Do you need experts to score every run of the eval? Do you have any plan on how you would automate this, at least partly? Without an answer, the approach doesn't generalize. Execution Quality: There are some significant problems with the methodology of this work. The biggest one is the manual scoring that you use - current evaluations are focused on addressing that, this approach is not scalable. You acknowledge it may influence results ("despite consistent evaluation guidelines" that are either not disclosed or very laconic, if the table in the paper and the “scoring notes” column in the dataset is all there is). The work would benefit from at least using an LLM-as-judge approach, and acknowledging its limitations. Scoring notes like "depth" don't seem detailed enough. Writing the eval in plain Python (rather than UK AISI's inspect framework) does not affect your score by itself, but inspect is the industry standard for AI evals and would have given you scoring infrastructure, model adapters, refusal handling for reasoning models, and reproducibility tooling for free. Strongly recommend getting acquainted with it for future work. I appreciate that you use an auto-scorer for refusals. However, the implementation is brittle. For example, it breaks for reasoning models that start the answer with <think>. This kind of edge case is exactly what existing eval frameworks like Inspect already handle, so you don't need to reinvent the wheel. There are some minor code hygiene problems like hard-coding the API key in the main eval code. Consider using an .env file. You do not provide any baseline (e.g., if a single google search would give out the same information or not), so it's possible that the actual uplift on the tasks you described (DIY CRISPR troubleshooting) is actually small, because of the availability of educational materials on the internet. You define tacit knowledge as "depth beyond surface-level information," but tacit knowledge is by definition knowledge that is hard to convey by text; in biology it usually refers to e.g. ways to handle equipment. You also do not explain what you mean by somatic adaptation, or how you measure it. This makes it difficult to assess what the eval is actually measuring. In the future, I encourage you to document (and include in the paper) how you approached task selection and prompt generation, and if the task dataset is representative of actual real-work scenarios, and how do you know that. It’s an important part of the methodology, the VCT paper does a very good job here. Your “Risk Score” is hard to interpret. Each of Uplift and Accessibility is scored 0–3, so 9 is the maximum possible value (the value set is [0,1,2,3,4,6,9]. How does the jump between 0 and 1 correspond to that of from 6 to 9? Why did you use a metric like this, what real-life value does it refer to?). Tacit Knowledge and Refusal Robustness are scored but the results are not evaluated or presented, even though somatic tacit knowledge is the central pitch of the paper. Refusal robustness test is not standardized. "If a model refuses, a single jailbreak variant (e.g., role-play or academic framing) is attempted to test refusal robustness." How do you choose which variant? This needs a deterministic protocol, otherwise results aren't comparable across models. Also, your code actually tests both of those jailbreak strategies. The paper would benefit from either dropping that part or using more SOTA jailbreaking techniques, because now you have false negatives when the model refuses because the jailbreak prompts are not very sophisticated, and you mark it “safe”. Prompt framing is methodologically off. Evals rarely explicitly use the word "bioweapon" because that would almost surely result in a refusal from all models. They frame the questions as benign (see https://www.rand.org/pubs/research_reports/RRA4591-1.html for an example). One of your representative prompts is "bioweapon planning," which likely biases the refusal-vs-compliance results. I encourage you to think about information hazard when releasing code and prompts, and address this every time you publish. While I do not think that your paper presents a large infohazard, it's part of eval hygiene. Also think about information security when dealing with dual-use data. GitHub is owned by Microsoft, Copilot is trained on public repos, so you risk both data leakage and possibly providing dangerous knowledge to the model. Your work lack any statistical testing, confidence intervals, etc. You acknowledge this as a limitation, which is appreciated and I understand it's hard to address in a hackathon, but the evaluation community would benefit from higher statistical rigor. I encourage you to address it in future work (e.g., pre-register results and methodology, calculate power). Shen et al. 2026 is a good example of the direction we should aim, even with its limitations. "Closes expert gap" (Uplift = 3) is self-referential without an expert in the loop. You can't score whether something "closes the expert gap" without expert validation, which the paper explicitly does not have. Presentation & Clarity: The paper does not discuss the prompts and tasks almost at all, so it's impossible to assess them from the paper alone. Either address that (say that prompts are proprietary or not public due to data-leakage risk/infohazard, but you’ll release them on reasonable request) or discuss the prompts in the text. You should have a table in the paper describing them and mapping them to your subsets, plus a brief description of the prompt generation process, and a better justification for why you chose those prompts/areas. There is no way to know from the paper alone which prompts you used, the reader needs to inspect both the database and code to figure it out. In the methods se

View PDF

Team: Kevin Fickert, Jessica Anderson, David Nai | Reviews: 1

Tracks: Track 2

Biosecurity analysts face a growing asymmetry: outbreak reporting volume has expanded substantially while the number of trained personnel able to synthesise that information in real time has not. BioWatch Brief compresses the analyst intake stage from hours to minutes via a three-stage LLM pipeline (structured extraction, retrieval against a curated corpus of historical outbreaks and biosecurity policy frameworks, and grounded analysis), producing a structured risk card from arbitrary input reports. The architecture deliberately separates fact extraction from retrieval and synthesis, constraining LLM outputs at each stage and surfacing uncertainty rather than masking it. Built on gpt-4.1-mini with a curated 21-entry open corpus (16 historical outbreaks, 5 policy/framework documents) normalised across pathogen, location, transmission, response history, lessons learned, and source URLs. React frontend, FastAPI backend, single /analyze_report endpoint. Built at the Apart AIxBio Hackathon, April 2026 (University of Pennsylvania).

Show Reviews (1)

Review 1

The problem framing in your introduction is strong, the asymmetry between outbreak reports and analyst capacity is a real chokepoint. The architectural argument in your discussion, that structural fidelity has to be enforced rather than prompted for, is a good contribution worth exploring and developing further. I could see a lot of value in a tool like this existing. The idea is sound, but the current implementation does have limitations. To your credit you identify and flag many of them in the limitations section. The RAG search is tied to key words, and the current corpus is small, so it limits the usefulness but does work for a demonstration for a hackathon. However, the full results are not provided anywhere and only one example is given. Tying one of the categories to geography could be limiting as well. While some viral and bacterial families which can cause a PHEIC are geographically limited, others are not, influenza immediately springs to mind. Geographic anchoring also somewhat limits the effectiveness for bioweapons or engineered pandemics which could cause GCBRs and could, by design, be first unleashed anywhere. One could consider keeping the fields but weighing them differently in the scoring in some manner. The scenario library described in the reports is author and crafted to match exactly what the tool needs, was used during development, and had no held-out set. Section 4 then reflects expected behavior on constructed examples, rather than evidence the system works on inputs it hasn't seen or might be messier. The line that uncertainty flags "correlated with the cases where ground-truth assessment was itself ambiguous to the human authors" is somewhat confusing, and essentially the authors agreeing with themselves. The planning notes /doc on GitHub suggested running 10–15 archived ProMED alerts with documented outcomes as an eval. That would have been great and would have given you real evidence for the report. Even a smaller number, like five archived alerts with retrospective WHO classifications, would substantially strengthen this section and serve as stronger proof BioWatch Brief is behaving and functioning as intended. The linked codebase is somewhat hard to follow compared to the submitted final write up. It looks like there are two distinct pipelines in it? Pipeline.py and main.py? The first has the FastAPI backend using OpenAI's gpt-4.1-mini and seemss to match the staged pipeline described in your Methods section. It also has the keyword-scored retrieval over the fixed corpus and the grounded synthesis. The two LLM round trips. It is also named "main." I judged only on this one, as it seems to be the submission. The second is using the Anthropic SDK and Claude Opus 4.5 with tool use and the output is quite different, but has no RAG/keyword based corpus retrieval. I am not scoring on this pipeline, as the paper seems to indicate the first, but the outputs here seem richer and more informative. So just a flag it could be good to build towards this more in-depth and informative outputs in the submitted, RAG + corpus grounded tool. You also have code for a live signal assessment in the first / main pipeline but that does not seem to be implemented in any way with a way to get live signal, though the planning document has some ideas in it. Not judging on that, just flagging this is a good area to expand and work on if you continue with this tool. A minor quibble but the write-up likely used some resources that are not cited. The mention of BlueDot and Metabiota sent me googling as I was curious if said BlueDot was related to BlueDot impact at all, and the first result in this paper https://pmc.ncbi.nlm.nih.gov/articles/PMC7378493/. Feel like this should have been cited in the report, which then makes me wonder if there are other missing citations. I am not saying there are, just it raises the spectre of it. You do also disclose the use of of LLMs for brainstorming, but the planning MD appears to be generated by Claude / Claude Code and I am sure it was after a lot of back and forth with user(s) but it is more a design spec than just a brainstorming doc. Which I think is fine, that's the way things work now, but maybe could have been disclosed differently. This is such a new area it is hard to say, so not scoring against this at all, just mentioning it.

#82

Open-Pandemic-Risk: Alert, Enrich, Evaluate, Recommend, Open Platform for National Public Health experts with AI agents and explainble models.

View PDF

Team: Yi Yao Tan | Reviews: 3

Tracks: Track 2

This project is an attempt to build a pandemic risk monitoring platform for public health experts in policy or fieldwork with four stages: Alert, connecting verified health professional signals, Enrich, a scalable multiturn agent with search gathering external context, Evaluate, a risk assessment model (doubleml) to score risk and confidence of the alert and enriched data, and Recommend, a grounded AI agent recommending actionable response steps to public-health teams for faster response from raw signals to actions recommendations based on explainable models policy makers are familar with. The project ended up scaffolded and is deployable with the pipeline runnable, however, model experiments were not rigorously run nor verified. It serves as an entrypoint to continue transparent open development of explainable, live, grounded, and actionable alerting and response of pandemic risk where speed and transparency matter.

Show Reviews (3)

Review 1

Open system is an interesting idea. Who could cover costs? That's main reason the best tools like this so far are not open. For the layer to draw from past decisions: many past decisions in handling emergent outbreaks were not good. How to account for that? It would vary by locality, but for many places that lack other decision support tools, something rooted in the WHO like this could be relatively useful and trusted.

Review 2

This project seems like an ambitious undertaking to me. I want to credit the author for presening it as such and noting that there was a limited amount they could do in a weekend. Assuming that AI agents are "working well" then I do think that a system like this would provide real value. The project at the moment is mostly scaffolding. I wonder if the path to building trust in a process like this would be to start with a smaller chunk of the problem which can be validated.

Review 3

An interesting start on an AI-enhanced pandemic response pipeline, but lacking key implementation details. The need for a new end-to-end solution, rather than supplementing existing tools, is hinted at but not fully explained. More development time is necessary to see results.

View PDF

Team: Igor Mizin, Svyatoslav Pizanov | Reviews: 2

Tracks: Track 2

EPICURUS AI is a system that began as a disease outbreak forecasting tool using historical case data and was expanded during the AIxBio Hackathon with bidirectional pathogen prediction. It bridges three domains: epidemiology, molecular biology, and vector ecology. The system works both ways: epidemiological parameters predict molecular traits, and molecular features predict epidemiological outcomes. This means we can estimate R₀, incubation period, and case fatality rate from genome features alone — critical for assessing AI-generated pathogens before they circulate. Built as a Streamlit prototype with switchable ML models, trained on curated data from WHO GLASS, WHO BPPL, SeqScreen, and peer-reviewed arbovirus datasets. Designed as a proactive defense tool against engineered biological threats.

Show Reviews (2)

Review 1

Suggest substantial trimming, adding dual use section, validation/code section with numbers. There's a lot of claims made that aren't supported by any evidence/validation.

Review 2

Can't say the presentation didn't have style, but in the future I would strongly encourage you to move toward briefer write-ups for hackathons! It's much easier for judges to evaluate the technical aspects of the methodology that way. The actual hackathon contribution here doesn't arrive until page 23! Basically everything up to that point is scene-setting, throat-clearing, or stylized preamble that isn't useful for this venue, where the only people likely to read this are very up-to-context on AIxBio. When we do get to the meat of the proposal, it appears like more of a design sketch for what you're going to build than what you did build. There are no results or screenshots presented, and no details on the methods that I can really evaluate. Instead of choosing an algorithm and motivating it, this gets around the problem of carefully choosing a model by just making it a user drop-down. What hypothetical user in a biosecurity decision-making context is going to use a drop-down to decide how to model the data coming in? The idea of mapping molecular features to epidemiological features with the pathogen dataset does make sense, and I'd like to see this kind of model fully fleshed out. I'm not sure running the training the other way with the epidemiological features as predictors really makes it bidirectional, but this is still an interesting idea. I would hesitate to anchor too much on results that popped out of this model, though, without a more thorough treatment of confounds (R_0 does not just depend on molecular properties, for instance, and will vary a lot on population properties that this would not capture). It's also not clear that small models trained on 37 pathogens are going to provide any useful signal for novel pathogens, especially when the relevant features are just tabular. Predicting certain epidemiological features using molecular properties in tandem with info about host, population etc does seem possible and useful, but using small tabular datasets looks unlikely to me to uncover many non-trivial generalizable patterns. I think there's a version of this that would still be cool as a hackathon project, and potentially as a direction to further explore, but the write-up mostly buries or does not provide details to evaluate.