We identify AI-redesign risks in several existing screening tools and mitigate them with our own AI model, Capiti. We also built a physical device that supports our model for edge AI deployment, and we demonstrate physical interruption of emulated synthesis runs.
Review 1
This is, for a hackathon, overall exceptionally strong and interesting work. Very impressive to have accomplished so much in the timeframe and even built a hardware prototype. Great work. The tests in the paper are very interesting, and it is genuinely novel to think of AMR as an unexplored threat vector. The finding that commec misses ~95% of those sequences is strong, and you identify that this is because they do not consider AMR a threat and leave it out of their libraries. This is a gap, and one you highlight with empirical work, well done. You also do well in citing a specific threat vector, a compromised machine, and are aware in the limitations that this does not prevent misuse of a machine a hostile actor may own. I do like the discussion of ways to possibly explore that further and think it is worth doing so.

The finding that Capiti catches more 'functional variants' is also very strong, if it holds up that these are functional variants. I understand time and compute constraints in a hackathon prevented the structural work, and I agree actually doing in vitro testing is not a good idea, but a better understanding of whether these are functional or not could strengthen the work. Without it, it could be that Capiti is flagging benign variants that other methods properly pass, which would complicate the FNR comparisons. A good area to follow up on.

The false positive rates deserve more attention than they get in the write-up. Capiti-E shows 2.93% FPR and Capiti-C shows 6.60%. Those sound low but in a research context could lead to legitimate pushback. Figure S2 shows Capiti underperforming on alanine scanning knockouts at 64% accuracy without the gate, and this is the type of work that legitimate research often involves: knocking out active sites, etc. The gate improves performance, but the text "Capiti gate is a small model modification that can improve performance on ala_scan, by forcing probability to zero when active-site mutations are identified by the model" would imply knowing the active sites a priori, which might limit generalizability.

The negative controls were interesting and useful, and get after the question of sequence vs. function, but also including a set of truly benign sequences from all manner of synthesis orders or biological organisms would make for an interesting performance comparison. I am specifically thinking of molecular mimicry here. Some pathogens are under selective pressure to have certain proteins appear similar to host proteins, thus limiting immune responses against them. Would some of the host proteins get flagged?

"Intriguingly, there is substantial variability: some proteins can be recognized early, whereas others need much more context before Capiti can reliably call them." This is intriguing and worth thinking about more in the future. I could imagine molecular mimicry also coming into play here: a pathogenic protein that looks like a benign protein may be harder to call early.

The dual-use statement might be a little weak. Even describing how to generate the variants could be considered an attention hazard. Also, Capiti itself would be a dual-use concern, no? If one used it to identify which sequences would not be caught and then ordered those for synthesis. Overall though, really impressive and great work, and the comments above are intended to help with thinking through next steps and strengthening the project.
Review 2
The problems you addressed and their framing were clear. You achieved quite a bit in 2-3 days for a proof-of-concept: targeting desktop synthesizers with locally embedded screening, and also including LARGs as neglected sequences of concern. While it was mentioned a few times as a screening tool, I was curious if SecureDNA was not an option due to time/access, as it would have been good to see its results against your testing data set. I found your list of limitations and considerations to be rigorous, and I agree that validation with ESMfold would be a good next step. While you mentioned how Capiti may need to be physically integrated to prevent tampering, I think it would be useful to consider how such a system would be updated over time, and if that would expose vulnerabilities. And for a dual-use concern, could emusynth be deployed within a synthesizer to spoof valve control signals to mislead a Capiti-like system?
Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.
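The core assembly idea is simple enough to sketch. Below is a minimal, illustrative version of overlap-graph contig assembly: greedy merging of the best exact suffix-prefix overlap. The function names and the minimum-overlap parameter are stand-ins, not OliGraph's actual API, and this toy ignores reverse complements and the PCA-chemistry mode.

```python
# Toy overlap-assembly sketch (illustrative; not OliGraph's implementation).

def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a.endswith(b[:k]):
            best = k
    return best

def assemble(oligos: list[str], min_overlap: int = 6) -> list[str]:
    """Greedily merge the best-overlapping pair until no overlap remains."""
    contigs = list(oligos)
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs          # no assemblable structure left
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)

# Three innocuous-looking oligos reassemble into one contig; detection then
# happens downstream by BLASTing the contig, not the fragments.
pool = ["ATGGCCATTGTAATGGGCCG", "GGGCCGCTGGAGTTCGTG", "TTCGTGACCGCCGCC"]
print(assemble(pool))
```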
Review 1
What an exciting project! I learned a bunch by reading it. I have very little to critique here - the write-up was easy to follow (probably could have been somewhat shorter), the results were impressive, the limitations were stated very clearly. Good work.
Review 2
This is an important piece of work in the sequence screening space to address potential vulnerabilities in oligo synthesis. While there are many existing tools for in silico assembly from contigs (e.g., https://edinburgh-genome-foundry.github.io/DnaCauldron/), this is one of the few works that combines the approach with a practical implementation of screening. Since PCA was first introduced, many other assembly methods have been developed that use different techniques and short overlaps or none at all. OliGraph would benefit from supporting a larger set of genetic engineering techniques for assembly.
Review 3
This is a strong submission. The problem is real, the gap is well-documented, and delivering a working open-source tool that closes it is a legitimate contribution. The assembly logic is technically sound — contig reconstruction, threshold precision, and signal separation from background noise all perform as described. The paper is unusually clear for a hackathon submission, and the dual-use appendix reflects genuine biosecurity literacy.

The natural next step is validation on real commercial order data. Self-generated test sets are a reasonable starting point for a hackathon, but the path to regulatory relevance runs through synthesis providers. Getting one or two to run OliGraph against their actual order pipelines would transform this from a promising proof of concept into something policymakers can cite. The EU Biotech Act angle is well-positioned — following through on provider coordination would make that connection concrete rather than aspirational.

On deployment, the paper frames OliGraph as a tool for synthesis providers but doesn't address how it fits within existing screening infrastructure. The cross-provider evasion problem is acknowledged but quickly set aside as a shared limitation. It deserves more than a footnote. A more complete deployment picture would include integration with customer order logs to flag repeat fragmented purchases across time. Current frameworks require providers to conduct both sequence screening and customer verification — knowing who is ordering, not just what. Extending that to assembly-aware oligopool screening would require policy consideration around flagging criteria, escalation thresholds, and what triggers a hold on an order. That operational layer is absent and worth developing.

The more interesting tension is with cryptographic screening like SecureDNA. In cryptographic screening, sequences never leave the provider's premises, and the screening service never sees them in plaintext. OliGraph requires the opposite: you have to reconstruct the full assembly in the clear to know what the pool encodes. These approaches are complementary but in fundamental tension. Making assembly-aware screening privacy-preserving is an open problem; multi-party computation over graph structures is an active research area, but has not been applied in this context. Engaging seriously with this would be the most significant contribution a follow-on version of this work could make.

Two practical issues worth fixing before wider deployment. The web interface doesn't expose the PCA mode toggle — the most biosecurity-relevant feature requires CLI access. And when the tool returns zero results, it does so silently, with no indication of whether the pool is clean or whether a parameter needs adjusting. For the non-specialist operator this paper envisions, that's a dead end.

One framing clarification: OliGraph is an assembly engine. Detection happens downstream in BLAST. Being precise about where OliGraph ends and screening begins would sharpen the contribution and set more accurate expectations for anyone building on this. Solid foundation, worth developing further.
This project is trying to solve the problem of split-order screening, which is one of the biggest current vulnerabilities in DNA synthesis screening. We present an open-source project for benchmarking and evaluating obfuscated split-order detection methods. Our central finding is that split-order detection is technically tractable even under realistic obfuscation.
Review 1
Solid benchmark, and the minimap2 result is useful! Two things I'd push on:
- The "tractability" headline feels stronger than what you actually showed. You proved single-pool detection works under your obfuscation set, which is great, but the discussion itself flags cross-vendor correlation and AI-designed variants as the real unsolved problems. The framing would land better if it matched the discussion!
- FPR on benign research pools is the missing piece for me. F1 is great but a synthesis company's first question is "how often does this flag legitimate orders." Adding that sweep would make the work way more useful to an actual deployer.

The Evo 2 angle in your future work is the one I'd love to see next.
Review 2
This is a well-done hackathon-scale project: clearly and tightly scoped, well positioned relative to other work, succinctly but precisely described, and clearly useful to others. My primary quibble is that the connection to benchtop synthesizers is not clear to me. Split orders are normally discussed as a way to avoid the screening done by commercial synthesis companies; AFAIK, benchtops don't (yet) have screening at all.
Benchtop DNA synthesizers have no mandatory screening today — and existing tools have a structural blind spot: they evaluate DNA fragments individually, missing split-order attacks where multiple innocuous-looking fragments assemble into a sequence of concern. We present the Fragment Assembly Risk Scorer (FARS), a prototype on-device detector that scores orders by collective assembly potential rather than individual fragment identity. Tested against 960 simulated orders across three real 1918 H1N1 genomic segments from NCBI GenBank, FARS achieves 100% detection of high- and medium-coverage split orders with zero false positives. Most importantly, on partial split orders, FARS detects 80% compared to 40% for an individual-sequence baseline modeled on IBBIS commec's methodology, doubling detection while eliminating false positives entirely. The 20% that evade local detection define a precise empirical boundary motivating shared cross-device infrastructure. Our head-to-head comparison is the first empirical quantification of what assembly-awareness buys in split-order detection on real genomic data — and directly characterizes the gap left by S.3741's mandated screening approach.
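The scoring idea generalizes beyond FARS's specific implementation: rate an order by how much of a sequence of concern its fragments collectively tile, not by any single fragment's identity. A minimal sketch, with a toy exact matcher and an illustrative coverage cutoff (a real system would align fragments with BLAST or minimap2):

```python
def fragment_hits(fragment: str, target: str) -> list[tuple[int, int]]:
    """All exact placements of `fragment` on `target` (toy matcher)."""
    hits, start = [], target.find(fragment)
    while start != -1:
        hits.append((start, start + len(fragment)))
        start = target.find(fragment, start + 1)
    return hits

def collective_coverage(order: list[str], target: str) -> float:
    """Fraction of `target` covered by the union of all fragment hits."""
    covered = [False] * len(target)
    for frag in order:
        for s, e in fragment_hits(frag, target):
            for i in range(s, e):
                covered[i] = True
    return sum(covered) / len(target)

target = "ATGAGTCTTCTAACCGAGGTCGAAACGTACGTTCTCTCTATCGTCCCGTCAGGCC"
order = [target[0:20], target[15:35], target[30:]]   # innocuous-looking pieces
cov = collective_coverage(order, target)
print(f"coverage = {cov:.0%}")            # 100%: the order tiles the target
print("FLAG" if cov >= 0.33 else "PASS")  # cutoff is illustrative
```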
Review 1
I liked the well-defined scope and clear result of this submission. While a discussion of the limitations is present, I would have appreciated a more in-depth discussion of their implications. For example, the limitation of AI-enabled design is flagged but not really grappled with: the results section could have discussed what the detection boundary would look like against codon-optimized or AI-designed sequences. Additionally, while a great technical paper, the submission could have benefited from a summary in simple terms so that educated people in the field with no specific knowledge of synthesis screening could better grasp the problem and the corresponding solution.
Review 2
Clean writeup, single question asked and answered, impressive for a solo project. 80/40 table lands the point fast, and the policy hook (33%/11% as the line where local detection breaks) is a smart frame. Weakness: It's all simulated orders + only point mutations as the evasion model. You flag both honestly.
AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.
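Review 2 below quotes the cutoffs the authors used for each tier. Treating the "and/or" phrasing as an inclusive-or, the tiered gate can be sketched as follows; the hit schema and tier ordering here are assumptions for illustration, with the cheap sequence tier run first so the expensive structure tiers only see what evades it.

```python
from dataclasses import dataclass

@dataclass
class Hit:                       # best match against the toxin databases
    identity: float              # percent identity
    rmsd: float | None = None    # angstroms (structural tiers)
    evalue: float | None = None  # FoldSeek e-value
    tm: float | None = None      # FoldSeek TM-score
    z: float | None = None       # DALI z-score

def mmseqs2_flag(h: Hit) -> bool:             # tier 1: sequence homology
    return h.identity > 90.0

def foldseek_flag(h: Hit) -> bool:            # tier 2: global fold similarity
    return ((h.identity > 30.0 and (h.rmsd or 99) < 2.0
             and (h.evalue or 1) < 1e-4) or (h.tm or 0) > 0.9)

def dali_flag(h: Hit) -> bool:                # tier 2: local site geometry
    return (h.z or 0) > 10 or h.identity > 30.0 or (h.rmsd or 99) < 2.0

def screen(seq_hit: Hit, fold_hit: Hit, dali_hit: Hit) -> str:
    if mmseqs2_flag(seq_hit):
        return "FLAG (sequence homology)"      # cheap tier; no folding needed
    # only now pay for AlphaFold2 prediction plus structural search
    if foldseek_flag(fold_hit) or dali_flag(dali_hit):
        return "FLAG (structural similarity)"
    return "PASS"

# A low-homology de novo ricin analogue: no sequence hit, strong DALI hit.
print(screen(Hit(identity=12.0), Hit(identity=15.0, tm=0.4),
             Hit(identity=31.0, z=14.2, rmsd=1.1)))
```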
Review 1
The most interesting finding here (AI-designed Munc13 binders accidentally resembling neurotoxins with no toxin input) is buried at the end. Should probably lead with it. 24/24 catch rate via DALI is great but presentation is a weak spot: figure wrapping in related work breaks the read, and the 20-page appendix has no summary. This is honest about false positives.
Review 2
Functional/structural screening is very important and it seems you made real progress toward that, kudos! I think using ricin as a central example makes sense given that it's a favourite among would-be terrorists. However, I wonder whether there is some cherry-picking where certain proteins perform unusually well in your pipeline, maybe because they are well-represented in the dataset or similar, whereas your pipeline would struggle to detect functional/structural analogues in other proteins (?). This is just a guess; I am not an expert in this specific area.

It would also be interesting to see how this pipeline performs on viral proteins such as coronavirus family spike proteins or influenza H/N proteins. More infohazardous of course!

"a risk acknowledged by the October 2026 revision to the OSTP Nucleic Acid Synthesis Screening Framework, among other policy memos" --> either a typo for the year or LLM hallucination.

You mention the pipeline is bottlenecked by compute resources for use at scale. I'd be curious to see cost and speed comparisons with sequence homology screening.

This is asking too much for a hackathon submission, but if this were a full article I would like an explanation of why exactly these values were chosen and what they actually represent: "DALI similarity scoring was determined if a protein matched a toxin with a z score greater than 10, an identity score greater than 30%, and/or an RMSD less than 2. MMseqs2 similarity was determined if the identity score was greater than 90% than a known toxin sequence. FoldSeek similarity was determined by an identity score greater than 30%, RMSD below 2, e-value below 1e-4, and/or TM score greater than 0.9."

The Limitations and Future Work section makes lots of good points. I would have appreciated a short discussion of next steps for how false positives and false negatives in your pipeline could be reduced.

IMHO your submission is a bit too biased toward arxiv-style academic writing. That's great for a very particular researcher audience, but as a judge who is more in policy-world, it's a bit hard to follow. I think you could take inspiration from the writing of research outputs at places like METR or GovAI, who do a great job at writing with rigorous clarity that is still accessible to non-experts. Hence slightly lower points for presentation/clarity, but your technical contribution is fantastic.
Review 3
An impressive amount of work covering a known gap in biosecurity. While there is similar work going on in this area, this is a nice piece of work that strings together several public tools into a robust demonstration and proof of principle. The paper is clearly written with a strong ToC, and the table is nice and very information dense. A small legend on the second table in the appendix indicating what the colors mean would be nice to have, but that is minor.

For a weekend hackathon, the breadth of the testing dataset is very impressive. Testing against a panel that includes natural toxins, benign structural mimics, amyloid proteins, de novo ricin variants, and AI-designed Munc13 binders demonstrates a deep understanding of the problem space and a very rigorous examination of the screening pipeline. The truncated de novo ricin sequences to simulate a partial-synthesis evasion attack were a strong inclusion. Proving that sequence alignment caught 0/24 of these while your DALI pipeline caught 24/24 is a great way to highlight the importance of moving beyond homology methods. You do well in discussing the false positives and current limitations, flagging that host proteins co-crystallized with toxins get called, and you lay out future directions clearly and with a very strong understanding of the problem space and ongoing work.

The tiered design is clever. Using MMseqs2 for fast, low-compute initial gating before moving to computationally expensive tools like AlphaFold2, FoldSeek, and DALI is practical and logical system design. But, and you touch on this, running AlphaFold2 and DALI on every single sequence that bypasses MMseqs2 is currently too computationally expensive for commercial-scale adoption. That may just be a problem that resolves as compute gets cheaper.

"Finally, trypsin (14), concanavalin A (19), and thrombin (22) were confidently flagged as toxic across all methods, raising a broader question that this and any biosecurity screening tool must explicitly address: what threshold of danger justifies synthesis restriction? The distinction between "toxic in some biological context" and "dangerous enough to warrant screening" is not currently defined in any of the databases used here." - Excellent point. These aren't really biosecurity threats, and a smart synthesis screening system should not flag them, but I agree that is broader work and a bigger question than fits within this project and a hackathon.

Future work could focus on a few areas. To solve the compute bottleneck, potentially insert an intermediate machine learning step between MMseqs2 and AlphaFold2. As you mentioned in the report, using a lightweight classifier built on embeddings from a protein language model (like Evo 2 or ESM-2) could quickly filter out non-homologous benign proteins, reserving AlphaFold2 strictly for highly suspicious sequences. A good path to explore further. Also consider moving away from binary "toxic/non-toxic" outputs: work toward a preliminary heuristic or risk score that considers the exact structural hit, flagging an active-site match for botulinum neurotoxin as a 'critical flag/block' while flagging an overexpressed protease active site as 'requires manual/human review,' mimicking some current screening approaches. As LLMs improve, one could also think of sending flags to an LLM for a review step, which then sends its summary along to a human reviewer.
In future presentations, include a 3D structural overlay (e.g., in PyMOL) that shows how your de novo ricin aligns with the active site of natural ricin despite zero sequence homology. Visuals make structural bioinformatics much more accessible to policymakers, the public, and generalist judges etc. Overall, really great work!
OmnyraCloud is protocol-level biosecurity screening for cloud lab workflows. Today's tools screen DNA sequences at order time — but cloud labs run workflows, and a chain of individually benign steps (serial passage, split orders, surface obfuscation) can pursue a dangerous objective without a single flagged sequence. That's the gap we close. OmnyraCloud ingests any lab protocol (Autoprotocol, Opentrons, JSON, or free text) and runs a 5-stage pipeline: decompose the workflow → score 5 risk dimensions → ground every flag in retrieved biosecurity literature → audit with LLM-as-judge → cross-check sequences via IBBIS commec. Output: an auditable risk report with citations, not a black box. IBBIS commec flagged only 1 of 3 dangerous protocols (two sequences were screened but returned no HMM matches), while protocol-level reasoning caught all three. Protocol-level screening isn't just complementary to sequence screening. It's essential. Live at https://omnyra-cloud.vercel.app/
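The pipeline's shape is straightforward to express as an orchestration skeleton. Everything below is a stub (the real stages are LLM calls and a commec invocation), and the names and report schema are illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class Flag:
    step: str                                   # workflow step that triggered it
    dimension: str                              # one of the 5 risk dimensions
    score: float                                # 0-1 risk score
    citations: list[str] = field(default_factory=list)

def decompose(protocol_text: str) -> list[str]:
    """Stage 1: split a protocol (Autoprotocol/Opentrons/JSON/free text)
    into discrete workflow steps (stubbed as sentence splitting)."""
    return [s.strip() for s in protocol_text.split(".") if s.strip()]

def score_step(step: str) -> list[Flag]:
    return []                                   # Stage 2 stub: LLM risk scorer

def ground(flags):   return flags               # Stage 3 stub: attach citations
def audit(flags):    return flags               # Stage 4 stub: LLM-as-judge
def commec_check(t): return []                  # Stage 5 stub: commec screen

def screen_protocol(protocol_text: str) -> dict:
    steps = decompose(protocol_text)
    flags = [f for s in steps for f in score_step(s)]
    return {"steps": steps,
            "flags": audit(ground(flags)),       # auditable, cited flags
            "sequence_hits": commec_check(protocol_text)}

print(screen_protocol("Passage the isolate through ferret hosts. Sequence survivors."))
```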
Review 1
Fantastic idea. Best of luck with Omnyra. As you said, the data size was sufficient for a proof-of-concept. This could have actual legs once a statistically significant number of protocols are evaluated.
Review 2
I'm not convinced that cloud labs are (as yet) a significant area of risk, but this tool might also be useful to CROs. The way that the tool is constructed seems robust and comprehensive, though. Also, just a small note: the table and graph say 'H7N9' but it should be 'H5N1'.
Review 3
The threat model is formally defined, the problem is real and underaddressed, and the retrieval-grounded reasoning chain is exactly the right design choice for a tool that needs to be auditable by human biosafety reviewers.

The evaluation is the ceiling. Five protocols (three of the most canonical DURC examples in the literature, two of the most obviously benign controls) is the easiest possible test set. Perfect F1 = 1.0 on that basis proves the system works in principle, not that it works in practice. The harder questions are the false positive rate on ambiguous legitimate research and the false negative rate on novel techniques outside the current threat taxonomy. Neither is tested, and both matter more for deployment than catching H5N1 and SARS-CoV-2.

The multi-protocol correlation gap is worth flagging beyond future work. An adversary who knows protocols are evaluated in isolation will simply split a dangerous workflow across multiple low-risk submissions. That's not a limitation to address later; it's a fundamental constraint on the current system's deployment value and should be stated more prominently.

Solo project, weekend build, live deployment, formally grounded threat model. This is the kind of work that should be developed further.
Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.
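The scoring arithmetic implied by the abstract is worth making explicit. A sketch, where the per-(model, attack) ASR is the mean judged score over prompts; the exact aggregation of the two bio-specific dimensions is an assumption, though StrongREJECT combines its dimensions in a similar spirit:

```python
def response_score(refused: bool, specificity: float, actionability: float) -> float:
    """0.0 = full refusal; 1.0 = specific, actionable misuse content.
    `specificity` and `actionability` are judge ratings in [0, 1]."""
    if refused:
        return 0.0
    return (specificity + actionability) / 2

def attack_success_rate(scores: list[float]) -> float:
    return sum(scores) / len(scores)

# One model under one attack, across three prompts:
judged = [(True, 0.0, 0.0), (False, 0.9, 0.8), (False, 0.3, 0.1)]
asr = attack_success_rate([response_score(*j) for j in judged])
print(f"ASR = {asr:.2f}")   # 0.35 here; 0.00 would mean every prompt refused
```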
Review 1
- Explain why you think we need a bio-specific jailbreaking benchmark. Can't we assume it mostly matches the ASRs on general jailbreak benchmarks? Is there anything that makes us assume it might be different for bio than for other topics?
- You could keep the introduction more crisp. The first half-page on general AIxBio advances and policy doesn't add a lot.
- A comparison from your results to the results from the Biothreat Benchmark Generation framework and other jailbreaking results would have been interesting to highlight the marginal value add of your work.
- Overall cool project, clean implementation and good writing, good job guys :)
Review 2
Tests whether bio-misuse safeguards in frontier LLMs actually hold under adversarial pressure, using four attack methods against four models across 40 bio-specific prompts. Responses are scored on three dimensions: whether the model refused, how technically specific the response was, and whether it was structured as something someone could follow. Results show a stark divide: Claude and GPT maintain meaningful safeguards; DeepSeek and Kimi barely refuse even direct bio-misuse requests.

Why it matters: We've had benchmarks measuring what models know about biology and generic jailbreak benchmarks that treat bio as one category among many. What we haven't had is a systematic test of whether bio-specific safeguards hold when someone actively tries to get around them. The finding that Crescendo (gradual conversational escalation) is the most effective attack (and produces the most actionable responses, not just the most responses) has direct implications for how we think about multi-turn risk.

What's strong: The specificity/actionability scoring split is the right design for biosecurity (i.e., knowing what reagents to use is a different failure than knowing how to combine them). The prompt construction methodology is transparent and thoughtful. The rubric could serve as a standalone reference for future work. The Western/Chinese model comparison is policy-relevant.

What's missing: The automated judge hasn't been validated against human experts. DeepSeek is both a target model and the judge, which introduces self-scoring bias. Kimi is both a target and the attacker in iterative attacks. The prompt set is small and deliberately selected to elicit failures — reported refusal rates shouldn't be read as base rates.
HydraWatch: Embedding-based wastewater pathogen surveillance for federated hospital networks

A reference-free, privacy-preserving wastewater pathogen surveillance pipeline for federated hospital networks. Each hospital sequences its own sewershed, embeds reads with DNABERT-2 (768-dim), and trains a local TE-VAE (Transformer-encoder VAE) on the classified pool to define "site-normal." A hybrid score (reconstruction error plus latent Mahalanobis distance) flags anomalous reads in the unclassified pool, the blind spot where novel pathogens hide because reference-based tools like Kraken2 can't see them. Anomalies are clustered with HDBSCAN and tracked across timepoints. Trajectory analysis flags four patterns: emerging (rising over time, including signals that appear only at the latest timepoint), persistent, transient, and declining. The early-warning signal is anything new or accelerating.

Cross-site detection happens by query, not data: a hospital sends a single 768-dim cluster centroid (around 3 KB) to peer sites, who match locally and reply. Raw reads and read-level embeddings never leave the site, sidestepping the data-sharing agreements that typically slow multi-site surveillance.

Pilot: three timepoints from one NY hospital sewershed (CASPER PRJNA1247874). One dominant emerging cluster (cluster 6) grew from 285 reads at T1 to 3,506 reads at T3 (×12.3 growth), driving the early-warning signal. BLAST anchoring of representative reads is queued.

Scaling: 5 NY hospitals, then the Northeast region, then CDC. The same cluster signature at multiple sites equals an outbreak signal, ideally surfaced before clinical case counts rise.

Stack: DNABERT-2 (Hugging Face + PyTorch), TE-VAE (TensorFlow), HDBSCAN, BLAST. ESM-2 multi-view piloted on one timepoint; METAGENE-1 is a clean upgrade path.
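The hybrid score is the pipeline's numerical core, and it reduces to a few lines. A sketch, with the equal weighting and the encode/decode interfaces as assumptions rather than HydraWatch's exact code (the toy uses a 4-dim identity "VAE" so the score reduces to the Mahalanobis term):

```python
import numpy as np

def hybrid_score(x, encode, decode, mu, cov_inv, alpha=0.5):
    """x: read embedding (768-dim in HydraWatch; 4-dim in the toy below).
    encode/decode: the trained TE-VAE halves. mu, cov_inv: latent mean and
    inverse covariance of the 'site-normal' training pool."""
    z = encode(x)
    recon_err = float(np.mean((decode(z) - x) ** 2))   # reconstruction term
    d = z - mu
    mahal = float(np.sqrt(d @ cov_inv @ d))            # latent-distance term
    return alpha * recon_err + (1 - alpha) * mahal

rng = np.random.default_rng(0)
normal_pool = rng.normal(size=(1000, 4))               # stand-in site-normal
mu = normal_pool.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal_pool.T))
enc = dec = lambda v: v                                # identity stand-in VAE

print(hybrid_score(rng.normal(size=4), enc, dec, mu, cov_inv))        # low
print(hybrid_score(rng.normal(size=4) + 6.0, enc, dec, mu, cov_inv))  # high
```

Reads whose score exceeds a site-calibrated threshold would then go to HDBSCAN clustering and trajectory tracking.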
Review 1
Impressive hackathon submission with high potential impact when fully developed and deployed. Submission includes a design, prototype, and proof-of-concept for federated detection of pathogen incidence dynamics via foundation model embeddings of hospital wastewater sequences. Only post-embedding centroid clusters are shared between hospitals, allowing for quick data sharing and tracking of pathogen spread, avoiding delays stemming from data privacy regulations. Additionally, foundation model embedding clusters can identify incidence dynamics of "unclassified" potential pathogens for which reference sequences do not yet exist in current database-based pathogen monitoring systems. The included code repository with submission slide deck help clearly explain the project and results. The gap that is filled by this technology is clear.
Review 2
Reference-free anomaly detection on the Kraken2-unclassified pool (the actual surveillance blind spot) combined with a federated-by-query architecture that never moves raw reads is the most technically sophisticated work in the batch. The ×12.3 growth in cluster 6 is a compelling signal, but the load-bearing next step is BLAST anchoring: an embedding-space trajectory without biological identity is a watch-list item, not a confirmed finding. Swap in METAGENE-1, get the BLAST results, and run one real two-site centroid-query exchange to activate the architecture you've designed.
Review 3
I thought the authors could benefit from demonstrating more clearly why the gap of unclassified samples matters from a biosecurity standpoint. The authors note that an approach like Kraken2 "works very well for organisms that already have well-sequenced close relatives in the reference database — known pathogens, well-studied commensals, common viruses. It works poorly for everything else." It's not obvious to me what that "everything else" is that presents a significant concern, particularly when the authors acknowledge the approach doesn't identify any specific organism. Even with the comparison against baseline rates, it seems like that would potentially introduce a variety of false positives, since it may simply be picking up some new, non-pathogenic bacteria. I was also a bit confused by the pilot: my understanding is that the hackathon effectively lasts only a weekend, so how did the authors have time to conduct a pilot study over a couple of months? I did appreciate the authors' discussion and inclusion of implementation details on scaling up multiple levels of surveillance.
India surveils pandemic risk mainly through clinical case reports, which lag the underlying transmission by one to two weeks. No public, AI-fused, district-level outbreak signal exists for the country's 640+ districts. Pandemic Watch is an open-source biosurveillance dashboard that tries to close that gap. It pulls together five very different early-warning signals (news and ProMED chatter, climate suitability, search-trend anomalies, wastewater viral RNA, and weighted mention counts) and turns them into a single risk score for each of six WHO-aligned viral pathogen families, computed independently for every district. A small set of LLM agents handles the language-heavy parts of the pipeline (filtering noise, removing duplicate reports, attributing mentions to the right pathogen family, and writing a short human-readable briefing with per-family evidence). On a retrospective replay of the COVID-19 declaration in Thrissur, Kerala (30 January 2020), the fused score crosses the High band a full week before news chatter peaks, driven almost entirely by the wastewater signal. The system is validated on six historical Indian outbreaks and the entire stack is released openly. The same architecture transfers without modification to any LMIC with district-level geography.
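The fusion step described above amounts to normalizing each signal against its own history, weighting, and squashing into a bounded risk score. A minimal sketch; the weights, the logistic squash, and the band cutoffs are illustrative assumptions (the reviews below discuss how they should be chosen):

```python
import math

WEIGHTS = {"news": 0.25, "climate": 0.15, "trends": 0.20,
           "wastewater": 0.30, "mentions": 0.10}        # assumed values

def zscore(value: float, history: list[float]) -> float:
    mean = sum(history) / len(history)
    sd = (sum((h - mean) ** 2 for h in history) / len(history)) ** 0.5
    return (value - mean) / sd if sd else 0.0

def fused_risk(signals: dict, histories: dict) -> float:
    """One district, one pathogen family: weighted z-scores -> logistic."""
    s = sum(WEIGHTS[k] * zscore(signals[k], histories[k]) for k in WEIGHTS)
    return 1 / (1 + math.exp(-s))                       # squash to (0, 1)

def band(risk: float) -> str:
    return "High" if risk >= 0.8 else "Elevated" if risk >= 0.5 else "Low"

risk = fused_risk(
    {"news": 12, "climate": 0.7, "trends": 3.1, "wastewater": 8.2, "mentions": 5},
    {"news": [2, 3, 1], "climate": [0.6, 0.7, 0.65], "trends": [1.0, 0.9, 1.2],
     "wastewater": [0.5, 0.4, 0.6], "mentions": [1, 2, 1]})
print(band(risk))
```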
Review 1
Overall really cool submission. I've been banging the drum about the need for combining different surveillance signals (news, wastewater, syndromic, etc.) into a single indicator for a while, so it's great to see this.

What is the cost of the pipeline per day or per action (whatever the most relevant metric is)? How would the cost change if using state-of-the-art models instead of GPT-4o? Would the performance be significantly better?

Where is the RNA wastewater data coming from? I'm surprised it exists this granularly for districts in India. Maybe I'm misunderstanding what the data looks like. -- Ah, now I see that it is synthetic. This is obviously the biggest limitation. I'd like to see the approach replicated with real data that is likely very noisy.

I was reminded of this paper, could be interesting for you: https://pubmed.ncbi.nlm.nih.gov/38815175/ There is also EIOS from WHO, not sure how good it is. https://www.who.int/initiatives/eios

Very interesting result that there is a 7-day lead time for WBE for COVID-19 in India in this post-hoc analysis. Unsure how much to trust this result but it's a valuable sign! (ok I realise now it's synthetic data)

I would've liked to see a discussion on how hard it would be to implement this system for other countries or for the entire world. While you fuse 5 signals, it seems like the synthetic WBE data is doing the heavy lifting in the signal and it's unclear to me how much the other signals contribute. This is also because syndromic clinical data isn't yet incorporated.
Review 2
The architecture is neat and has most of the datasources I would expect. I would add syndromic surveillance where you can (e.g. UKHSA Real-time Syndromic Surveillance Dashboards). The ablation table is clear and a great way to present the results. Doing the wastewater-based parts on synthetic data seems unavoidable at this point, but I have some worry about circularity with the Peccia et al. paper. The synthetic WBE data is generated with a sigmoid ramp centered at T−6, directly calibrated to the Peccia results. The system then z-scores this ramp, which saturates around T−7, and the paper reports a 7-day lead time. The chain is: Peccia says 6–8 days → the synthetic ramp encodes 6 days → the pipeline recovers ~7 days.
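To make that circularity concrete, here is a toy reproduction of the chain, with all values illustrative: generate the synthetic wastewater signal as a sigmoid centered six days before declaration, then ask when a z-score detector fires. The lead time that comes out is essentially the parameter that went in.

```python
import math
import random

random.seed(0)

def wbe(t: int, center: int = -6) -> float:
    ramp = 1 / (1 + math.exp(-(t - center)))     # sigmoid centered at T-6
    return 0.05 + random.gauss(0, 0.005) + ramp  # noise floor + ramp

series = {t: wbe(t) for t in range(-30, 1)}      # T0 = declaration day
base = [series[t] for t in range(-30, -15)]      # pre-ramp "normal" window
mu = sum(base) / len(base)
sd = (sum((x - mu) ** 2 for x in base) / len(base)) ** 0.5

fire = next(t for t in range(-15, 1) if (series[t] - mu) / sd > 3)
print(f"detector fires at T{fire:+d}")  # days-early 'lead time' ~ the T-6 input
```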
Review 3
This project seems well-executed, acknowledges its limitations, and shows some solid engineering skills! Most of my critiques are around the framing, impactfulness, and the science-side execution or modeling issues:
- So this is framed as an early warning system, but due to the operational feasibility constraints outlined, it has to be event-driven. The triggering events are mentions in news or ProMED, so the early warning system only triggers when people already know about it and are talking about it on ProMED or in the news. The aggregation of data to get signal from multiple places is smart and could still be useful, but I would not frame this as an early warning system -- it's more like an aggregation and interpretation layer that triggers after a pandemic has kicked off to help people gain context. Since the project evaluates the continuous monitoring version and not the event-driven version, the actual deployable version of this could be much weaker.
- The parts of this that could serve as early warning are the Google Trends and the wastewater anomaly. I think a stronger version of this would use the anomaly in these as a trigger for other agents, rather than news as a trigger for investigating the wastewater anomaly, which seems somewhat backwards. If throwing LLMs at Google Trends could be used to reliably detect things before the news does, this would be cool, but the system would need to be engineered around this specifically.
- I'm sus of the inclusion of climate suitability, for a few reasons. (1) This may not be relevant for all pathogen families. (2) This is not really a signal in the way the others are, and I'm not sure it makes sense to include as part of a warning system. (3) The weighted sum equation does not appear to account for units or normalize each contributing factor, at least as documented.
- I'm not sure where the weights actually come from, but it seems odd to me that climate, which isn't a signal, would be weighted higher than the wastewater anomaly signal, which is the signal most capable of detecting an outbreak.
- Ultimately, doing multivariate anomaly detection that can pick up on unusual combos of signals would require a different model than this, which is essentially a weighted sum fed into a logistic.
- If there's a new pandemic spreading, we probably don't have the assay for it yet, so wastewater RNA monitoring systems won't really be able to warn you about it. Given that this is the part of the system most capable of detecting things ahead of the news, most of the project's pre-detection usefulness routes through wastewater monitoring of known pathogens.
- I do think something like this can still be useful post-detection rather than as a warning system, i.e. I could see something like this feeding into a dashboard for decision-makers that collates data and tracks what's going on geographically so they can manage an outbreak.
- Basically, I think the scaffold here is cool and well-engineered, so if it were built around a stronger model and set of signals, it could be useful (depending on downstream stuff like whether officials would find it useful in their workflow etc.)
We apply the mechanistic interpretability method of sparse autoencoders (SAEs) to the genomic foundation model METAGENE-1.
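For readers unfamiliar with the method, the object being trained is small: an overcomplete autoencoder that reconstructs the model's internal activations through an L1-penalized hidden layer, so that individual latents become sparse, interpretable features. A minimal sketch with illustrative dimensions and coefficients (not the paper's):

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_model: int = 1024, d_hidden: int = 8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)   # overcomplete expansion
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))               # sparse feature activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1: float = 1e-3):
    # reconstruction fidelity + sparsity penalty on feature activations
    return ((x - x_hat) ** 2).mean() + l1 * f.abs().mean()

sae = SAE()
acts = torch.randn(32, 1024)          # stand-in for METAGENE-1 activations
x_hat, f = sae(acts)
sae_loss(acts, x_hat, f).backward()
```

After training, one inspects which sequences maximally activate each latent, which is how organism-specific detector features are found.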
Review 1
One of the more rigorous executions that I've seen so far. The cross-delivery validation is very robust. I would like to see the raw residual probe baseline, but understand that it was already addressed in the limitations. If there's one thing to add, perhaps a dual-use section to address the high-consequence pathogens. Not much else to comment here; I really like the paper and how it was executed.
Review 2
The organism-specific detector result is the strongest finding. The depth analysis showing early layers maintain sparse organism detectors while later layers compress into distributed encoding is also a useful observation. The gap is that the paper doesn’t show the SAE decomposition adds value over simply probing the raw model representations. Demonstrating that SAE features match or beat raw probes while being more interpretable would close the argument. The paper could also be about half its current length without losing anything essential.
Review 3
Strong submission. Applying SAE interpretability to a metagenomic foundation model is a natural extension of recent work on protein language models, but hasn't been done at this scale or in this biosecurity context. The organism-specific detector finding is the standout result. Individual latents firing selectively on Norovirus or Human astrovirus sequences with 97-99% BLAST identity and zero false positives on non-pathogen sequences is a concrete, meaningful finding for anyone building auditable surveillance systems.

The main thing missing is a simple baseline. They train a classifier on SAE features and achieve an AUROC of 0.987. But they never train the same classifier directly on the raw model activations to compare. That experiment is a few lines of code given they already had the activations extracted. Without it, you can't tell whether the SAE decomposition adds anything beyond what the base model already encodes. The classification performance alone doesn't prove the SAE is doing useful work; it might just be reorganizing signal that was already there. That comparison is the experiment that would have made this substantially stronger, and it was well within reach on a hackathon weekend.

The cross-delivery validation is well-executed, and the multi-layer analysis is a genuinely interesting addition. The finding that organism-level specificity concentrates in early layers and compresses into distributed pathogen encoding by layer 32 is the kind of mechanistic insight this line of work needs more of.

The broader direction here is worth pursuing. Metagenomic surveillance models that can explain which specific features drove a pathogen flag and map those features to known organisms are meaningfully different from black box classifiers. That's the gap between a research finding and something a public health official can actually act on.
Most biosecurity governance tools look at either how dangerous an AI biodesign tool is, or how vulnerable an organization is, but nobody has connected the two into a single framework that organizations can actually use to assess themselves. This project builds one. Inspired by Asimov's Laws of Robotics and Anthropic's Constitutional AI, the Three Laws of AI Biosafety proposes three ordered governance principles that must be satisfied sequentially before a biotool can be considered governance-ready.
Review 1
We thank the author for this submission. The presentation of three core laws, together with readiness frameworks, for biological AI models is compelling. Furthermore, it is somewhat surprising that the author finds insufficient dual-use characterization, as opposed to access controls or technological development, as the prevailing failure mode. One of the most interesting components in this work is the assessment of readiness for a given biological model. The composition would benefit from discussion in the main text as opposed to the appendix. Furthermore, regarding amendments to the framework, what would inform the need for any alterations? Do you think that the approach itself will need to be updated with newer technologies?
Review 2
This is a clear and well-presented idea and provides a valuable framework which others could utilise or build upon. Having three clearly defined laws broken into smaller subsections is clever, useful, and provides a powerful framework to build governance into AI-BDTs / AI-BTs. The clear tiers and color-coding make it easily accessible to policy audiences or the general public, which helps communicate the risks and get governance systems built.

This especially stood out: "The consistent finding across all eight tools is that organizations know what their tools do, but have not formally assessed what their tools could be made to do." I feel like it is well-accepted in biosecurity that many of these tool makers are very excited by the research and scientific possibilities and never think of dual-use concerns. It is nice this helps formalize and highlight that better. I also think this tool helps with governance arguments if it had mandated adoption. I could see in the reporting that argument being made stronger. Some tool makers might balk at restrictions, but if ALL tools are assessed in the same, shared framework, it helps with uptake and acceptance.

Limitations are well flagged that this is 8 tools assessed by one reviewer, but it is a nice proof of principle for a hackathon. The color scheme took some thinking at first, or it could just be me. Green reads as 'not dangerous' when here it really means 'governance ready,' so more like 'danger can be mitigated.' But then two tools could be red where one is 'dangerous and not governance ready for XY reasons' and the other is 'not dangerous but not governance ready'? AI Capability is somewhat a broad term, and it is not inherently logical that dual use falls under it, to me at least.

The five-level maturity model seems reminiscent of the Capability Maturity Model Integration used in software. The report could have highlighted that if it was a source of inspiration, and if not, it is worth looking into for further refinement and work on the framework. I like the rhetorical framing with Constitutional AI and Asimov, but it is not exactly accurate: Anthropic's Constitutional AI refers to a specific RLAIF process, and Asimov's laws are maybe a bit more hierarchical than these, which are more sequential. But maybe I misread this, and better presentation and description would fix that. Overall a strong submission.
Review 3
Great work! The "Three Laws" framing is doing more rhetorical work than mechanical work though, the load-bearing part is a maturity rubric with a passing threshold and a stop rule. The dual-use under-characterization finding is a great real contribution and I'd lead with it. A sensitivity analysis on the threshold would tell us whether the gap you're diagnosing is real or a property of the cutoff.
As AI systems become more capable at biological design and benchtop DNA synthesizers more affordable, the biothreat bottleneck shifts from design to physical synthesis, beyond the reach of centralized customer screening. We introduce STAMP (Synthesis Tamper-evident Attestation and Molecular Provenance), a 120-base barcode that an HSM-equipped synthesizer stamps into a non-coding region of every DNA construct it produces, attesting that a sequence originated from a registered, untampered synthesizer and was not significantly modified post-synthesis. STAMP combines cryptographic anchoring with a novel content-aware landmark map enabling forensic reconstruction of post-synthesis modifications. Empirically, the encoder succeeds across N = 2000 random plasmids, and the privacy-preserving barcode landmark signal detects >=95% of kilobase-scale insertions. We do not claim to defeat attackers with jailbroken synthesizers; we prove that gap is irreducible. Instead, STAMP is a cost imposer and evidence generator: it converts every viable attack into a forensically suspicious artifact or a supply-chain-visible event.
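The payload structure is easy to illustrate: pack a synthesizer ID and a truncated digest of the order into bits and write them as DNA at two bits per base, giving a fixed-length barcode. The field sizes, the plain SHA-256 digest, and the missing error-correction and signature layers below are all simplifications of STAMP's actual design (Review 2 discusses the Hamming coding and the 12-bit ID width):

```python
import hashlib

BASES = "ACGT"                    # 2 bits per base

def bits_to_dna(bits: str) -> str:
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))

def barcode(synth_id: int, sequence: str, n_bases: int = 120) -> str:
    """Toy 120-base attestation tag: 12-bit synth ID + truncated hash."""
    payload_bits = n_bases * 2                     # 240 bits total
    digest = hashlib.sha256(sequence.encode()).digest()
    h = int.from_bytes(digest, "big") >> (256 - (payload_bits - 12))
    bits = f"{synth_id:012b}" + f"{h:0{payload_bits - 12}b}"
    return bits_to_dna(bits)

tag = barcode(synth_id=42, sequence="ATG" * 500)
assert len(tag) == 120
print(tag[:24], "...")
```

A verifier would recompute the digest from the observed sequence and compare it against the decoded barcode; in the full design, the landmark map additionally localizes where post-synthesis modifications occurred.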
Review 1
STAMP seems like a working technique for barcoding but I'm not seeing the specific problem it's trying to address. The case study presented shows that you would be able to detect a tampered plasmid using the STAMP barcode but (1) this doesn't always mean there's malicious intent and (2) the detection of tampering doesn't prevent acquisition of the plasmid. I understand the solution, but I am not seeing the problem this solves or the attacks that this mitigates. The attacks described in the submission point out attacks against STAMP, which I think are sufficiently mitigated. What seems missing to me are the risks in the biotech ecosystem that STAMP mitigates.
Review 2
Note: This is relevant for general trackability of synthetic sequences, but the question below about relevance for *AI* safety in particular is thus mis-posed. AI really doesn't have to be part of every question these days. So I'm leaving that set to its default neutral because that question doesn't match the more-general techniques in this submission. This is an interesting idea which has a number of issues which might make it infeasible, but still worth considering: (a) Synthesizers make errors. I see you're trying to ameliorate that with your Hamming codes, but those only cover the barcode itself. I'm assuming that the occasional errored landmark wouldn't be that critical because the chances of that one single base being wrong aren't individually high and the chances of *many* of the landmarks being wrong is even smaller, assuming that we can assume statistical independence. (b) ...buuut: doesn't even a single-bit error break the hash? Even a single incorrect landmark base would invalidate the hash, and I don't see way around this unless you run ECC/RS/Hamming along the landmarks, too, which I guess you could do. (c) It's unfortunate that this can only detect ~kb insertions. Yes, it'd be hard to do better without too much overhead, but it also means that someone who can just assemble from oligos will defeat this, so it's raising the bar but isn't going to be a complete solution. (d) HSMs are *expensive.* Having personally advised benchtop vendors on their machine security, it's apparent that even spending the small amount of extra money it takes for a single-board computer which allows adding a TPM chip (as opposed to ones which don't even have a place on the board one may be attached) isn't something vendors are going to do unless forced, such as via legislation. Expecting them to do a good job about secure boot chains and the like isn't likely in the near future absent some way to make this a commercial priority. (e) A public ledge is going to be a hard sell, because it's going to be hard to convince vendors you're not hiding information flows if the machine reaches out to the network -- but it's not an *impossible* sell. OTOH, claiming that you can "sunset" a public ledger is very likely infeasible because if it's public, someone can keep a copy. I don't see how you have both at once. (f) A 12-bit synth ID isn't long enough if this industry ever really takes off. You probably don't need IPv6's 128-bit overcompensation for IPv4's 32-bit limit, but you should think harder about a representation which can at least be variable-length if the number of synths grows. (g) I'm not sure an attacker couldn't just make their own primers. *Maybe* section 3.5 says they can't do this, but I'm uncertain.
Benchtop DNA synthesizers have democratized sequence generation, bypassing the traditional biosecurity chokepoint of centralized synthesis facilities; however, manufacturing synthesis hardware remains prohibitively difficult for isolated bad actors, so regulating the software on such machines, made by a few companies, remains a promising avenue towards security. We present SynthShield, a pre-screening and logging software that leverages ESM-2 protein language model embeddings to prevent bad actors from synthesizing malicious sequences. Present screening solutions (BLAST, etc.) only compare sequence similarity, missing functionally similar but physically distinct dangerous sequences, a growing concern in the new AI-assisted research regime that allows small teams of bad actors to iterate faster and with a greater available attack surface. Furthermore, current software has little public or even governmental observability for synthesized sequences, as protecting researchers' IP remains an important challenge. To tackle this, we pair our screening software with a tamper-evident black box and a public blockchain (Ethereum L2) to record sequence-aware hashes of synthesized DNA, which allows post-hoc identification of bad actors by law enforcement, even for CRISPR-altered bioweapons that were indirectly created with the help of these synthesizers, without revealing critical information about the precise sequence. In this paper, we evaluate our integrated pipeline and show that the ESM-2 screener achieves AUC 0.977 on a remote homology test set where the BLAST baseline achieves only AUC 0.711, an improvement of 0.266 that directly addresses the AI evasion scenario.
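The screening core reduces to embedding-space nearest-neighbor comparison. A sketch using a small public ESM-2 checkpoint; the pooling, the checkpoint choice, and the 0.9 cosine cutoff are illustrative, not SynthShield's published configuration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "facebook/esm2_t12_35M_UR50D"      # small public ESM-2 checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

@torch.no_grad()
def embed(seq: str) -> torch.Tensor:
    out = model(**tok(seq, return_tensors="pt")).last_hidden_state
    return out.mean(dim=1).squeeze(0)       # mean-pool residue embeddings

def screen(query: str, hazard_embs: torch.Tensor, thresh: float = 0.9) -> bool:
    """True -> block the run and write the order's hash to the audit log."""
    sims = torch.nn.functional.cosine_similarity(
        embed(query)[None, :], hazard_embs)
    return bool((sims > thresh).any())

hazards = ["MKTLLLTLVVVTIVCLDLGYT"]          # toy stand-in hazard database
hazard_embs = torch.stack([embed(s) for s in hazards])
print(screen("MKTLLLTLVVVTIVCLDLGYS", hazard_embs))
```

Because the comparison happens in function-aware embedding space rather than raw sequence space, homologs that BLAST misses can still land near their hazardous neighbors.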
Review 1
This felt like 3 papers stitched together. The ESM-2 vs BLAST piece is solid and would stand on its own. The blockchain + audit log + 5-attack-class stuff is described but not really tested, which made me trust the rest less. Would've been stronger at half the length, focused on ESM.
Review 2
The authors developed an ESM-2 screening protocol to detect AI-generated synthetic homologs and added a logging mechanism to their pipeline, as well as an assembly method for split orders. The end-to-end pipeline is multilayered and well executed, even if some version of each step in the process has been previously implemented. However, the study does have limitations, many of which the authors address at the end of the paper. In addition to what was mentioned in the paper, more limitations or suggestions would be:
1) Generating synthetic homolog test sets using other biodesign tools. Currently, ESM-2 is mainly designing the training and test sets. I would be curious how the metrics would change if the test set were generated with other tools.
2) The low number of sequences was acknowledged; moreover, different functional classes need to be tested with more complex mechanisms and proteins. Using the authors' method, this would require training with a very large dataset, but it may be possible to find efficiencies in a pipeline that would not require as much compute.
3) Using BLAST as a metric. I would have used the free screening tool as a basis of comparison rather than just BLAST.
Most AI biosecurity filters evaluate isolated prompts. This misses the real threat vector: dual-use biological capability is accumulated incrementally across long conversations. BioGuard shifts the screening boundary from the isolated prompt to the continuous conversational state. Testing against a live frontier model (GPT-5.4), we identified a severe safety-utility tradeoff: current frontier models achieve safety via broad refusals that actively disrupt legitimate bioscience workflows (triggering a ~4.5% false-positive rate). In contrast, BioGuard traces Biological Knowledge Transfer (BKT) across entire sessions. Our prototype demonstrates that by isolating multi-turn capability accumulation, we can maintain necessary safety visibility while preserving operational utility for benign scientific research.
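The session-level idea can be sketched compactly: score each turn on the three BKT dimensions, accumulate across the conversation, and gate on the running session score rather than any single message. The per-turn scorer below is a stub, and the decayed-running-sum aggregation is an assumption about, not a copy of, BioGuard's logic:

```python
from dataclasses import dataclass

@dataclass
class TurnScore:
    relevance: float    # misuse relevance, 0-1
    depth: float        # procedural/tacit depth, 0-1
    uplift: float       # marginal capability uplift, 0-1

def score_turn(user_msg: str, model_reply: str) -> TurnScore:
    """Stub: in the real system this is an LLM judge over the turn."""
    return TurnScore(0.0, 0.0, 0.0)

def session_risk(turns: list[TurnScore], decay: float = 0.9) -> float:
    """Decayed running sum: capability accumulated over the whole session."""
    risk = 0.0
    for t in turns:
        risk = decay * risk + t.relevance * t.depth * (1 + t.uplift)
    return risk

THRESHOLD = 1.5   # illustrative; needs per-deployment calibration
turns = [TurnScore(0.2, 0.1, 0.0), TurnScore(0.8, 0.6, 0.5),
         TurnScore(0.9, 0.8, 0.7)]
print("HOLD" if session_risk(turns) > THRESHOLD else "ALLOW")
```

No single turn above would trip a prompt-level filter, but the accumulated trajectory does, which is the point of moving the screening boundary.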
Review 1
This project is a thoughtful and creative approach to expand the biosecurity review to an overall conversation, rather than prompt- or end-product-level screening. The author also makes impressive efforts to make the BioGuard method interoperable and reproducible. I also appreciated the clarity with which contributions and methods are presented. There are several aspects that would improve the project. First, the "Depth" axis of Biological Knowledge Transfer (BKT) encompasses both procedural and tacit knowledge, and it would be useful to understand how each form of knowledge contributes to the scoring procedure. Second, the nature and overall layout of the benchmark used in this study is not yet clear, and would benefit from description in the main text. Finally, while the GPT 5-based filter does show a slightly elevated false-positive rate (0.045), its maintenance of excellent recall (1.000) and precision (0.965), versus BioGuard's recall of 0.289 and precision of 1.000, could be more beneficial in everyday applications. In other words, while the text states that there is a stark safety-utility tradeoff in favor of BioGuard due to the GPT 5-based filter's elevated false-positive rate, one could make the case that BioGuard's false-positive rate of 0.000, achieved at the expense of much lower recall, is the more substantial tradeoff.
Review 2
Proposes monitoring entire AI conversations for incremental biological capability accumulation (what the paper calls Biological Knowledge Transfer) rather than screening individual prompts or final outputs. Each conversation gets scored on misuse relevance, procedural depth, and capability uplift, producing an auditable decision record.

Why it matters: This is pointing at exactly the right problem. The capability uplift literature makes clear that dangerous knowledge accumulates across multi-turn interactions, not in single prompts. Current safeguards mostly evaluate messages in isolation. The conversational window in between is largely unmonitored, and that's where tacit knowledge transfer happens. If this worked, it would fill a critical gap in defense-in-depth.

What's strong: Excellent problem identification. The decision envelope design (with request IDs, thresholds, anomaly records, and audit logs) is governance-ready infrastructure that would be useful regardless of which detector sits behind it. The paper is clear, concise, and honest about what works and what doesn't.

What's missing: The detector catches only 29% of positive cases. That's too low for safety screening. More importantly, the ablation studies show that individual scoring components sometimes outperform the integrated multi-turn system (meaning the aggregation logic, which is the core contribution, is actually making things worse in some cases). The entire evaluation is on synthetic data, which can't test the indirect, contextual knowledge accumulation that the system is designed to catch. The keyword baseline detecting literally nothing raises questions about whether the benchmark is well-constructed.
Nucleotide virulence factor benchmarks are inflated by ~0.30 AUROC from organism confounds and gene-family leakage. Under same-strain negatives and gene-family-disjoint evaluation, a 64-dimensional codon-frequency representation with logistic regression generalises to novel genera with essentially no loss (gap = 0.006, p = 0.097, n.s.). The signal is HGT-derived codon usage deviation: linear, genus-invariant, and pretrain-free.
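For readers unfamiliar with the baseline this summary describes, a minimal sketch of a 64-dimensional codon-frequency classifier follows, assuming scikit-learn; the toy sequences and labels are placeholders, not the authors' data or splits.

```python
# Sketch of the codon-frequency + logistic-regression baseline (illustrative).
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]   # all 64 codons
INDEX = {c: i for i, c in enumerate(CODONS)}

def codon_frequencies(cds: str) -> np.ndarray:
    counts = np.zeros(64)
    for i in range(0, len(cds) - 2, 3):                    # in-frame codons
        j = INDEX.get(cds[i:i + 3])
        if j is not None:
            counts[j] += 1
    return counts / max(counts.sum(), 1.0)

# Placeholder data: real use would pair coding sequences with virulence labels
# under same-strain negatives and gene-family-disjoint splits.
seqs = ["ATGGCTGCTAAA", "ATGAAAGGCTAA", "ATGCCGCCGCCG", "ATGTTTGGGAAA"]
y = [1, 0, 1, 0]
X = np.stack([codon_frequencies(s) for s in seqs])
clf = LogisticRegression(max_iter=1000).fit(X, y)
```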
Review 1
Thanks for your submission! Your thorough quantitative approach here is commendable, and the clearly spelled out limitations and future directions are great. The write-up itself is quite jargon-heavy and a bit on the long side, and I would like to see more discussion of the big-picture relevance: what are the consequences of the overestimation, and what should we do about it?
Review 2
This paper is rigorous, and the inflation-decomposition finding is important for anyone building or evaluating nucleotide-level biosecurity classifiers. The same-strain design and gene-family-disjoint evaluation are actionable practices for benchmark design. The main thing working against it is presentation density: the statistical analysis that makes the science trustworthy also makes the paper hard to absorb quickly. A shorter, punchier framing of the core result up front, with the full statistical treatment in supporting sections, would let the strength of the conclusion land immediately. The HGT mechanism is compelling but acknowledged as unvalidated, and the amelioration-score correlation they describe as future work would substantially strengthen it. The team has demonstrated deep knowledge and careful thinking.
Review 3
This paper shows that published performance numbers for DNA-level virulence factor classifiers — the kind that could screen raw synthesis orders — are significantly inflated due to two testing mistakes that compound on each other. Once you fix the test design, a simple model that counts codon frequencies is the only approach that actually generalizes to organisms it hasn't seen. The proposed explanation is that dangerous genes acquired through horizontal transfer still carry a subtle "accent" from their donor organism's codon preferences. Why it matters: If you're evaluating nucleotide-level screening tools and relying on published benchmarks, those benchmarks are probably overstating performance by a wide margin. This paper quantifies exactly how much and why. The proposed fast pre-filter for synthesis orders (no GPU, no pretrained model, runs in linear time) is a practical contribution to screening infrastructure. What's strong: Best methodology in the batch. Same-strain controls, family-disjoint evaluation, 20 random seeds, pre-registered analysis, careful statistical reporting. The finding that more complex models consistently overfit while the simplest one holds up is clean and actionable. What's missing: The HGT mechanism is a hypothesis, not a validated result — the title overstates this. They haven't computed performance at the false-positive rates that synthesis screening actually operates at (below 1%). No testing on engineered or codon-optimized sequences, which is what screening actually needs to catch.
Dataset Bottleneck Analysis (DBA) — Project Summary Biosecurity screening removes dangerous biological sequences from public databases, but a critical question remains unanswered: does removing those sequences actually prevent an AI-equipped adversary from reconstructing them using what remains? DBA is an open-source framework that answers this question empirically. We introduce a Redundancy Score (R ∈ [0, 1]) that measures how much of a restricted sequence set can be reconstructed from the public corpus. Applied to 4,844 real UniProt Swiss-Prot proteins with a cluster-aware split, DBA reveals a striking result: while BLAST-style k-mer screening achieves R = 0.064 (0% of sequences recoverable at ≥ 0.90 similarity), ESM-2 protein language model embeddings achieve R = 0.847 — 13.2× higher — with 95.5% of restricted sequences recoverable at the same threshold. This is the AI threat multiplier: the factor by which language-model-aided adversaries exceed the reconstruction potential assumed by sequence-identity policy. The most alarming finding is the toxin experiment. K-mer screening makes toxin proteins appear 64% safer than average (R = 0.023), creating a false sense of security. ESM-2 reveals the opposite: toxin ESM-2 R = 0.873 (98.6% coverage), exceeding random proteins (0.847) and exposing a 32× gap between what sequence-identity screening assumes and what a language model adversary can actually recover. DBA runs end-to-end in under 22 minutes on a laptop CPU with no GPU required. It is designed as a pre-deployment audit tool for screening programme designers: run it on your proposed screening category before setting thresholds, or you may be calibrating against the wrong adversary.
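As a sketch of how a redundancy score of this general shape could be computed: for each restricted embedding, take its best cosine match in the public corpus, then aggregate. The exact definition of R in DBA is not reproduced here; best-match cosine similarity and the two aggregations below are assumed stand-ins, and random vectors stand in for real k-mer profiles or ESM-2 embeddings.

```python
# Hedged sketch of an R-like score: mean best-match cosine similarity of each
# restricted embedding against the public corpus, plus the fraction
# "recoverable" at a 0.90 threshold.
import numpy as np

def redundancy_like_score(restricted, public, threshold=0.90):
    r = restricted / np.linalg.norm(restricted, axis=1, keepdims=True)
    p = public / np.linalg.norm(public, axis=1, keepdims=True)
    best = (r @ p.T).max(axis=1)          # best public match per restricted seq
    return best.mean(), (best >= threshold).mean()

rng = np.random.default_rng(0)
R, recoverable = redundancy_like_score(rng.normal(size=(50, 320)),
                                       rng.normal(size=(5000, 320)))
print(f"R~{R:.3f}, recoverable@0.90={recoverable:.1%}")
```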
Review 1
Very interesting approach. The intersection of new protein language models and other biodesign tools with existing screening controls has not received much prior attention, to my knowledge. As the researchers reveal, this is an oversight, because existing screening selection and calibration tools might give a misleading threat picture when considered in the context of new biodesign tools like ESM-2. This is an important vulnerability, and the recommendation to use ESM-2 over k-mer screening is novel and valuable. Perhaps even more valuable is the R measure, which can be reapplied as new AI tools are released in order to update screening selection. The project is well thought out and well executed. One thing that would make it a little stronger in terms of presentation would be to make the connection between the R measure and the recommendations clearer (especially for non-bioinformatics people like this reviewer). The authors might also provide general guidelines for applying this measure in future. Overall, an excellent project and a valuable contribution to AI security.
Review 2
The core finding is genuinely striking and easy to grasp: current screening doesn't just underperform against AI-equipped adversaries, it could actively mislead. The framework is lightweight enough to actually get used. The weakness is that the central claim, that sequences scoring above 0.90 similarity in embedding space are recoverable, is asserted rather than demonstrated. That equivalence is doing a lot of work, and it's not obvious it holds for the properties that actually matter in a biosecurity context. The experiments also run on generic protein databases rather than the sequences that real screening programmes actually restrict, so the jump to a policy recommendation is a bigger leap than the paper acknowledges. One concrete fix would be to show that high ESM-2 similarity actually predicts functional equivalence for at least one relevant property, whether that's toxicity, receptor binding, or whatever is available. Without that, the policy recommendation sits on an assumption.
Background & Problem: Standard biosecurity screening uses sequence identity (BLAST). ProteinMPNN redesigns toxin sequences below every BLAST threshold, achieving 0% detection across 723 redesigns.
Proposed Solution & Mechanism: A linear ESM-2 probe maintains 93.9% detection with no retraining. Using interPLM Sparse Autoencoders (SAEs), 50 features are identified at 205× compression that explain probe performance. These features are amplified by redesign (mean transfer ratio 1.28) because ProteinMPNN preserves structural fold topology—precisely what the circuit encodes.
Security Analysis & Evaluation: A four-tier attack taxonomy reveals the security boundary lies at gradient access: ProteinMPNN (6.1% evasion) vs. white-box attacks (100%). Direct Probe Attribution identifies layer 32 as the bottleneck (r = 0.992 redesign–toxin circuit correlation). SAE-based probes recover 38% of "Double-Evaders" that fool both BLAST and dense linear probes, demonstrating direction-sensitive detection beyond Euclidean boundaries.
Discoveries & Conclusion: Zero-shot scanning discovers 248 UniRef50 candidates enriched 4.75× for secreted signal peptides, including cross-kingdom fungal effectors (54% are currently annotated as "Uncharacterized" in UniProt). The probe's security guarantee equals the privacy of its weights.
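A minimal sketch of the frozen-embedding linear probe idea at the heart of this submission, with toy arrays in place of real ESM-2 states; the layer choice, probe training details, and data are assumptions, not the authors' code.

```python
# Linear probe over frozen embeddings (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 1280))       # stand-in for layer-32 ESM-2 states
y = rng.integers(0, 2, size=1000)         # 1 = toxin, 0 = benign (toy labels)

Xtr, Xte, ytr, yte = train_test_split(emb, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
scores = probe.decision_function(Xte)     # per-sequence detection score
```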
Review 1
Very interesting results and good progress for a hackathon weekend! I think people already suspected that pLMs could be quite helpful for that, but nice to see these numbers. I mostly wonder how much that changes with shorter sequences, though. This seems to be the crux, at least for me.
Review 2
An overall very strong effort. The comparison to BLAST screening is compelling and well elucidated, the demonstration of the utility of a simple linear probe on a frozen model is motivating, and the variety of experiments probing the nature of this detector is mostly compelling. However, the work would be improved by further consideration of what it means for the detector to be vulnerable to a "white box" gradient attack, and the writeup suffers from some internal inconsistencies (e.g. caption vs. content in Figure 1, differing assertions about the number of double-evaders).
Geometric Biosecurity is a continuous threat severity scoring system designed to address vulnerabilities in current biosecurity screening software (BSS) caused by AI-designed protein variants. Developed at the AIxBio Hackathon in April 2026, the system shifts from traditional sequence similarity matching to a functional embedding space by utilizing ESM-2 protein language model embeddings. By applying singular value decomposition (SVD) to extract a spectral threat axis, the model produces a severity score (0–1) that remains highly effective even when sequence identity is low. Validation on over 179,000 sequences demonstrated significant performance gains, particularly in the "AI-redesign evasion zone" (20–40% sequence identity), where it outperformed identity-based scoring by 31.6%, and in detecting short peptide toxins, where existing tools are often weakest. Intended as a complementary second-stage screening layer, the system adds a necessary dimension of geometric discrimination to protect against sophisticated synthetic biological threats.
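A sketch of the SVD step described above: extract a dominant direction from a set of threat embeddings and score queries along it. The centering, the choice of the top right-singular vector, and the min-max squashing to [0, 1] are illustrative assumptions, not the submission's exact procedure.

```python
# SVD-derived "threat axis" severity score in [0, 1] (illustrative).
import numpy as np

def severity_scores(threat_emb: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    centered = threat_emb - threat_emb.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]                           # dominant direction of the threat set
    proj = query_emb @ axis
    lo, hi = proj.min(), proj.max()
    return (proj - lo) / (hi - lo + 1e-9)  # squash to [0, 1]

rng = np.random.default_rng(0)
print(severity_scores(rng.normal(size=(200, 64)), rng.normal(size=(5, 64))))
```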
Review 1
Very strong proposal. Benchmark on real ProteinMPNN-redesigned sequences?
Review 2
I like that this shifts us in the direction of functional rather than sequence screening. While I can't comment too much on the technical approach, the 31.6% improvement over identity-based scoring in the AI-redesign 'evasion zone' seems substantial.
Review 3
This is a neat idea, and I do think there's a lot of value in augmenting existing synthesis screening algorithms with follow-up analyses by specialized biological tools. However, I am not sure how well this method will extend to novel or slightly modified viral proteins, which by their very nature sit closer in protein space to benign proteins than toxins do. It will also be interesting to see whether the method still works with shorter sequences. I am uncertain whether this particular approach is a promising research direction.
BioCalibrate is a benchmark for action that tests whether AI models refuse biologically dangerous requests for the right reasons, not just because a topic sounds scary, but because it poses actual operational risk. We ran 338 biosecurity queries across 8 major AI models (2,704 total evaluations), organized by Digital Biosafety Levels (BDL-1 to BDL-4, modeled on physical lab containment levels), and measured whether refusal behavior matched real-world threat severity. The results show a systemic failure where safety systems learned to pattern-match on pathogen names rather than assess danger, leaving the most genuinely dangerous queries largely unblocked.
- A reusable CLI tool, interactive dashboard, and open dataset that generates Model Biosafety Scorecards showing exactly where each model's safety calibration breaks down
- 28% best-model refusal rate on BDL-4 weaponization queries, against an expected 100%
- Fear Risk Inversion, where models refuse Ebola more than Influenza despite Influenza being the higher operational threat, statistically confirmed ecosystem-wide (FRI +0.099, p<0.05)
- 12.1% cross-model bypass rate showing queries refused by one model are answered freely by another, proving safety is an ecosystem problem that per-model fixes cannot solve
- 97% compliance on bio-AI tool orchestration queries, where models freely generate dangerous protein design pipelines at BDL-3/4 levels
- 3 models benchmarked on CBRN topics for the first time in any published study
- Dashboard: biocalibrate.org
- Dataset: https://huggingface.co/datasets/lightmate/biocalibrate
- Code: https://github.com/BioCalibrate/BioCalibrate
Review 1
BioCalibrate introduces a domain-specific benchmark for refusal calibration, moving past binary 'refuse/assist' metrics to more nuanced Digital Biosafety Levels. I was not able to see the 338 prompts in the Hugging Face link. The authors should not publicize prompts at BDL-3 or above, for infohazard reasons.
Review 2
The 12.1% cross‑model bypass rate is doing far more work than the metric can support. As defined, “at least one model refuses while another complies” over a pool of 8 models will climb just because you add more systems, not because the ecosystem is especially unsafe. At 2 models the number would drop, at 20 it would rise, by construction. I’d want to see that curve plotted against pool size, plus a baseline where you run the same calculation on benign BDL‑1/2 queries. Without that, 12.1% looks like a quirk of the evaluation harness rather than a property of deployed models. The deterministic regex parser is the right choice for reproducibility, and I appreciate that you left κ = 0.571 in the text instead of hiding it. Still, moderate agreement at n = 160 across 8 models means each model’s estimate carries a wide interval, and Table 3 leans too hard on tiny gaps. Qwen3.5 at 28% and Kimi at 22% on BDL‑4 sit inside overlapping confidence intervals, and Figure 1 even makes that visible. The story reads as if there is a clean ranking when the data only supports loose tiers. Either increase the per‑model validation size or describe the results as bands of behavior instead of an ordered list.
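The reviewer's construction-artifact point is easy to demonstrate in a few lines. In the sketch below, refusals are simulated as independent coin flips per model, a deliberately crude assumption; even so, the "someone refuses, someone complies" rate climbs with pool size by construction.

```python
# Bypass rate vs. pool size for identically behaved simulated models.
import numpy as np

rng = np.random.default_rng(0)
n_queries, p_refuse = 338, 0.3
refusals = rng.random((n_queries, 20)) < p_refuse     # queries x models

for k in (2, 4, 8, 16, 20):
    pool = refusals[:, :k]
    mixed = pool.any(axis=1) & (~pool).any(axis=1)    # disagreement on a query
    print(f"pool={k:2d}  bypass_rate={mixed.mean():.3f}")
```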
Review 3
The Fear:Risk Inversion framing is very policy-actionable. The matched adversarial-benign pair design is methodologically sound. However, the BDL framework has been introduced prior to this (https://arxiv.org/html/2602.08061v1). Reusing the BDL terminology for query/refusal tiers risks conceptual confusion, given that the term was previously established by a group of researchers in AIxBio. I recommend either renaming the framework or explicitly framing it as an extension of Bloomfield et al. with a different scope. Otherwise, the writing and framing are nicely executed.
Biosecurity screening of synthesized or environmental DNA mostly asks: does this look like something on a known-toxin list? That fails when an attacker (or evolution) changes enough letters to escape the lookup while keeping the protein's harmful function. We test whether biological sequence models trained on proteins and DNA can spot that function directly. On 4,060 toxin and benign coding sequences, evaluated with close homologs (evolutionarily related sequences) kept out of training, an ensemble of a DNA model (Evo2 7B) and a protein model (ESM-2) catches 85.8% of toxins at a 1-in-100 false-alarm rate, versus 72% for Evo2 alone, 71% for ESM-2 alone, and 55% for a simple 5-letter-pattern baseline. By comparison, the open-source COMMEC policy screen (biorisk-only mode) flags only 16.8% of the same toxins, showing that learned models catch toxins the curated lookup databases miss. After mutating 60% of amino acids to disguise toxins, the protein screen still recovers 98%; the baseline drops to 41%. Reliable behavior requires DNA fragments ≥1,500 bp.
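The "1-in-100 false-alarm rate" operating point quoted above can be read directly off a ROC curve; a sketch with toy scores, assuming scikit-learn, follows (the score distributions are placeholders, not the ensemble's outputs).

```python
# Detection rate at a fixed ~1% false-positive operating point (illustrative).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y = np.r_[np.ones(300), np.zeros(3000)]                    # toxin vs benign
s = np.r_[rng.normal(2, 1, 300), rng.normal(0, 1, 3000)]   # detector scores

fpr, tpr, _ = roc_curve(y, s)
i = np.searchsorted(fpr, 0.01)             # first operating point at FPR >= 1%
print(f"recall at ~1% FPR: {tpr[i]:.3f}")
```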
Review 1
Incredibly interesting and valuable work. Anything that can be done to make screeners more robust - especially to evasion attempts - is a welcome addition to the screening toolset. This work is innovative and has direct implications for improving biosecurity. While the limitations are adequately spelled out in the paper, the clear next steps are to "red team" the approach against purposeful evasion / substitution attempts.
Review 2
Overall, this is a promising idea and a good start investigating it, with many thoughtful methodological elements. The presentation is admirably thorough. The five-nucleotide pattern baseline is unmotivated, and seems like a straw man comparison. The key weakness of the study is that, as far as I can tell, the benign and toxin portions of the train/test dataset have dramatically different length distributions. Unless I'm missing something, the model could simply be learning the rule "short = toxin". (Unless I misunderstand something, model embeddings could encode sequence length in some way.) In what other ways might the benign and toxin distributions differ? How were benign examples selected? Relatedly, are these representative of the kinds of sequences sent to synthesis companies (dominated by synthetic and heavily engineered constructs)? I'd like to see this study redone with two major changes: (1) a stronger method than five-nucleotide word frequency for a comparison baseline, and (2) a more carefully balanced and synthesis-representative dataset.
Review 3
This seems like a reasonable set of results, but I had difficulty figuring out whether AAs were run or just nucleotides. You do talk about "BLAST's protein search" but then you keep talking about DNA, and there are certainly systems that can detect recoded sequences, e.g., by looking at the peptide rather than the DNA, and it's unfortunate you didn't try this with any of those, e.g., not just using a best-match system. It is certainly possible to detect recoded sequences using these systems. It's also unfortunate that you can't do this for <1500 bp seqs; there's a lot of opportunity there for assembly-based mischief. Nonetheless, figuring out these sorts of results is a worthwhile endeavor. A nit: "HHS Common Mechanism 2024" feels like AI slop/hallucination. This seems to be confusing HHS guidelines with the IBBIS Common Mechanism implementation. Also, in A.7: couldn't someone else just train a model as you did?
Open-weight protein design models might generate toxic or virulent proteins. Current classifiers are accurate but not interpretable or explainable. In this work, we train Sparse Autoencoders (SAEs) on RFD3 and RF3, leading open-source protein folding and design models. We find SAE features with meaningful correlation to toxicity and virulence, with the top classifier reaching 0.87 AUROC. Results at https://www.raft.bio/blog/saeber.
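For orientation, a minimal sparse autoencoder of the kind described, in PyTorch; the dimensions, the ReLU-plus-L1 recipe, and the random activation batch are illustrative assumptions, not the submission's training setup.

```python
# Minimal SAE sketch: reconstruct activations under an L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))          # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder(d_model=1024, d_hidden=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(256, 1024)                # placeholder activation batch
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # L2 + L1
loss.backward()
opt.step()
```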
Review 1
You clearly know how every element of this project works and how it can be improved, which is really great to see. If I were you, I'd get in contact with mech-interp-for-bio-model researchers; this could be a great stepping stone to doing this research full-time if you're thinking about going into AIxBio (I see you're a MATS Fellow, so I assume you're more general AI safety). The text was refreshing to read; I'd love to see what happens if you or someone else implements your Future Directions.
Review 2
This project provides a strong proof-of-concept for applying mechanistic interpretability techniques to building better safeguards against protein design models. The progress made is very impressive given the limited time and resources. I would like to congratulate the author for this piece of work, well done! The potential to better understand virulence predictions is particularly valuable for tracking high-risk use cases of protein design models, including identifying emerging trends and revealing novel threat models. A richer description of the results or more explanation on the website would benefit the presentation, especially for readers less familiar with the topic (just have an LLM do it!). This is completely understandable given the constraints of working as a one-man team within a hackathon timeframe. Showing an example success case, such as a successful identification of a virulent motif, could be a powerful demonstration.
Review 3
A really strong submission. Your novelty might be a bit overstated in places, as this type of work has been performed and is ongoing, but you do highlight that and directly identify that what is novel is applying it to RFD3/RF3. I also appreciate that your limitations related to time and compute resources were very transparent and accurate, and this still represents an impressive amount of work done technically well for a hackathon. The experiments that were performed were done well, with proper controls. This is good, rigorous ML research. Notably, the finding that RFD3 memorizes family folds (i.e. block 6 → near-random under clustering) is genuinely interesting and biosecurity-relevant: it potentially implies the model's safety properties depend on whether the input distribution overlaps with training families. The n=44 per fold under clustering is fragile, but again that makes sense given the time and compute restraints, and this is nice groundwork to be followed up on in future studies. It really does deserve a more in-depth exploration and is a tantalizing finding. The block 12 RFD3 cluster-split finding is interesting, and the polysemanticity-untangling interpretation is plausible, but it falls into the same category of a nice hypothesis or foundation worth following up on. Layer selection is admittedly ad hoc; you flagged this and proposed the knockout-pLDDT alternative for future work, which I think is a great idea and the correct move. The headline 0.817 vs. SOTA 0.92 gap is larger than the text suggests. I would say this is getting close, but 'within striking distance' may be a bit strong.
Current DNA synthesis screening relies on sequence homology, which AI protein design tools like ProteinMPNN evade by generating functional threat variants with as low as 7% sequence identity to known threats. We introduce FuncScreen, a contrastive learning framework over frozen ESM-2 embeddings that screens by predicted biological function rather than sequence similarity. Trained with supervised contrastive loss, hard-negative mining, and embedding-space Mixup augmentation on 985 curated pore-forming toxin and benign homolog sequences, FuncScreen achieves 1.000 AUROC [1.000, 1.000] on standard and hard-negative splits. On 4,100 ProteinMPNN-designed adversarial variants, FuncScreen maintains 0.991 AUROC [0.988, 0.993] where homology drops to 0.952 [0.944, 0.959]. We provide a preliminary certified robustness analysis under biologically structured mutations (1,000 Monte Carlo samples, 100 sequences), finding an empirical-certified gap of at most 1%. We validate generalization on a second threat family (ribosome-inactivating proteins, AUROC 0.962) and report out-of-distribution false positive rates.
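Of the training ingredients listed above, embedding-space Mixup is the easiest to show in miniature. The sketch below mixes embeddings within a batch; the mixing-within-class choice (versus mixing across classes with correspondingly soft labels) and the alpha value are assumptions, not FuncScreen's configuration.

```python
# Embedding-space Mixup sketch: convex combinations of embeddings.
import numpy as np

def mixup_embeddings(emb: np.ndarray, alpha: float = 0.2, seed: int = 0):
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=len(emb))[:, None]   # per-row mixing weight
    perm = rng.permutation(len(emb))
    return lam * emb + (1 - lam) * emb[perm]

augmented = mixup_embeddings(np.random.default_rng(1).normal(size=(32, 1280)))
```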
Review 1
This is an important contribution to the screening literature. I appreciated the use of two different threat families and the use of multiple evaluation splits. I thought the inclusion of the Leave-One-Subcategory-Out cross-validation was particularly interesting. I particularly enjoyed your framing of function-based approaches as a complement to more traditional sequence-based approaches; this struck me as constructive rather than adversarial. It would also be interesting to see this tested with more threat families.
Review 2
Seems like a very valuable contribution to a future problem and I'd like to see this work taken further.
The emergence of AI-powered biological design tools has necessitated a shift in biosecurity from sequence-alignment methods to function-prediction-based analysis. Current DNA screening protocols relying on BLAST are increasingly vulnerable to de novo designed sequences that evade similarity thresholds while retaining pathogenic functionality. We propose a scalable biosecurity screening pipeline that utilizes the ESM-2 transformer architecture to extract deep biological features from viral sequences, followed by similarity retrieval using FAISS. We implement a multi-class classification scheme to generate a functional "ID card" for proteins across five axes: Baltimore classification, Molecular function, Host category, Cellular tropism, and Zoonotic potential. Evaluating 15,154 viral samples from UniProt, our approach achieves a superior aggregate F1-score of 0.89 compared to 0.77 for BLAST. The embedding-based pipeline demonstrates significant performance gains in complex domains such as Host category (0.94) and Cellular Tropism (0.98), where sequence identity often fails to reflect biological roles. These results indicate that high-dimensional embeddings successfully capture the structural and functional constraints of viral evolution, providing a robust, semantically aware guardrail for modern biosecurity.
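The retrieval step of a pipeline like this is compact to sketch; the snippet below assumes the faiss-cpu package, with random vectors standing in for mean-pooled ESM-2 embeddings and an inner-product index over L2-normalised vectors standing in for whatever index the authors actually used.

```python
# Embedding similarity retrieval, sketched with FAISS (illustrative).
import numpy as np
import faiss

d = 1280
db = np.random.default_rng(0).normal(size=(10000, d)).astype("float32")
faiss.normalize_L2(db)                      # so inner product == cosine
index = faiss.IndexFlatIP(d)
index.add(db)

q = np.random.default_rng(1).normal(size=(1, d)).astype("float32")
faiss.normalize_L2(q)
sims, ids = index.search(q, 5)              # top-5 nearest annotated proteins
```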
Review 1
This submission is well structured, clearly written, and uses visualization where appropriate, all of which conveys the ideas and findings very well. I appreciate the novel approach to synthesis screening, clearly accounting for de novo AI-enabled design. This is the direction synthesis screening will have to take in the future. However, as acknowledged in the paper, a discussion of what to do with the ID card is lacking. This will be crucial to turning this submission's approach into an effective screening tool. For example, follow-up work could use expert surveys to determine which sequences should be flagged, and especially how to trade off between different risk factors.
Review 2
This was a really great problem to pick and the approach itself was reasonable, but the project fell short in a couple of areas for me:
1) The defining problem is how we catch de novo designed sequences with screening tools that can act on sequences alone. The work only tested the screening pipeline on known sequences, and did not test or discuss how well this might generalize to out-of-sample (i.e. truly de novo) sequences that could have the same function. My guess would be: it doesn't.
2) The compelling approach to this problem (which was rightly identified by the author) was to try to use AI tools to predict functionality from sequence alone. However, the work only did this in a fairly narrow sense (specifically in the 'Function' prediction task), while the rest of the classification tasks were merely that: classifying various descriptors of the virus sequence, such as host and tropism. These are not really functions per se, and many aren't really relevant for predicting whether a sequence is dangerous (e.g. Baltimore classification). Also, many of these classification tasks seemed kinda simple (as evidenced by BLAST pretty much getting it right across the board). This data would be a grind to gather for 1000s of virus sequences, but data more directly relevant to predicting how dangerous a sequence is would be e.g. human cell infection, viral titre, entry assay data, fusion assay data, genetic stability, mutation rate, glycan usage, immune evasion, host protein binding sites, etc.
3) Conclusions were overstated (0.89 vs 0.77 for simple classification tasks doesn't seem like a 'significant advancement' to me), and there were some pretty generic/recycled explanations of how high-dimensional embeddings have magically captured 'billions of years of evolution'. Several assertions were off the mark in a biological sense: as a practising virologist, it was news to me that there are special 'zoonosis motifs' that can predict zoonotic potential. If only!
4) There was little to no discussion of limitations or caveats. I would have expected to see some kind of model validation metrics or other checks to make sure your models are not overfitting to the prediction tasks, for example. It would have been really great to mention the limitation that this pipeline would not necessarily work for true de novo sequences.
Sorry if I sounded mean - I think overall you were on the right track! It was a great problem to pick, I think the methodology was sound in principle, and you presented the results clearly and efficiently.
Review 3
The problem this paper addresses is real and important. The field is excited about moving towards function-based approaches to sequence screening, away from BLAST-based screening. The five-axis ID card framing is intuitive, and the ESM-2 plus FAISS pipeline is well-implemented for what it actually does. Some issues that stand out:
- The introduction frames this as a defense against AI-designed sequences that evade similarity-based screening. But the entire evaluation uses reviewed UniProt proteins: well-characterized, well-annotated sequences that are almost certainly similar to ESM-2's pretraining data. No de novo designed sequences are tested anywhere. The core threat model is stated but never evaluated, which means the biosecurity claim rests entirely on inference rather than evidence. Granted, evaluating on genuinely de novo designed sequences is extremely hard; wet lab validation is not a realistic ask, and even computationally generating good test cases is non-trivial. But this should be explicitly acknowledged as a core limitation rather than left implicit.
- The BLAST comparison overstates the contribution. BLAST is a sequence similarity tool; it was never designed to predict host category or cellular tropism. Beating BLAST at functional classification is not a meaningful benchmark. The right comparison is against purpose-built protein function classifiers, several of which exist in the literature; frameworks like PROBE explicitly benchmark ESM-2 embeddings on function prediction tasks and would have been the appropriate baseline. The aggregate F1 improvement of 0.89 vs 0.77 is presented as the headline result, but it doesn't answer the question the paper asks.
- There's also a data leakage concern worth flagging. Labels were derived from UniProt metadata, and ESM-2 was pretrained on UniProt sequences. The model may be partially recovering annotations it was exposed to during pretraining rather than genuinely learning functional biology from sequence alone. This is unacknowledged.
- The tropism result warrants caution. After excluding 13,985 sequences as uninformative, only 1,169 samples remained across four categories, for a total of roughly 117 test samples. At that scale, one or two misclassifications swing the F1 significantly. A result of 0.98 on approximately 25 examples per class is not robust enough to draw strong conclusions from.
- The zoonotic result is the most honest and interesting part of the paper. BLAST outperforms the embedding approach here (0.83 vs 0.80), and the explanation — that zoonotic potential is tied to conserved sequence signatures that local alignment captures better than semantic embeddings — is well-reasoned and adds genuine nuance.
For this approach to actually catch novel dangerous proteins, you'd need training data that explicitly labels dangerous function, not just taxonomy and host category. That data doesn't exist publicly, and curating it would itself be an infohazard. The paper doesn't engage with this at all, which is the most important limitation it leaves unaddressed. What's been built is a strong functional annotator for known viral protein space. That has real utility — it outperforms BLAST on classifying divergent but known sequences, which matters for surveillance of natural variation. That's a legitimate contribution, just a narrower one than claimed.
Current DNA synthesis screening operates per-sequence, per-vendor, and is blind to distributed attacks where threat sequences are fragmented across providers. BioChain closes this gap with two layers: a cryptographic audit trail linking cross-vendor fragment orders via locality-sensitive hashing and blind-signature tokens, and an ML scoring engine using ESM-3 embeddings with a permutation-invariant Set Transformer to classify reassembled fragment sets. On 5-fold cross-validation holding out entire toxin families, BioChain achieves AUC 0.907 ± 0.032. We characterise two failure modes, poor calibration (ECE = 0.1296) and neutralised-mutant blindness, and frame this as an existence proof for function-based distributed-attack screening.
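A toy SimHash over k-mers illustrates the kind of locality-sensitive fingerprint that could link near-identical fragments across vendors without sharing raw sequence; the k-mer size, hash, and 64-bit width below are illustrative, not BioChain's parameters.

```python
# Toy SimHash fingerprint: near-duplicate fragments land at small Hamming
# distance, making them candidate cross-vendor links.
import hashlib

def simhash(seq: str, k: int = 8, bits: int = 64) -> int:
    votes = [0] * bits
    for i in range(len(seq) - k + 1):
        h = int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

a = simhash("ATGGCTAAACCGGGT" * 10)
b = simhash("ATGGCTAAACCGGGT" * 10 + "A")    # near-duplicate fragment
print(hamming(a, b))                         # small distance => candidate link
```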
Review 1
The fragmented-order problem is serious, and a great topic for probing technical approaches to risk reduction. In that sense, the topic selected and the approach could hold great importance. On one specific note on operationalizing the data, I would suggest the authors reconsider tying the "hard negative" category to "neutralising mutations confirmed in literature" and a firmly benign judgment. There are sufficient public cases to warrant concern about engineering strains that are benign to humans into ones that become virulent, and a public history of a few actors using this approach to create novel biological weapons going back decades (even before modern tools that could facilitate such work). Perhaps results could be a gradient rather than 3 categories, and more dynamically consider factors such as the end user behind the order. Indeed, if a legitimate user is ordering mutations that are deliberately designed for solely therapeutic or experimental purposes, why would they fragment the orders in the first place, unless perhaps to avoid the costs of a patented product? This phenomenon could itself be an important flag that the order is worth a deeper look. Broadly, it is wonderful to apply the latest tools and models to risk reduction, given the chances of their misuse. Given the relative newness of ESM-3 and what is currently publicly stated about its security/safety training and features, I wonder if the authors considered use of ESM-2 as well? Re the use of ESM-3, it will be a defining feature of our era that we have to constantly weigh what is stated publicly about how these models could be applied for risk creation and reduction, and determine the best timing for openly describing tests of their utility based on the results, the robustness of studies, the potential deterrence value of showing that deep tools are being applied for risk reduction, etc. Overall, definite strong questions, focus, and effort for a hackathon.
Review 2
The Layer 1 crypto story reads like the big swing, then never lands. Section 3.1 throws Order Commitment Records, blind-signature customer linkability, SimHash-based cross-vendor fragment matching, and a CT-style Merkle log on the table. There is no simulation, no stress test, no back-of-the-envelope failure analysis of false-link rates once you hit real synthesis-traffic volumes. SimHash overhang on near-duplicate fragments is exactly the kind of thing that quietly kills this design, and it's not even scoped. I also don't buy the trust model as written: who holds the global pepper, who runs the Merkle log, and what stops a well-lawyered vendor from opting out under GDPR or contractual privacy obligations? The ML section is the one place where the work feels grounded. I'd frame Layer 2 as the actual contribution and mark Layer 1 as a design proposal that still needs simulation, a Sybil story for Karma Scores, and a deployment model that a real vendor could sign up for.
Review 3
That's a really nice approach to a mostly untackled problem and work I would like to see continued. I am not an expert on using pLMs for screening so I can't comment too much on that work. My sense is that toxins are somewhat unrepresentative of the threats we mostly expect and that performance on much shorter, offset fragments will be worse. Ultimately this is just one solution to evaluating the more interesting part, which is imho the cross-vendor detection! This could be a really nice follow-up project. I am a bit doubtful that overhang similarity is the right (or even functional) metric, but it's a great starting point—same with the cryptographic approach. Having a design draft on this is good stuff and I'd like to see this continued somehow.
Current DNA synthesis screening relies on sequence-homology searches (BLAST/MMseqs2) against curated threat databases. This mostly works when a submitted sequence resembles a known pathogen-associated sequence, but it is structurally mismatched to a world in which protein design models can produce functionally coherent variants that are divergent in sequence from known proteins, precisely the capability that biological design tools like RFdiffusion, ESM3, and Evo 2 now provide. Here I propose and implement a four-phase latent-space anomaly detection pipeline that uses the internal representations of biological foundation models to flag structurally complex threats (prions, superantigens, novel toxins, immune-evasive peptides) based on their functional geometry in embedding space rather than their surface-level sequence, as a way to "catch" potentially unexpected biological threats before they are synthesized. This approach unites cross-modal structural scoring (ESM3), background-corrected likelihood ratios (Ren et al. 2019), domain-specific sparse autoencoders with mandatory dead-salmon controls, and contrastive representation engineering. On a synthetic validation dataset comprising 500 benign sequences, 300 threat sequences across three pathogen classes, and 200 hard-negative de novo designs, the calibrated ensemble achieves an AUROC of 0.997 with clean separation between all threat classes and benign controls. Crucially, the hard-negative de novo designs cluster distinctly from both threat and benign populations in embedding space, and the linear probe baseline alone achieves near-perfect discrimination, suggesting that biological foundation models encode threat-relevant functional information in linearly accessible directions. I present a modular implementation (3,200+ lines, 13 passing tests) with dual-use review guidelines, explicit methodological controls that address known failure modes in the SAE interpretability literature, and a SECURITY.md protocol for responsible disclosure.
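At scoring time, the background-corrected likelihood-ratio component (Ren et al. 2019) reduces to a subtraction of per-sequence log-likelihoods; in the sketch below, the two arrays are placeholders for outputs of a full model and a background model trained on perturbed inputs, so only the arithmetic is shown.

```python
# Likelihood-ratio scoring sketch: full-model log-likelihood minus
# background-model log-likelihood, summed per sequence.
import numpy as np

def llr_scores(logp_full: np.ndarray, logp_background: np.ndarray) -> np.ndarray:
    # Both arrays: shape (n_sequences, sequence_length) of per-token log-probs.
    return logp_full.sum(axis=1) - logp_background.sum(axis=1)

rng = np.random.default_rng(0)
scores = llr_scores(rng.normal(-2.0, 0.5, (10, 300)),
                    rng.normal(-2.5, 0.5, (10, 300)))
```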
Review 1
This project provides a strong architectural proof-of-concept. While the end-to-end pipeline is a highly useful and complete deliverable, the foundational claims regarding threat discrimination require broader evaluation and validation with better test sets. Given the time constraints of the hackathon, the scope and execution of this project are appreciated.
Review 2
Excellent consideration of the specific needs of synthesis screening and the use of tools for biosecurity. You did a wonderful job in thinking about how your approach would be used in the real world and not just as an academic exercise. I was also struck by the range of different pathogens you discussed; it is unusual to see prions included in synthesis screening! Equally, your efforts to consider the dual-use implications of step 4 were useful and interesting. I liked the framing of your approach as a complement to (rather than a replacement for) traditional sequence-based approaches.
Review 3
An ambitious project combining multiple discriminants between threatening and non-threatening protein sequences into an auditable pipeline. The scope is broad and the technical execution appears quite sound. The report is mostly clear, and the description of training different models, building an ensemble model, and analyzing contributors to ensemble performance is tight. The "hard negative" class is clever and makes the report much more interesting. Some suggestions:
* The description of dead-salmon SAE validation is not fully clear. How were margins set? Are we sure that features recoverable linearly through the randomized network (e.g. by simple linear projection) are not relevant, even if they are not a direct product of learning in the trained network?
* The relationship between the models used (ESM2 vs. ESM3, e.g.) isn't obvious.
* As a broader extension, it would be nice to compare the embedding-based method to a purely structure-based method. E.g., if AI tools are nearly exactly generating known structures with new AA sequences, structural similarity to known threats could be a good test.
* Despite the emphasis on the pipeline as the key deliverable, it's not obvious from the report what the major contributions of the pipeline are, and how they differ from other ensemble models or modular software.
Probing Risk Representations in Protein Language Models tests whether ESM-2 activations can provide a supplementary biosecurity signal for DNA synthesis screening. Current screening relies heavily on homology matching, which may miss AI-designed protein variants that retain dangerous function while diverging in sequence. I trained linear probes on ESM-2 representations from 1,264 labelled proteins across pathogen families and evaluated generalization on three fully withheld families. The main result is negative: global probes fail to generalize well and show no statistically significant advantage over a BLAST keyword baseline. The project also identifies two screening-relevant failure modes: high false positives on scrambled sequences and degraded detection on short fragments. Family-specific probes perform better in-distribution, suggesting a possible routed screening architecture combining homology tools with local representation probes.
Review 1
Rigorous negative results are actually valuable for the field!
Review 2
Great topic/questions for a sprint, and nice results to continue exploring. I appreciate the use of ESM-2 applied in this way, and that info-hazard concerns were taken into account proactively by the authors. It's not an info hazard as such, but if the authors carry this forward and publish, they could consider not including stats on certain high-sensitivity families (e.g., pox) and stating as much.
Review 3
I really like this work, because having negative results is *useful* and you rarely see them presented. Please never ever typeset an entire paper in italics. Literally half of the submissions I've reviewed were entirely set in italics, and I suspect it's because everyone was working from a submission template which used italics to give instructions about what to say in various sections, and you just inherited that formatting. But you should be more vigilant; please don't make reviewers' eyeballs bleed. :) A nit: There is no SecureDNA "consortium." This reads like it might have been AI hallucination, or perhaps just a misunderstanding.
Screening tool that uses LLM embeddings to determine whether a protein sequence is functionally similar to known threat sequences. It performs much better than BLAST and shows that the embedding space of models trained on proteins seems closely related to the space in which proteins are functionally similar.
Review 1
The project correctly identifies that screening based on sequence homology has large blindspots, and that embedding-based screening could solve some of these. I think it's a useful proof-of-concept. I thought there were three main things that don't quite work. First, the variants were generated by conservative amino acid substitutions within biochemically similar groups. ESM-2 was trained on such sequences, because this is what happens during evolution. So ESM-2 giving high cosine similarity to these variants is largely a consequence of the experimental design, not evidence that it detects function: you have shown that ESM-2 detects biochemical similarity, which is almost guaranteed given how the variants were made, so this is not a great proxy for function. It would be cool to see how generating variants with ProteinMPNN affects ESM-2-based detection. Second, you don't have specificity as a metric; sensitivity alone doesn't mean much for a detector. Third, your baseline is a bit too low; a more realistic one would have been to run the sequences through commec.
Review 2
Using protein language models to generate measures of similarity between sequences seems like a natural and sensible approach to me. (I am not an expert in DNA synthesis screening). I can imagine it fitting into a wider set of algorithms run as a part of the screening pipeline, and that it would improve accuracy. I found the write-up quite clear, and appreciated the effort put into validation and empirical results.
DNA synthesis screening prevents bad actors from obtaining the physical sequences needed to produce dangerous toxins and pathogens. However, current screening tools like BLAST and SecureDNA rely on sequence similarity to known threats, and recent work has shown that AI protein design tools such as ProteinMPNN can generate functional toxin variants that evade these screens at rates approaching 100%. We introduce a screening approach that trains an activation probe on ESM-2 embeddings to recognize toxic function across diverged sequences; on held-out synthetic variants at ~40% identity to their parents, our classifier maintains 86.7% recall while BLAST recall collapses to 46.7%. This provides initial evidence that protein language model embeddings can be a robust second layer of defense for DNA synthesis screening, complementing current similarity-based methods.
Review 1
You have generated important new knowledge and contributed to the challenge of moving from sequence-based to function-based screening. I loved the discussion on the importance of lab experimental confirmation and the challenges it brings. I would like to have understood a little bit more about how time and compute factors limited your work. I think understanding the limits of what such approaches might achieve, and connecting that to resource availability, adds another dimension to this interesting and important challenge.
Review 2
Toxins/toxicity are a really important focus area. I fully agree with the importance of more advanced synthesis screening approaches, and testing utility of ESM-2 embedding may help drive toward function-based approaches that we need. Results seem reasonable and clearly reported. If authors pursue this further, I would suggest continuing the thread on toxins as equally important to the task of applying the approach to bacteria and viruses (as they are publicly noted in terms of potential bioweapons activities by certain countries in the latest State Department treaty compliance reports). Appreciate the focus on ESM-2 rather than other open models for which potential risks/info hazards are less well explored to date.
Review 3
This addresses the right problem at the right time and the pipeline design is genuinely thoughtful, particularly the cluster-aware evaluation and the ablation showing ESM-2 embeddings carry enough functional signal to generalise without synthetic training data. The central weakness is that the headline results rest on 15 sequences per divergence level, which means the recall numbers could shift substantially with a handful of different outcomes. Acknowledging variance across runs is honest, but it also undercuts confidence in the specific numbers the paper leads with. The false positive rate tripling relative to BLAST also needs more engagement, because in a real screening deployment that’s the number that determines whether providers actually adopt the tool owing to operational costs. Scaling up the synthetic evaluation set and stress-testing the false positive rate at operationally realistic thresholds would turn this from a promising proof of concept into something deployable.
Benchtop DNA synthesizers may soon enable bioweapon synthesis in individual labs without hardware-enforced controls. We propose a hardware design with three layers of defense: sequence screening, a regulator signature the device refuses to run without, and physical monitoring of the synthesis process. The first two reuse hardware primitives from AI chip governance. The third is novel, and addresses an attacker who submits a benign sequence and physically tampers with the device to produce a hazardous one instead.
Review 1
With the increasing accessibility of benchtop synthesizers, knowing how to mitigate the synthesis of sequences of concern is vital. This paper gives concrete recommendations for screening and authorization in a benchtop synthesizer and for preventing tampering with the synthesizer. The recommendations against tampering were interesting and well worth consideration. However, I found the recommendations for screening, and the requirement for the regulator to pre-approve each sequence, to be impractical. The vast majority of sequences are benign, and having them require approval would be a huge burden; rather, only sequences with high homology to sequences of concern should require pre-authorization. There was also little data to validate the approaches. Recommendations on how the screening tool can be updated without tampering, and how it would handle AI-generated oligos, would have been useful as well.
Review 2
I found the problem and framing to be quite good in presentation. There was some jargon used from the hardware/AI/cyber-security side, but I think most readers could follow. I thought the connection to printer ink cartridge authentication was a clear example, and more analogues or exemplars could have been used in other proposal areas. Since this was an evaluation under a time limit, the designs are understandably conceptual, and Figure 1 provides useful context; however, an additional diagram or table illustrating the broader threat landscape and potential failure points would have been nice. One area that was neglected is how calibration, maintenance protocols, and service contracts would fall into this design (e.g. how might calibration drift affect pipetting or volume detection, and then be adjusted). This is likely an area where security could be firmed up, but it would be useful to identify where it may provide failure points or access to bad actors.
Review 3
This is a nice set of countermeasures, but there's an unfortunate feasibility problem with the entire idea. Having personally advised benchtop vendors on their machine security, it's apparent that even spending the small amount of extra money it takes for a single-board computer which allows adding a TPM chip (as opposed to ones which don't even have a place on the board where one may be attached) isn't something vendors are going to do unless forced, such as via legislation. Expecting them to do a good job with secure boot chains and the like isn't likely in the near future absent some way to make this a commercial priority, much less adding all kinds of hardware to check that the machine is producing what it thinks it's producing. It would be *very nice* if vendors really did this, but absent incentives, the chances of vendors actually incorporating any of these ideas seem near zero. (And it's not just the hardware; doing security *right* takes expertise, which isn't the core competency of a synthesizer producer, and hiring those people also takes money. So the problem here is the incentives.)
As for the technical details:
(a) The guarantee processor seems to have all kinds of issues re vulnerability to attacks, staleness, revelation of nonpublic hazards, cost, etc., mostly because you're having it recapitulate the work of screening a second time; this seems to make it large and with a large attack surface. In particular, asking the GP to recompute the DOPRF means asking it to be at least as powerful as the main CPU in the benchtop. Given that you're citing SecureDNA's system here (which was designed for benchtops), what you should probably do instead is take advantage of the "verified screening" mode, which cryptographically signs over the results and a hash of the input sequence. Then the GP need only check that signature, along with the other state-of-the-machine verification it's already tasked to do.
(b) The ping time-of-flight isn't novel (not your problem but that of the paper you cite); it's quite old. But the problem with it in the case of benchtops is that you're at the mercy of the typically terrible network infrastructure of random labs, which often have very high latency and jitter (for all you know, the benchtop is on a wireless network), and it also tends to disenfranchise non-first-world labs because their bandwidth is typically even worse. (It's even worse if it's also competing with the network traffic from running the SecureDNA protocol or of any other high-usage devices on that lab's network.) Depending on ping times is a good way to randomly inhibit legitimate synthesis due to poor infrastructure. ("AI chips" are likely being run in a first-class data center, which is an *entirely* different network environment than the typical university biolab.)
(c) Reagent verification is a nice idea, but again, expensive. That's unfortunately the crippling flaw with most of the things presented here, even if we wish it weren't so.
(d) Locking cartridges so they can't be rearranged also runs into the same failure mode as the use of crypto (and the DMCA's anti-circumvention provisions) in the printer market: extreme vendor lock-in. This is a problem for customers, and not all courts have looked favorably on the very concept, for precisely that reason. In the case of reagent swaps in particular, the SecureDNA system is resilient to them: its screening already accounts for that possibility and checks all 4! = 24 permutations simultaneously at zero additional computational overhead. (This doesn't lead to an increase in false positives because DNA is very non-random.)
Current U.S. biosecurity legislation (S.3741) mandates synthesis screening but places obligations on providers, not on benchtop synthesizer hardware post-sale. Edison, Toner & Esvelt (2026) demonstrated that unregulated DNA fragments sufficient to assemble 1918 influenza can be purchased for approximately $3,000 from dozens of providers, none of which verified identity or reported the attempt. Meanwhile, generative models like Evo 2 can now design functional viral genomes with novel proteins absent from any screening database. We present BioCompliance, an on-device compliance engine that transposes Anti-Money Laundering (AML) and Know-Your-Customer (KYC) frameworks to DNA synthesis. The system implements three deterministic enforcement modules: (1) tiered researcher credentialing, (2) evasion-zone sequence flagging, and (3) an Anti-Structuring Engine that detects suffix-prefix overlaps (15–40 bp) in temporal order histories, blocking split-order assembly attacks before physical synthesis. A Biosecurity Officer dashboard with SAR export and built-in red-teaming enables conformity assessment per S.3741 §4(a)(5)(A). Designed as a behavioral complement to sequence screening (SecureDNA, IBBIS commec), BioCompliance targets an invariant that persists regardless of sequence novelty: the physical overlap required for fragment assembly.
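The anti-structuring idea is mechanically simple to illustrate. The exact-match sketch below ignores mismatch tolerance, reverse complements, and the temporal windowing a real engine would need; all names are illustrative.

```python
# Toy suffix-prefix overlap check in the 15-40 bp assembly window.
def overlap_len(a: str, b: str, lo: int = 15, hi: int = 40) -> int:
    """Longest suffix of a matching a prefix of b within [lo, hi], else 0."""
    for n in range(min(hi, len(a), len(b)), lo - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def flag_pairs(orders):
    hits = []
    for i, a in enumerate(orders):
        for j, b in enumerate(orders):
            if i != j:
                n = overlap_len(a, b)
                if n:
                    hits.append((i, j, n))   # order i's tail assembles onto order j
    return hits

orders = ["A" * 50 + "ATGGCTAAACCGGGTTTTAAA",
          "ATGGCTAAACCGGGTTTTAAA" + "C" * 50]
print(flag_pairs(orders))                    # [(0, 1, 21)]
```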
Review 1
Risk Tiers (1–4): you may want to make these a little more multifaceted than the home organization (or lack of one) of the user, and/or have processes across the layers that allow some dynamism in the tier calculation.
Evasion detection: this would need to be set up to evolve as potential evasion techniques do. Solvable, but it is a perennial part of reality.
SAR review: this would have to have an entity required to perform it, as the banking industry does. It would augment the piece to at least acknowledge that such a policy path (or a roughly equivalent voluntary arrangement by benchtop synthesizer producers) would be required.
These functions from the financial sector have been mentioned for a while as analogues and potential models, so it's nice to see operational approaches pulled together like this.
Protein safety classifiers, used to flag toxic, virulent, or otherwise hazardous sequences, are increasingly built on top of pretrained protein language models (PLMs), yet little is known about how these models represent harm-related properties or what this implies for their reliability. We probe ESM-2 (esm2_t6_8M_UR50D) on three binary classification tasks of biosecurity relevance: peptide toxicity, pore-forming toxin (PFT) identity, and virulence. Using mass-mean and logistic-regression linear probes applied to CLS-token activations at every transformer layer, we report three findings. First, the pretrained backbone, which has never seen harm-related labels, already encodes these tasks in a form that is largely linearly recoverable, with probe accuracies typically reaching ~75–90% in intermediate and deeper layers. Second, fine-tuning yields its largest gains on the mass-mean probe rather than the logistic-regression probe, suggesting that it primarily improves alignment of existing task-relevant structure with class-mean directions, rather than substantially increasing linear separability. Third, zero-shot cross-task evaluation reveals partial but non-trivial transfer among the three tasks, consistent with shared underlying structure, with virulence-trained models generalizing most broadly and PFT-trained models producing an inversely correlated signal on general toxicity. These results suggest that current PLM-based safety classifiers may rely heavily on pre-existing, linearly accessible representations, potentially limiting robustness to distribution shift or adversarially constructed sequences. While linear probes demonstrate that harm-related information is present in model representations, they do not establish that deployed classifiers causally depend on the same features. Taken together, our findings highlight both the promise and limitations of PLM-based safety screening and motivate further work on robustness and failure modes.
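A mass-mean probe, as named in the abstract, is just the difference of class means used as a classification direction. The sketch below shows the construction with random stand-ins for CLS-token activations; the midpoint thresholding rule is an assumption.

```python
# Mass-mean probe sketch: classify along the direction between class means.
import numpy as np

def mass_mean_probe(acts: np.ndarray, labels: np.ndarray):
    mu_pos = acts[labels == 1].mean(axis=0)
    mu_neg = acts[labels == 0].mean(axis=0)
    w = mu_pos - mu_neg                      # harm-related direction
    b = -w @ (mu_pos + mu_neg) / 2           # threshold at the class midpoint
    return lambda x: x @ w + b > 0

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 320))           # stand-in for CLS activations
labels = rng.integers(0, 2, size=500)
predict = mass_mean_probe(acts, labels)
preds = predict(acts)
```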
Review 1
Great work on mechanistic interpretability. How do representations drive model behavior?
Review 2
It's a good diagnostic paper, asking a great question from a biosecurity perspective: what does a protein language model already “learn” about a protein's harmful properties? The linear probing methodology used here is valid and appropriate, and the discovery that harmful information is already embedded within the pretrained representation, with fine-tuning simply modifying separability, is intriguing, especially with the additional cross-task analysis indicating that it's the same structure being modified. Limitations are identified in the work itself, making it more credible. What can improve: the novelty of the work is questionable, in that linear probing has been done before, and the work does not expand the methodology used. It's not clear how this research leads to concrete changes and impact. Specifically, while robustness issues are discussed, no example of potential failure points, such as adversarial attacks on the classifier, is provided. Further testing could demonstrate the robustness problem in more detail. It's not quite clear how this applies to existing protein classifier systems, and an experiment would be nice here. Lastly, the paper lacks any specific recommendations on how safety pipelines could be improved with these results.
Review 3
As someone with only a glancing knowledge of computational biology, I feel underqualified to speak to the exact methodological design choices made by the authors. That being said, I believe that it is important to gain a better understanding of how PLMs "understand" elements of harm in order to both assess and improve their robustness against evasion attempts. This paper provides valuable indications with respect to the representations of harm in ESM-2 and sets the stage for further research in this area.
https://dgault2007.github.io/cloud-lab-compliance/ Cloud labs make it easy to submit and run experiments remotely, but they make oversight harder when review focuses on individual reagents or single steps instead of the whole experiment. This project is a proof-of-concept biosafety workflow analyzer that screens structured protocols for biosafety, biosecurity, chemical hygiene, shipping, hazardous waste, human-material, facility-capability, or custom policy triggers.
Review 1
This is a well-presented submission, with a clear theory of change. I invite you to read on current cloud lab risks and mitigation approaches and cause prioritization in biosecurity, starting e.g. here https://substack.com/home/post/p-192022274. The "controlled-material" rule looks for the literal string "select agent" rather than matching against the HHS/USDA Select Agents and Toxins list. The work would benefit from embedding even a static copy of that list; the same goes for a CDC chemical-terrorism-agent list, IATA dangerous-goods classes for the shipping rule, and RG3/RG4 organism lists for the BSL-mismatch rule. The risk score and confidence formulas, and how they correspond to the risk level, are not discussed in the paper. They should be presented with a brief explanation and justification of the chosen values. The confidence formula is confusing and I am not sure it is used as intended (e.g. missing metadata lowers confidence, which can lower the threat rating? If I understand correctly, this is not properly discussed in the paper). The LLM is asked to second-opinion the deterministic screening based on a summarized view, not to independently analyze the protocol, which is not acknowledged in the paper. The work would also benefit from comparing the tool to some kind of baseline, e.g. naive keyword matching.
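To make the list-matching suggestion concrete, a minimal sketch of a controlled-material rule that checks protocol text against an embedded static agent list rather than one literal string; the two entries shown are placeholders for the full HHS/USDA list:

```python
SELECT_AGENTS = [
    "Bacillus anthracis",
    "Yersinia pestis",
    # ...embed the full HHS/USDA Select Agents and Toxins list here...
]

def controlled_material_hits(protocol_text: str) -> list[str]:
    """Return every listed agent mentioned anywhere in the protocol text."""
    text = protocol_text.lower()
    return [agent for agent in SELECT_AGENTS if agent.lower() in text]
```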
Review 2
This project represents a first step towards (partially) automating protocol screening to address biosecurity concerns of workflows in cloud labs. The demo showcases an elegant GUI where the tool meaningfully extracts content from submitted protocols and uses an LLM for review. I would like to congratulate the author for this piece of work, well done! The project focuses on building an end-user tool rather than being a research project, as it lacks validation. The helpfulness of such a tool depends on the accuracy of its results, yet no empirical evaluation of the output was provided. The results section focuses on what the dashboard does rather than how well it screens, yet the latter is arguably more important for a proof-of-concept. Discussion of limitations was also lacking. Several questions immediately jump out, including calibration issues of LLMs (such as false negative/positive rates and how their judgement compares to ground truth), refusals by LLMs on dual-use protocols, the often inflated confidence levels reported by LLMs, and the weak distinction between biosafety and misuse risk (i.e. a protocol may be biosafe yet still pose a serious misuse hazard).
Review 3
Safety assessment of protocols for cloud labs is an important issue, and improving it is a really useful contribution. I also appreciate the author's focus on early triage, rather than adjudication. I was left a bit confused about the approach for doing the actual workflow analysis, and whether this was automated or human. If automated, it seems like much more detail is needed on the approach: how rules are implemented, what specifically the rules are, false positive rates, etc.
Current DNA synthesis screening infrastructure presents critical vulnerabilities to generative design tools. AI-designed protein variants can evade sequence-homology filters by altering sequence identity while retaining toxic function. Coordinated split-order attacks using short fragments bypass standard vendor screening. Bio-Shield addresses these vectors by shifting to a Zero-Trust defense-in-depth architecture. The Biorisk Triage Orchestrator (BTO) is a modular pipeline that acts as a Managed Access Wrapper for biodesign tools and a Layer-2 inspector for synthesizers. The approach integrates Overlap-Layout-Consensus (OLC) assembly to detect fragmented hazards, sliding-window Protein Language Model (ESM-2) scanning to catch AI-obfuscated chimeric toxins, and cyber-entropy checks for digital malware. It delivers a robust, cryptographically audited deployment framework for future biosecurity integration.
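A minimal sketch of the sliding-window scanning step, assuming a `score_window` callable that wraps an ESM-2-based hazard classifier (not reproduced here); the window width and stride are illustrative choices:

```python
def sliding_window_scan(protein: str, score_window, width: int = 60, stride: int = 10):
    """Score fixed-length windows of a translated sequence and return the
    highest-scoring window's (score, offset), so a hazardous domain hidden
    inside a benign chimera still surfaces."""
    if len(protein) <= width:
        return score_window(protein), 0
    return max((score_window(protein[i:i + width]), i)
               for i in range(0, len(protein) - width + 1, stride))
```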
Review 1
The proposed Bioshield pipeline integrates a multi-tiered approach to more robustly screen against various obfuscation approaches. The managed access feature of the pipeline is also interesting, but is light on validation and enforcement mechanisms. I am also skeptical about the predictive projector for such short fragments, but would be interested in seeing future work on it. The paper appropriately acknowledges the limitations and next steps, given the short duration of the hackathon. The successes outlined in the paper are encouraging, and it will be interesting to validate the package against a larger dataset of sequences and to provide more details on the implementation in screening tools.
Review 2
This is a well-scoped systems concept that addresses real and important weaknesses in current biosecurity infrastructure. The combination of upstream controls, sequence reconstruction for split attacks, and embedding-based scanning reflects a good understanding of how attackers might operate. The strongest part of your work is the systems thinking. You are not relying on a single detector, you are building a layered pipeline that acknowledges different failure modes. The explicit inclusion of OLC assembly for fragmented sequences is particularly useful and often overlooked. The main limitation is that this remains an architectural proposal with limited empirical grounding.
Review 3
I don't really see the benefit the approach provides, particularly in focusing on toxins. Toxins are generally quite easy to acquire; ricin can be made from castor beans. And although chimeric toxins are harder to acquire, that's not something adversaries typically want or need. Toxins are generally assassination weapons, because they are very difficult to aerosolize, and numerous easy, pre-existing methods exist for targeted killings, such as regular ricin, guns, bombs, knives, rat poison, etc.
The CDC's National Wastewater Surveillance System already detects anomalies in viral signal. The gap is what happens after the alert fires — public health officials receive a percentile number with no context for what's driving it, no assessment of which populations are at risk, and no specific recommended actions. BioSignal addresses this data-action gap with a three-layer pipeline built on top of existing CDC infrastructure. A data cleaning layer removes undocumented sentinel values from the raw NWSS dataset — including a 32-bit integer overflow artifact in the ptc_15d field that would otherwise generate false URGENT alerts from database errors rather than biological signal. A scoring layer ranks sites by population-weighted priority using a formula that combines site-normalized percentile (primary signal) with log-scale population as a tiebreaker. An LLM intelligence layer — activated only after the statistics confirm an anomaly — generates structured situational reports (SITREPs) covering catchment profile, signal drivers, and jurisdiction-specific recommended actions, with a priority tier of URGENT, HIGH, or ELEVATED. The core design principle is strict separation between statistical detection and LLM contextualization. The model does not decide whether an anomaly exists. The statistics do. This prevents the circularity failure mode where an AI both detects and explains a spurious signal. Validated against the December 2023 JN.1 variant surge, BioSignal correctly surfaced New Jersey, Las Vegas, and Boston as the top three priority sites 1–2 weeks before national hospitalization peaks. The system is fully open-source, runs on public CDC data, and requires no specialized biosecurity infrastructure to deploy. GitHub: github.com/lvjr3383/AI_Safety/tree/main/biosignal
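A minimal sketch of the scoring layer as described, with site-normalized percentile as the primary signal and log-scale population as the secondary term; the weights here are placeholders, not the submission's actual coefficients:

```python
import math

def priority_score(percentile: float, population: int,
                   w_signal: float = 1.0, w_pop: float = 0.05) -> float:
    """Combine the site-normalized percentile (primary) with log-scale
    population (secondary) into a single ranking score."""
    return w_signal * percentile + w_pop * math.log10(max(population, 1))

# Rank: sorted(sites, key=lambda s: priority_score(s.pct, s.pop), reverse=True)
```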
Review 1
Using LLMs to explain statistical signals seems interesting to me. There is lots of work (and domain expertise) that goes into understanding and explaining changes in CDC data streams. I can imagine a benefit of an approach like this would be that it could take in signals of different kinds (from different data sets) and synthesise them, beyond the current wastewater data stream. Of course, LLMs are not the most explainable of all methods. One way to make the current project stronger would be to validate it more systematically (it's not that convincing currently, but that is understandable given the timeframe).
Review 2
Clear, useful, tested on real CDC NWSS data, strong data-cleaning work, excellent presentation. Need to validate against clinical outcomes data before any operational use.
Review 3
First off, I appreciate the attention to data cleaning, the clear separation between the statistical signal and the LLM's contextualization of that signal, and the clarity of the presentation! My main critiques are as follows:
On the score formula: I'm not sure what motivates this formula, and it seems like it'd be better to just use the percentile directly and rank by that, or come up with a more principled way to incorporate population. The log population term currently does not serve as a tiebreaker for sites with the same percentile (as described); it's just added to the percentile. It also seems like you'd use population density, rather than population, if you were trying to account for the increased transmission potential in urban areas. There's no explanation for why those particular coefficients were chosen and no normalization, so I suspect it would be easy to plug in numbers and get results that contradict what the score is trying to track (e.g. a high population but low percentile can give you a higher score than a low population with a high percentile, irrespective of how spread out that population is).
On the LLM interpretation of the anomaly: I suspect this is mostly the LLM recapitulating its training data. In the examples, it doesn't really get enough context (from what I can tell) to provide reliable information, and a lot of this is probably hallucination. Also, even if it is correct, it doesn't validate the approach, as you'd need to use surges that are after the training cutoff to see if it generates useful summaries.
On validation: Besides the training-cutoff issue, the score will obviously give similar results for many historical cases as the percentile CDC already uses, seeing as it is just the percentile times some number plus some function of population. Really validating this would require a much larger set of cases, and I think you'd see more failures at the extremes from the issues I outlined above.
On impact: My guess is that public health officials whose job is to respond to these incoming percentiles already have the context they need to interpret them, or at least more context than the LLM can provide from its training data plus the same number the officials get. Most of the issues here probably stem from institutional access to data and cross-communication between institutions, rather than an inability of the employees to know what the number means and what to do about it. If anything, the score formula and LLM layer on top of the percentile are clouding the important, interpretable variables that ought to drive action, and it seems more efficient for officials to review percentiles and population densities themselves than to read the LLM's recommendation report. Another small note is that the CDC's wastewater monitoring is not pathogen-agnostic, so amplifiers on that system are not very impactful for addressing emerging threats. The most useful version of this kind of thing would need to be co-designed with the public officials themselves, and would probably require doing something like RAG with documents and data sources from multiple institutions to provide and summarize context that they do not already have access to, and which is not already part of their daily workflow.
Locus automates researcher credential verification for DNA synthesis screening by mapping ORCID publication profiles to NCBI taxonomy identifiers. A novel trajectory feature detects deliberate biological capability acquisition that static credential checks cannot capture.
Review 1
The Chrome extension functions well and behaves as intended. Both the report and the demo video are clearly and professionally assembled. The concept of using ORCID and PubMed to automate KYC credential verification for DNA synthesis customers is reasonable, but it comes with notable limitations, many of which the report already acknowledges. In practice, this approach verifies only a narrow subset of customers (primarily academic researchers), who are also among the least likely groups to misuse synthetic DNA. In academic environments, orders are often placed by lab managers or core facility staff who may have no publication record at all. For such cases, incorporating institutional email-domain verification could strengthen the workflow. Additionally, individuals with only review, commentary, or perspective articles on the listed organisms, without any hands-on research experience, would likely pass the current screening, which weakens its effectiveness. The implementation also appears to misunderstand the split-order problem. The issue is not when someone orders DNA fragments from multiple listed organisms; rather, it arises when a customer orders different segments of a gene or genome from the same listed organism in separate transactions. Overall, this is a good and creative attempt with some novelty, but its impact is limited by the constraints of the proposed KYC solution relative to the broader challenge.
Review 2
Locus presents a browser-based tool for automating researcher credential verification in DNA synthesis screening by linking ORCID profiles to biological taxonomy. The core idea is relevant for AI biosecurity, but the current system is somewhat simplified. In practice, robust KYC-style verification would likely require richer and more diverse data sources beyond taxonomy alone. It may also struggle with edge cases such as generalist researchers, sparse publication records, or non-academic actors. That said, execution is good for a hackathon project, including the video demonstration.
This project proposes tools for faster identification, classification, and response to AI biosecurity “warning shots”—near-miss events that reveal catastrophic risk—so as to translate those events into governance action by relevant policy, intelligence, and other actors. We use a formal definition of warning shots and associated criteria to analyze 21 global biosecurity events, finding that warning shots are typically recognized but rarely converted into binding governance. We propose that analysts and governance actors use a Governance Conversion Framework to improve future AI biosecurity warning shot responses. We develop an AI biosecurity event dashboard and a set of analytic tools to help biosecurity stakeholders monitor world events, categorize emergent biosecurity risks, and trigger faster responses with relevant government and industry actors.
Review 1
Innovation
There is a genuine gap here: no public-facing warning-shot dashboard exists specifically for AIxBio, and analysts and policymakers do need a structured way to triage emerging biosecurity signals against historical patterns. You correctly identify this gap and build something concrete to address it. The combination of a warning-shot classification + risk score + governance-conversion lens is a reasonable contribution package, and the headline insight (that the bottleneck is conversion, not detection) is a useful framing. However, you do not engage with relevant prior literature. The Institute for Security and Technology released an AI Loss of Control Risk: Indications & Warning framework in February 2026 that uses the same intelligence-community I&W methodology you draw on, with a five-level severity scheme. It touches a different set of threat scenarios (AI loss of control, not bio), but it is the closest direct analog to what you're building and should be cited. More fundamentally, your Governance Conversion Framework is a domain-specific application of focusing-event theory in political science (Birkland's After Disaster, 1997, and 30+ years of follow-up work; Kingdon's multiple-streams framework). Your "Stage 3 stall" finding is a special case of what focusing-event scholars have been documenting for decades. Near-miss management literature in industrial safety, aviation, and mining is a third unacknowledged precedent (e.g., MSHA's quarterly near-miss reporting mandate is a working example of the governance infrastructure you say is missing). Adding a paragraph that situates GCF within these traditions would substantially strengthen the contribution. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2300441 for a starting point.
Execution
Your Risk Score is overcomplicated. Score = (S − 5) / 20 × 100 simplifies to (S − 5) × 5. Either justify the form or simplify. Relatedly, if the post-hoc step is to subtract 5, why not score each pillar 0–4 to begin with? Justify those choices, or don't introduce them. You then describe the methodology as if the score were more granular than it actually is. With five integer pillars in [1, 5], the composite has at most 21 distinct possible values, all multiples of 5 after normalization. Your priority band "Low (0–29)" therefore contains only 6 reachable values, not a continuous range. Reporting the score as a 0–100 number implies a level of resolution that the underlying scale does not support; justify the choice. A bigger issue with the scoring is that it is ordinal, and you do not justify how the ordinal scale corresponds to actual quantitative risk. Adding ordinal categories together is mathematically problematic. Expert validation is out of scope for a hackathon weekend, but the paper would benefit from at least acknowledging the measurement-theory limitations and discussing methods of addressing them in future work. See e.g. https://arxiv.org/pdf/2103.05440 for the standard reference. The C1–C5 criteria are vague in places. What "epistemically accessible at the time" means in a coding rubric is not operationalized: what counts as accessible, to whom, on what evidence base, judged at what time horizon? Standard SOTA for qualitative classification frameworks is to report inter-rater reliability (Cohen's κ or Krippendorff's α) across at least two independent coders. You don't do this, for understandable reasons given the time budget, but this should be explicitly listed as a limitation rather than left implicit.
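The resolution point is easy to verify directly; a quick enumeration under the paper's stated scheme (five integer pillars in [1, 5], normalized as (S − 5) / 20 × 100):

```python
from itertools import product

# Enumerate every reachable composite under the paper's scheme.
scores = sorted({(sum(p) - 5) / 20 * 100 for p in product(range(1, 6), repeat=5)})
print(len(scores))                    # 21 distinct values
print(scores)                         # 0.0, 5.0, ..., 100.0 -- all multiples of 5
print([s for s in scores if s < 30])  # the "Low (0-29)" band: only 6 values
```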
Your case selection methodology is not described in enough detail. The Data Collection appendix lists source types (peer-reviewed literature, institutional reports, investigative journalism, government records) but not the inclusion criteria, search strategy, time bounds, exclusion rules, or selection process. You acknowledge "potential selection bias" generically, but don't describe how borderline cases were handled. Were you systematically searching or working from familiar examples? Without this, readers cannot assess whether the headline pattern reflects the world or your sampling. You should state explicitly whether the dataset is intended as exhaustive, illustrative, or convenience-sampled. Your headline finding that most cases stall at Stage 3 rests on counts in a hand-curated sample of n=21 (and the GCF stages themselves were derived from n=13). There is no statistical treatment of this claim, no confidence interval, no comparison against a base rate, no engagement with the focusing-event literature where this same pattern has been studied at much larger scale. The conclusion states "as AIxBio risks continue to converge and accelerate, the cost of stalling at Stage 3 will only increase." Cost is never assessed anywhere in your framework; your score measures risk, not the expected cost of governance failure. Either add a cost-of-inaction component (or even a placeholder for one), or rephrase the conclusion to match what your tool actually measures. You write that "Figure 3 illustrates the distribution of cases across time, tier classification, and risk level, showing increased clustering in recent years." Recent-years clustering is exactly what you'd expect from any non-exhaustive OSINT-curated dataset (recent events are better-documented and more salient to curators). If you want to claim a real temporal trend, you need either an explicit exhaustiveness argument or an analysis that controls for differential discovery rates across decades.
Presentation
The paper itself is clear, well-organized, and easy to follow. Section structure is logical, the three-stage pipeline (OSINT scanning → risk scoring → GCF assessment) is communicated cleanly in Figure 1, and the writing is at a level appropriate for a policy audience. Figure 1 writes the composite formula as S = H · E · C · V · DR, which parses as multiplication, while Section 3 of the text gives S = H + E + C + V + DR; the two should be made consistent. Figure 1 is also not very readable, especially being placed before the part of the paper that explains it. Accessibility of the dashboard is poor. The low-luminance green-on-black text appears to fail WCAG AA contrast minimums in several places, and color carries a substantial semantic load (yellow = high, green = pass, red = critical, dim green = inactive) without redundant text encoding, which loses information for the ~8% of men with red-green colorblindness. See https://www.w3.org/WAI/WCAG21/Understanding/contrast-minimum for the standard. Adding a high-contrast mode and redundant text labels alongside color codes would address both issues.
Review 2
Covers an important topic in AI x Bio risk, which is recognizing early warning signals and then acting upon them in a way that mitigates future risk. Dashboard works very well for what it sets out to do and is useful as a means of getting an overview of historical early warnings. As an educational (or even eventually a research) tool it is a great addition. However, it is in the governance aspect where - as revealed by the project - the key deficits lie. I am not convinced overall that a dashboard, even one that logs governance failures in responding to early warning, gets at the crux of the problem. This is because I do not believe that the problem is one of insufficient awareness on the part of authorities, but rather insufficient prioritization and over-politicization of biosecurity risks. I do not see a dashboard - even a very attractive and user-friendly one - doing much to mitigate these obstacles. Maybe around the margins, because dashboards allow a clear and concise story to be told, but I do not think this is going to do much to move the needle on actual government response to risk. Overall, a well-executed project with strong informational and educational value, but it is unlikely to directly contribute much to decreasing biosecurity risk.
Review 3
I will mostly focus on the dashboard and some of the classifications I looked at in detail. In principle, I believe warning shots can be a really powerful motivator for policy changes, and I do fear that AIxBio will only be taken fully seriously once an actual misuse event happens. That being said, I am a little doubtful of some of the scores, e.g. why various flu transitions and spillovers are classed as high-risk AIxBio warning shots. I agree that those events are high risk, but they don't particularly demonstrate a vulnerability that was previously unknown, nor are they AIxBio related. The website can be very nice as a resource for biosecurity researchers searching for more context and rough assessments of different historical and ongoing events, but I am a little doubtful that it can serve as a policy-oriented platform for warning shots.
This work examines the extent to which a safety monitor for large language models (LLMs), configured for biosecurity, can be evaded using ordinary natural language, without adversarial prompting or technical expertise. We evaluate a monitor deployed via API across five bio-relevant topics under three controlled conditions: direct requests, single-turn contextualized queries, and multi-turn conversational sequences. The results show that detection behavior is strongly associated with the semantic orientation of the evaluated request. Operationally framed queries are consistently flagged, whereas descriptively framed queries, particularly in academic contexts, frequently evade detection. In multi-turn settings, sequences composed exclusively of descriptive turns achieve systematic evasion within the evaluated environment, even when they produce technically detailed outputs in the full assistant setting. Based on these findings, we introduce the concept of semantically mediated distributed extraction, in which no individual turn triggers detection, but the overall interaction produces sensitive information. We also document a divergence between API-based monitoring and full-model behavior, suggesting that architectural factors may influence robustness against context-based evasion strategies. We propose incorporating semantic orientation, distributed extraction, and cross-environment divergence into evaluation frameworks for AI safety systems. These results highlight the relevance of semantic and interaction-level dynamics in the analysis of monitoring mechanisms.
Review 1
Great job and interesting project choice! Your writing is verbose and some sections are repeated. That being said, most sections are necessary and very clear & understandable. Your work also formally identified the effect of something many biosecurity researchers probably felt was important, paving the way for better safeguards down the line. Send this to biosecurity researchers you think might benefit from knowing about it!
Review 2
Clean A/B/C design. The monitor is a custom-prompted Claude API binary classifier, so I would suggest either reframing the claims about biosecurity LLM gatekeepers or adding a second monitor (e.g. Llama Guard). I also suggest strengthening the description of the browser evaluation by providing some kind of structured rubric, even a coarse one.
Review 3
The A/B/C framework is more disciplined than typical jailbreak demonstrations, and the complete-separation result in the logistic regression is a cleaner empirical finding than most evasion papers offer. But the work's framing travels further than the evidence supports, because the monitor evaluated is a strawman. A one-line system prompt asking for FLAG or PASS is not representative of production biosecurity monitoring, which uses constitutional classifiers, multi-stage pipelines, or fine-tuned detection models; showing that a minimal binary classifier fails on descriptive academic framing tells the reader something about that specific configuration rather than about LLM monitoring generally. The architectural divergence observation compounds this, because the comparison is between the author's custom classifier and the full Claude assistant in the browser. These are not two monitoring architectures but a strawman classifier and a production system doing different things, so the inference that model sophistication moderates evasion is not actually supported by the comparison. The most valuable single revision would be evaluating the same framework against at least one more sophisticated monitor (the Sharma et al. constitutional classifier the author already cites is an obvious candidate), which would either generalize the finding or sharpen it into the architectural result the paper currently gestures at without testing.
Pandemic response depends on accurate detection and measurement. However, testing data has blind spots or poor visibility in many areas. During the critical early pandemic stage, predicting which communities and neighborhoods have blind spots can help control pandemic spread. Missing tests also matter in determining population antibodies, group vulnerability, and several other pandemic measures. In contrast to extensive work on pandemic mortality, the data on who takes tests and who does not is just as relevant but neglected in pandemic management. This project builds a prototype pandemic response atlas using COVID-19 in France and the greater Paris area as a case study. The prototype is an R Shiny web application that provides three map layers: COVID testing visibility across Île-de-France, IRIS-level socioeconomic conditions, and a high-resolution 200m Paris socioeconomic layer. Its main contribution is improving how public-health surveillance is interpreted by placing it in local socioeconomic context. Based on socioeconomic conditions, policy makers can then preventively protect the blind-spot areas.
Review 1
The submission primarily deals with tracking testing and potential areas of testing blind spots in conjunction with socioeconomic indicators. This is of secondary importance for pandemic early warning, as it relates to the detection and tracking of an outbreak already spreading and identified (hence the availability of targeted tests). While the methods seem robust, and it's true that low testing can mask the presence of a pathogen, particularly when better-off areas are testing more, the proposal is not very novel, as undertesting is a well understood phenomenon. The development of a dashboard to track this is potentially useful, although very much a proof of concept as developed here.
Review 2
- Including a concrete takeaway or learning from your particular Île-de-France data would have added value by showing the practical insights this approach can generate.
- Some quantitative estimate of the degree of the discrepancy between testing and socioeconomic environment, and of its consequences for morbidity and mortality, would be interesting to demonstrate the scale of the problem.
Review 3
I think the project is right to point out that there is lots of variation in testing and that this matters for ID surveillance. I don't personally know the extent to which considerations like this were properly accounted for, e.g. during COVID-19. As the author notes, this could be important for pandemic management (more so than early warning, in my opinion). The project doesn't make claims about being validated, etc. (and notes this as an opportunity for future work). To go from these maps to making decisions, I agree that decision makers would want to see some kind of validation. For now this is more of an exploratory tool. In terms of general applicability for pandemic preparedness, I wonder about the data streams (testing uptake) and how available they would be in a novel outbreak.
Project MOSAIC provides an open-source, Protein-Aware Mock Screener tool designed to defend against "context-scrubbed" multi-agent LLM workflows. We demonstrate that while simple DNA-level screening (Hamming distance) can be trivially evaded by generating synonymous codon substitutions using commercial LLMs (yielding 3 unique evasion payloads across 9 tested trials), our Layer-2 Protein Translation Screener catches 100% of these adversarial payloads by examining the translated amino acid homology, providing a robust defense against synonymous-only substitutions. This project explicitly models how malicious users might split a dangerous request into benign subtasks across different models, and provides the defensive code necessary to catch the resulting obfuscated sequences before they reach physical synthesis.
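A minimal sketch of the protein-level check that defeats synonymous-substitution evasion: translate the ordered DNA and screen the amino acid sequence, so codon changes that preserve the protein cannot change what is screened. Biopython's standard-table translation is used; the watchlist entry is a placeholder, and a real screener would use alignment-based homology rather than exact substring matching:

```python
from Bio.Seq import Seq

WATCHLIST_PROTEINS = {"MKTAYIAK": "placeholder_hazard"}  # illustrative entry

def protein_level_hits(dna: str) -> list[str]:
    """Translate all three forward reading frames and report watchlist
    proteins that appear in any translation."""
    hits = []
    for frame in range(3):
        coding = dna[frame:]
        coding = coding[:len(coding) - len(coding) % 3]  # trim to whole codons
        protein = str(Seq(coding).translate())
        hits += [name for seq, name in WATCHLIST_PROTEINS.items() if seq in protein]
    return hits
```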
Review 1
I appreciated the attempt at a red-teaming pipeline and found the paper relatively clear to read. The use of multiple models in its pipeline was also appreciated. However, the obfuscation techniques and screening strategies proposed are not novel compared to state-of-the-art screening tools. In addition, while the paper proposes defensive screening for benchtop synthesizers, the bulk of the paper focuses on its red-teaming and screening strategies and not enough on implementation methods specifically for benchtop synthesizers. I would either home in on that implementation or on novel red-teaming and stress-testing strategies, building on methods used by current screening tools.
Review 2
It’s an interesting project that encompasses red teaming, tool development, and policy recommendations. The “multi-agent context-scrubbing” attack is clearly described. The end-to-end approach here is very good, in that it starts with a vulnerability, exploits it, defends against it, and ties it back to policy. Empirically, the project has good design that includes a control group and uses cross-model testing on several frontier models. Defensively, it makes an applied screening argument, which is further improved by using an open-source protein-level screen. Policy-wise, it discusses the risks of benchtop synthesizers, which are highly relevant and one of the main bottlenecks of DNA synthesis screening, and the project presents somewhat useful policy guidance on this topic. Where it could be stronger: novelty varies. The biological principle that screening at the protein level works better than screening at the DNA level is common knowledge in industry, as is the use of HMMs in screening (commec uses them). The project also sometimes overstates the novelty of the work; claims such as "absolute mathematical guarantee" are quite hyperbolic, considering the authors' admission of non-synonymous evasions and structure-preserving mutations. The experiments are constrained by the choice of a benign sample (GFP), and lack tests against more recent synthesis companies, modern screening pipelines, or more sophisticated attacks.
DNA synthesis screening is a critical biosecurity chokepoint, but current tools detect dangerous sequences by similarity to known threats — a paradigm that collapses against AI-assisted protein design. Using ProteinMPNN and ESMFold, we generated 12,000 evasion variants of 50 known toxin proteins across 12 sampling temperatures, producing sequences with as little as 11% mean identity to known toxins. We evaluated ESM-C 600M protein language model embeddings as a function-aware alternative against BLASTp and commec baselines. At T=1.5, BLASTp detects 3.5% of variants and commec detects 2.9%, while ESM-C kNN maintains 79.5%. Among variants structurally predicted to retain wild-type function, commec detects 0% and BLASTp detects 3.8%, while ESM-C kNN detects 100%. We additionally identify organism-matched negative construction as a necessary methodological requirement for honest evaluation in this space, showing that naive dataset construction inflates AUC by up to 0.016 and FPR by 58 percentage points.
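A minimal sketch of the embedding-kNN detector evaluated here, assuming an `embed` function that returns mean-pooled ESM-C embeddings (not reproduced) and a labeled reference set; the names are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_knn_screen(ref_embeddings: np.ndarray, ref_is_toxin: np.ndarray,
                   k: int = 5) -> KNeighborsClassifier:
    """Fit a cosine-distance kNN over reference embeddings so that queries
    landing near known toxins in function space are flagged, however far
    they have drifted in raw sequence identity."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(ref_embeddings, ref_is_toxin)
    return knn

# flagged = fit_knn_screen(E_ref, y_ref).predict(embed(query_sequences))
```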
Review 1
I love that you have taken the next step in answering this question. Your approach builds on past knowledge but produces important additional insights. I was also glad to see that you map out clearly how your work impacts our knowledge base and how it shapes necessary future research. I think you answered an important question and have directly contributed to efforts to strengthen synthesis screening against AI-based circumvention. It is a shame that you did not, or were unable to, test this against the full commec installation. It would have further increased the salience of your work and the knowledge generated.
Accurately measuring the biosecurity risks of rapidly advancing AI models is a critical security challenge. Current benchmarks rely on static, multiple-choice tests that act as "rule-out" evaluations, failing to capture realistic, multi-turn adversarial interactions. Conversely, high-fidelity human uplift studies are accurate but take far too long to match modern AI deployment cycles. To bridge this gap, we introduce Biorisk-gym, an automated, dynamic framework prototyping scalable "rule-in" evaluations. Our approach utilizes a three-agent architecture—an adversarial Gamemaster, a Target model under evaluation, and a Judge—to simulate multi-turn escalation across six stages of biological threat creation. We evaluated Claude Haiku 4.5, Claude Sonnet 4.6, and Llama 3.3 70B across 12 distinct attack scenarios. Results reveal stark, model-dependent vulnerabilities: Llama 3.3 70B demonstrated the highest mean peak biorisk uplift, whereas Sonnet 4.6 exhibited robust safety mechanisms. Crucially, nearly 40% of scenarios reached their highest threat scores on later conversation turns, demonstrating that single-turn evaluations systematically underestimate an AI's vulnerability to sustained adversarial pressure. We recommend next iterations of Biorisk-gym that aim to provide a powerful pre-release methodology for identifying dangerous latent capabilities. Finally, because these automated evaluations generate highly sensitive attack protocols, we propose a secure, tiered deployment roadmap to responsibly mitigate the resulting infohazards.
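A schematic of the three-agent loop described above, with `ask_gamemaster`, `ask_target`, and `ask_judge` standing in for the respective model API calls (hypothetical helpers, not the authors' code):

```python
def run_scenario(scenario: str, ask_gamemaster, ask_target, ask_judge,
                 max_turns: int = 10) -> list[float]:
    """Play one multi-turn attack scenario and return per-turn biorisk
    scores; the scenario's result is the peak score and the turn it occurs
    on, which is exactly what single-turn evaluations miss."""
    history, scores = [], []
    for _ in range(max_turns):
        prompt = ask_gamemaster(scenario, history)   # adversarial escalation
        reply = ask_target(prompt, history)          # model under evaluation
        scores.append(ask_judge(scenario, prompt, reply))
        history.append((prompt, reply))
    return scores

# peak_uplift = max(scores); peak_turn = scores.index(peak_uplift)
```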
Review 1
The authors present an important perspective on the need for rule-in evaluations that are more efficient than human uplift studies, and that incorporate multi-turn dynamics. While the three-agent architecture (gamemaster, target, and judge) is in accordance with canonical approaches from literature, we are appreciative of the work. Two minor comments are that the Virology Capabilities Test includes the grading of open-ended responses with a rubric, and that limitations regarding compute credits are understood. More substantial adjustments that could benefit this work include precise descriptions of the experimental design. For example, how were specific personas selected for testing? Why were the Claude Haiku 4.5, Claude Sonnet 4.6, and Llama 3.3 70B models selected for evaluation? How is language model engagement measured? Also, the metrics used for evaluation require revision - for instance, the sum of specificity and accuracy should be justified or alternatives should be considered.
Review 2
This project produces a framework for multi-turn model assessment, elegantly quantifying a known phenomenon by showing that a significant fraction of scenarios escalate beyond the initial response. The extensive limitations and future work sections demonstrate real thoughtfulness on the part of the authors and should be commended. I would like to congratulate the team for this piece of work, well done! Accurate scoring is the hardest part of this pipeline, and the three criteria used (engagement, specificity, accuracy) only partially capture how much uplift a response actually provides. Additional criteria such as detail, actionability, and potential for harm would better reflect real-world threat models. The reasoning behind the score formula could also be made clearer. For instance, how would scores differ between a full virus production protocol with no numerical parameters versus one with all the details? Calibration against human scoring would help reveal where model judgments diverge. As acknowledged in the limitations section, scoring is bounded by the judge model's capability ceiling, self-preference bias, and refusals. These represent key challenges of the framework and are worth exploring further. The finding that biorisk assistance capabilities differ between models is well expected, so characterising the qualitative differences would add significant value to the findings. While limited time and infohazardous content are acknowledged, additional details such as a product demo, a benign example, the distribution of scoring, or high-level information on parameters like persona and scenario would strengthen the proof-of-concept.
Review 3
The rule-in vs rule-out framing is exactly right, and the finding that nearly 40% of scenarios peaked on later turns rather than turn 0 is the most policy-relevant empirical result in this cohort. Single-turn evals systematically miss real risk, and you've demonstrated that with real models. The ceiling right now is judge reliability: Haiku 4.5 evaluating Haiku 4.5 conflates escalation with same-family self-agreement, and the expert-grounded rubric path you outline is the right approach.
Quantifying risk at scale requires defining and prioritizing hundreds of causal paths to harm. I present an automated pipeline that (i) extracts causal chains from a single source document with an LLM, (ii) collapses near-duplicate nodes using embeddings and paired merge-proposer / merge-validator LLMs, and (iii) elicits Beta and PERT priors per node. Nodes in the risk model are then ranked by betweenness centrality, Birnbaum importance, and Expected Value of Partial Perfect Information (EVPPI), after Monte Carlo sampling. The pipeline is demonstrated on biorisk and serves as a proof-of-concept that LLMs can prioritize risk and indicate where new evaluations would most change downstream decisions.
Review 1
The problem area is neglected - we need more rigorous tooling for intervention prioritization in AIxBio, and this work addresses that gap. However, the intended audience and theory of change are unclear from the paper; there are some mentions in the limitations and future work section, but it should be clear from the beginning who this tool is directed at, how it should be used, what problem it solves, and what the downstream implications are. "Granularity" is undefined (although the author notes this explicitly, which is appreciated). The paper would benefit from an operational definition and a sensitivity analysis. Using LLMs to elicit priors is risky, since LLMs are often overconfident and can produce numbers detached from reality, although I did not read the AutoElicit paper and I can see how this is a necessary choice given the scope of the project. However, Gemini 2.5 Flash Lite is a weak choice for this, since it's a step that is genuinely difficult for LLMs, which is why a frontier model (or several) should be used - I understand the budget constraints, but the paper would benefit from an acknowledgement here. Connected to that, I have a problem with how P(X | any parent active) is a single Beta when the actual conditional probabilities could vary substantially across the parents. Either model parent-specific conditionals, keep the structure as a forest, or argue why the merged single-parameter approximation is acceptable. As written, the merge step invalidates the elicitation model; this is the most serious methodological issue I have with the paper. Standard betweenness sums over all s ≠ t, but in a risk DAG, only source-to-outcome paths are semantically meaningful. A source-to-outcome restricted betweenness (or flow betweenness) would be more appropriate; see the sketch below. Notation is not properly introduced in the paper, which makes it hard to read and sometimes confusing, especially with some indices colliding. Figure 2 is unreadable - the caption says "zoom for node labels", but those labels are unreadable at any zoom level. This is the only chance the reader has to see what the graph contains, especially with the pipeline being built in a way that does not allow easy replication.
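The restricted-betweenness suggestion maps directly onto an existing networkx call; a minimal sketch, assuming sources are the DAG's in-degree-0 nodes and outcomes its out-degree-0 nodes (an inference about the risk model's structure, not the author's code):

```python
import networkx as nx

def source_outcome_betweenness(dag: nx.DiGraph) -> dict:
    """Betweenness restricted to source-to-outcome paths, so nodes are
    ranked only by their role on semantically meaningful risk pathways."""
    sources = [n for n in dag if dag.in_degree(n) == 0]
    outcomes = [n for n in dag if dag.out_degree(n) == 0]
    return nx.betweenness_centrality_subset(dag, sources, outcomes, normalized=True)
```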
Review 2
- A limitation for discussion is that biorisk modeling can be significantly based on classified information or infohazards, which can reduce the overall applicability.
- Expanding towards taking in multiple sources: in the real world, multiple sources would inform overall evaluation priorities. I was not quite sure from the methodology if that would be possible.
- Explain why evaluation prioritization is a current bottleneck in biorisk reduction.
- A paragraph on how plausible your results were would have been helpful. Do the recommendations broadly align with overall evaluation priorities? You could discuss how human graders would judge the outputs of the automated scoring and their plausibility.
- A control condition comparing this method to asking an AI agent directly to extract the information, and the differences in outcome, would have been interesting.
BioScreen is a function-aware biological sequence screening system designed to address a key biosecurity weakness in sequence-similarity-based DNA synthesis screening: adversarially engineered proteins can preserve harmful function while drifting far from known threats in sequence space. The project fine-tunes ESM-2 3B with a multitask objective combining binary threat detection, mechanism-of-harm classification, supervised contrastive learning, and PGD adversarial training in embedding space, so that sequences cluster by biological effect rather than superficial resemblance. On a 2,913-sequence evaluation set, the production model reports AUROC 0.998 and AP 1.000, outperforming both a 3-mer BLAST proxy and pretrained ESM-2 similarity baselines, while also classifying seven mechanism-of-harm categories with 88.6% mean per-class accuracy. The system also includes certified robustness analysis via randomized smoothing and a deployment profile showing 38–45 sequences per second on a single NVIDIA H200 GPU, comfortably above commercial synthesis throughput needs. The main limitation is dependence on curated UniProt Swiss-Prot data, which may leave gaps for novel synthetic proteins or underrepresented threat classes; the author therefore positions BioScreen as a first-pass filter that should be paired with expert review for high-consequence cases.
Review 1
You clearly stated the problem space and framed the work within it well, but a few things stood out that were unclear to me or unspecified, most importantly around the use of "function." - In the statement "organises the embedding space so that functionally similar sequences cluster together regardless of sequence identity," how were functionally similar sequences validated as such? Were these related wild-type proteins from UniProt, or were these the generated variants? - If the adversarial variants were generated by ESM2 and also validated/evaluated against the model, wouldn't there be a bias impacting the results? - How (and why) were those 7 expanded threat category labels selected and defined? As a potential limitation or future consideration: if we should be sequence screening at multiple points of a design-build-test cycle, not just at synthesis, or considering where embedded solutions are needed, where do you see a solution like this falling in terms of performance, price, and, more broadly, accessibility?
Review 2
The author identifies and works to tackle a significant biosecurity challenge related to screening novel agents, and the notion of using AI to accelerate identification of harmful sequences seems useful. The technical assessment is beyond my personal knowledge, and a simplified explanation would help for a lay audience, but the nature of the contribution seems more geared towards technical audiences, so that's more a nice-to-have than a major problem.
Wittmann et al. (2025) demonstrated that generative protein design tools can produce variants of dangerous toxins that evade existing DNA synthesis screening by preserving biological function while rewriting amino acid sequence — a paraphrase attack. Existing screening systems fail because they rely on sequence similarity, asking "does this look like a known toxin?" rather than "can this do what a toxin does?" This project addresses an upstream gap: biology-oriented large language models (LLMs) that can assist users in designing such attacks. We present a four-layer, model-agnostic biosecurity screening system that sits at the interface of biology LLMs and prevents them from being used for protein paraphrase attacks. Our system combines intent detection, LLM-based self-critique, functional toxicity motif analysis, and ESM-2 embedding-based functional similarity scoring. Evaluated using ProLLaMA as the underlying biology LLM, the system blocks 96% of attack prompts while maintaining a 0% false positive rate on legitimate biology queries. Without screening, ProLLaMA answered 100% of attack prompts freely, demonstrating the critical need for this intervention.
Review 1
This project addresses an important problem with a commendably rigorous approach. However, the current results show only limited success. The intent-detection layer demonstrates partial effectiveness, as it still struggles to identify evasive prompts that avoid explicit toxin names, which is acknowledged. The function-based screening using motifs and token embeddings moves in a promising direction, but the present implementation lacks sufficient specificity. Given the constraints of a hackathon timeframe, this effort is appreciated.
Review 2
Good work for creatively choosing a problem to solve. Models like BioGPT/ProLLaMA are neglected by the biosecurity community and some of your findings here could potentially contribute to safer interfaces with such models. Given their open source nature, though, I'm not sure how tractable solving the uplift-from-bioLLMs problem is.
Review 3
This is a practical and well-scoped project. You identify a real failure mode in current biosecurity systems and place the defense at a high-leverage point, the LLM interface. The layered design is a strong engineering choice, and the evaluation clearly shows that the intervention changes model behavior in a meaningful way. The main weakness is confidence in the results. A 96% block rate with 0% false positives suggests the evaluation setup may be too clean. Real users will not submit obvious “attack prompts”. They will obfuscate intent, split tasks across multiple queries, or mix benign and harmful goals. Also consider system-level limitations. A model-agnostic wrapper is useful, but attackers may bypass the interface entirely or fine-tune their own models. Address where this fits in a broader defense-in-depth strategy. If you can demonstrate robustness under adversarial conditions and clarify the biological validation layer, this could move from a solid engineering solution to a deployable safety system.
BioRefusalAudit measures whether a model's refusal is structurally real or just surface-level. Using sparse autoencoder interpretability (both off-the-shelf Gemma Scope SAEs and a biosecurity-specific SAE trained during the hackathon), it computes a divergence score between what a model says and what its internal activations show. Key findings: Gemma 2 never genuinely refuses; it hedges. A single chat-template token takes Gemma 4 from 65 refusals to zero. Both models refuse nothing at 80-token caps. And the refusal circuit fires harder on psilocybin (biologically benign, Schedule I) than on genuinely hazardous biology. None of this is visible to surface evaluation. Runs on a 4 GB consumer GPU. Dozens of trial runs with various models, contexts, and controls show consistent results. Code and data are open yet secured under Hippocratic License 3.0, with select modules highlighting its usefulness for biosecurity and AI safety research.
Review 1
General notes
* Interp techniques to see whether refusals are robust or easily bypassable seem broadly useful. The methods seem somewhat shaky – the feature catalog is populated purely based on statistical selection (unclear whether these distinctions are meaningful), the contrastive SAE seems not to have achieved the separation it was optimized for, etc. It is difficult to know what conclusion to draw from the specific technical approach implemented here.
* The finding about the Schedule I psychedelic probe is the most interesting, suggesting some kind of representation of cultural tabooness.
* Writing is quite LLM-y and therefore a bit sloppy/annoying to read. Very dramatic, lots of "It's not X, it's Y", colons, etc. Less text but with more substance would be preferable. Main findings/takeaways are quite buried. Policy connections and review/understanding of previous literature seem surface level.
* Would have been nice to see some example prompts or the method behind prompt generation.
* Recent relevant work the author may be interested in: https://securebio.org/biotier/
Small notes
* (Intro P1) LAB-Bench is not designed to measure dangerous biology proxies, just general bio knowledge.
* (Related work, policy framing) Not sure how this work addresses "tiered access" from Yassif and Carter? It's a refusal benchmark. Tiered access is quite a different system.
Review 2
This started out as a pretty compelling research project that clearly described a big problem (i.e. the difference between model behaviour in terms of the chat output vs the underlying activations), and the major findings were laid out well in the Abstract. Although the technical scope is a little outside my expertise, I found that the rest of the report was extremely jargon heavy and seemingly reliant on LLMs to generate the text, which made me suspicious about the analysis of the results. Overall this ‘felt like’ a good contribution to interpretability w.r.t. biosecurity, but a more technical judge with familiarity with interpretability would have a fairer view. Some pointers: some explanation in the intro was clunky (‘The question is: when a model refuses, does it can’t, or does it merely won’t right now’ → I think I get what you mean, but a few phrases were pretty confusing). The more I read of this document, the more the text was either extremely jargon heavy or LLM-flavoured (e.g. the entire Related Work section, and particularly the ‘Policy framing’ section, seemed pretty LLM-coded). Then in the results and limitations/discussion, whole passages seemed very LLM-y to me, which made me question the appraisal of the results. The D metric is pretty important, but not particularly well explained. You formally define it, but that’s not that helpful to non-AI/maths types. Even for a somewhat AI-literate person, I wasn’t left with much intuition about what D is actually measuring, or how novel or valid it is (I imagine something similar has been done for other interpretability efforts?).
Review 3
I like how sharp the contribution of this methodology is. I would suggest restructuring the framing, perhaps leading with the behavioral findings (format gating, Schedule I findings, etc.), since they are more robust than the metric-D centerpiece framing. I also think the single-model-family limitation is more significant than it is given credit for; the mechanistic claims made here rest on only one model. The dual-use consideration is well handled.
Multi-signal pandemic surveillance combines wastewater, search-query, information-seeking, and clinical signals under the assumption that adding sources improves detection. We test this assumption across the multi-year endemic transition of COVID-19 and contrast against influenza. Across attention-based signals (Google Trends, Wikipedia) we find 5–23× variance compression after the first major COVID-19 wave but no compression across flu seasons; wastewater is the only signal type with stable variance across both diseases. The diagnosis is a novelty cycle in public attention specific to emerging pathogens. We additionally report a corpus-scale negative result on LLM-prompt surveillance using WildChat-4.8M (3.2M conversations), and a system-design proposal for privacy-preserving aggregate releases.
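A minimal sketch of the variance-compression measurement as I read it, assuming a weekly attention series split at the end of the first major wave; the split index and windowing are illustrative choices, not the authors' exact procedure:

```python
import numpy as np

def variance_compression(series: np.ndarray, split: int) -> float:
    """Ratio of pre-split to post-split variance for one attention signal;
    ratios of roughly 5-23x correspond to the compression reported for
    Google Trends and Wikipedia after the first major COVID-19 wave."""
    early, late = series[:split], series[split:]
    return float(np.var(early) / np.var(late))
```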
Review 1
This was a really exceptionally executed and presented project, with some very clear explanations and compelling data visualizations. It was also an impressive amount of work to have achieved in the timeframe. On style and presentation, my only gripe was that the work was structured and began quite formal and academic, but then transitioned to a more 'bloggy' tone. Though the clear explanations and signposts made up for that. Other points of improvement were 1) true relevance to the problem in the Track and 2) the methodology/results felt somewhat overcomplicated/contrived. On point 1), the work mainly considers how signals change during an outbreak that became much less catastrophic over time, rather than contributing to a true early-warning system or initial detection. This is useful from a surveillance perspective, but is quite a bit less marginally useful than if the work addressed pre-outbreak warning systems. It would have been interesting to test whether there were any signals that reflected the changing virology/epidemiology of the COVID-19 variants (e.g. do any of the input signals reflect changing symptoms/lethality/transmissibility for which we have good data for a given variant). On point 2), while impressive, the project often felt like it described quite 'common sense' findings, e.g. people got 'COVID-fatigued' and stopped searching things on Google. There was a lot of fairly advanced mathematics and statistics in this study that was outside of my expertise, so I would have liked to see more application of Occam's razor - some simpler descriptive statistics might have painted a similar story and been more accessible to a wider audience.
Review 2
Report shares the finding that online user signals of attention to a particular pathogen do not reflect ground-truth levels of incidence in the long run. Report notes that the historical decay of public attention to COVID-19 relative to its incidence appears to be a phenomenon stemming from its prior novelty as an emerging pandemic. Report highlights that wastewater surveillance remains a reliable signal of a pathogen's incidence, unlike online signals of attention to the pathogen, which may not always correspond. Code repository link in the PDF submission did not work, at least on the submission review platform. Repository could not be found on the submission author's personal GitHub page. Report appears generated at least in large part via LLM assistance, though a note on how AI technologies were used in the submission was not included. Implications of the report's findings are not highlighted enough; with a document of this length, it is extra important to make the take-aways clear.
Today, any curious mind can open a laptop, design a novel enzyme, order it synthesised, and have it on a bench before any registry knows it exists. This is an extraordinary scientific advancement, but without the right infrastructure it is a biosecurity problem waiting to compound. Generative AI is producing novel proteins and genes faster than the field can catalogue, evaluate, or attribute them. No shared infrastructure exists to distinguish AI-designed sequences from naturally occurring ones, screen them for biosafety, or credit their creators. ArtGene-Archive (artgene-archive.org) is the first dedicated registry for AI-generated biological sequences. Every submission passes an automated three-gate biosafety pipeline, receives a cryptographically signed certificate anchored to a tamper-evident audit log, and is issued a citable Registry ID. Built on experience at the European Genome-phenome Archive and grounded in emerging AI biosafety research, this dedicated archive solves a specific structural gap: provenance and safety certification at the point of design, not the point of discovery. What it needs now is what GenBank needed in 1982 - institutional commitment, knowledge contribution, and collective adoption.
Review 1
This is a well-presented and executed project. However, despite the very personal description of the motivation, I struggle to see how this project would specifically help prevent the misuse of AI-enabled biological design. ArtGene-Archive is built as a voluntary database that can help scientists take ownership of their projects, and the built-in biosafety screening is commendable for avoiding accidental biosafety mistakes. Malicious actors would likely just not use the platform, and I do not see a strong pathway by which this platform lowers malicious actors' ability to access dangerous pathogens or otherwise reduces the likelihood of deliberate release of a biological agent. Without such additional justification I believe that this project is off-topic for this hackathon, despite its excellent execution.
Review 2
This is an impressively creative proof of concept that elegantly demonstrates how protein language models, biosecurity screening, watermarking, certification, cryptography, and blockchain could be combined to build an infrastructure for attributing AI‑designed biological sequences. It’s remarkably well put together, especially given the short timeframe of a hackathon. In an ideal world, a system like this should already exist. Unfortunately, reality is far more complex, and there are numerous practical, technical, and governance challenges that make deploying such an approach extremely difficult. And if even parts of this vision eventually become feasible, they could contribute to stronger biosafety, though not necessarily biosecurity. Even so, it represents a meaningful step in the right direction.
Review 3
Feels like a solution looking for a problem.
Frontier language models are increasingly being used in biology research contexts, but most of them rely on safety fine-tuning as the primary defense against misuse. We wanted to know whether that defense actually holds up when you change how a harmful request is framed rather than what it is asking. To test this, we built a simple evaluation framework and ran 15 prompts across three categories (benign biology questions, direct misuse queries, and adversarially rephrased versions of the same misuse queries using professional and academic framing) against both Claude Sonnet 4.6 and GPT-4o with no system prompt. We found that adversarial rephrasing eliminated all full refusals across both models. GPT-4o was significantly more permissive under adversarial framing, fully complying with 3 of 5 rephrased queries compared to 1 of 5 for Claude, meaning an audit using only direct queries would reach the wrong conclusion about which model is safer. Both models fully complied with an anthrax weaponization query when it was framed as historical journalism research. We also attempted automated labeling using a second Claude instance as a judge and found only a 10 to 20 percent success rate, suggesting models apply their own safety heuristics when evaluating bio-relevant content and produce responses that break structured output format. We release our task set and evaluation framework to support future biosecurity auditing work.
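To make the audit design concrete, here is a sketch of the kind of evaluation loop the abstract describes; `query_model`, the labeling rule, and the prompts are invented stand-ins, not the released framework's actual interface:

```python
# Sketch of a direct-vs-adversarial refusal audit loop. query_model, the
# labeling rule, and the prompts are stand-ins, not the released framework.
from collections import Counter

def query_model(model: str, prompt: str) -> str:
    """Stand-in for an API call returning the model's response text."""
    return "I can't help with that."  # placeholder response

def label_response(text: str) -> str:
    """The study labeled manually; a naive keyword rule stands in here."""
    return "refusal" if "can't help" in text.lower() else "comply"

tasks = {
    "direct":      ["<direct misuse query 1>", "<direct misuse query 2>"],
    "adversarial": ["<same query, journalism framing>", "<same, academic framing>"],
}

for model in ["model-a", "model-b"]:
    for category, prompts in tasks.items():
        tally = Counter(label_response(query_model(model, p)) for p in prompts)
        print(model, category, dict(tally))
```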
Review 1
Overall, the project is well scoped and clearly communicated, although not that innovative. The paper tests how different questions and their framings (adversarially disguised, direct, harmless) impact refusals, finding that adversarially disguised questions are more likely to be answered. This has been done in other contexts, including recently for bio (https://securebio.org/biotier/). The methods could have been improved by using a newer model than GPT-4o (this makes the results less relevant) and by evaluating each model on each question multiple times, capturing some of the seeming stochasticity in refusal behavior. Misc notes: * Appreciate the manual labeling of prompts – gives me more confidence in the results. * For the "partial" response category, it would have been nice to expand. Plausibly, the safety policy is actually quite good at letting models answer misuse-relevant questions in a manner that poses little direct risk (e.g., too vague to be practical) while not refusing outright. That seems like potentially the right choice?
Review 2
This is useful and informative work but does not add much that is novel. It does show that certain adversarial prompting approaches can bypass model-level refusals, but, as pointed out in the 'dual use concerns' section, this is fairly well known in the field already. It does provide some data around that on two frontier-level models and highlights how direct vs adversarial framing can change conclusions, but this is not highly novel. More prompts across more models would strengthen it. Perhaps the argument in the work is that evaluations of this kind are important and should be performed more? They are performed, but much of that work is not public. The write-up was clear and easy to follow. The tables were useful, but graphs could also have been nice to have for quick skimming and to summarize data. It is listed as a limitation that there was no system prompt, but that is more a strength in my mind. Many biosecurity-relevant evals are done with no system prompt to gauge model safety at the weight level, or performed with and without to compare. Either way, that was the right approach for this paper and not a limitation.
Review 3
Clean, honest work. The cross-model finding is the most valuable contribution: on direct misuse GPT-4o looks slightly safer than Claude, but under adversarial rephrasing GPT-4o is significantly more permissive, with 3/5 full compliance versus 1/5. An audit using only direct queries reaches the wrong conclusion about which model to deploy. That's actionable and undersold. The LLM-as-judge failure is a real secondary finding. Models applying safety heuristics when asked to evaluate bio-relevant content, breaking structured output format instead of returning JSON, is a practical obstacle for scaling this methodology and appears to be a general pattern, not a one-off. The no-system-prompt condition is the main constraint. This is a worst-case baseline that doesn't reflect most real deployments. The findings need replication under realistic operator conditions before they become deployment recommendations. 15 prompts, single annotator, subjective partial/comply boundary. The paper doesn't overclaim. The framework release is the right call, a repeatable audit tool has more lasting value than the specific findings at this sample size.
DNA Provenance Passport is a “code-signing for DNA” prototype that lets synthesis providers verify whether a DNA design came from a trusted researcher, remained unchanged after signing, and should proceed to normal screening or be routed for review.
Review 1
You identified a specific challenge for biosecurity and describe an innovative solution - effectively modernising customer screening approaches, rather than relying solely on order screening. This is a genuinely interesting and innovative approach. I particularly appreciated your consideration of sensitivities over the confidentiality of proprietary sequences as well as the application to benchtop synthesis devices. Both are particularly innovative and address real-world challenges. I would like to have seen more evidence to support the claims of codon shifting/optimization effectively circumventing sequence homology approaches. I think using tools to engineer proteins to retain function but with a significantly different sequence has been demonstrated in various publications. Key references to these are missing.
Review 2
Please never ever typeset an entire paper in italics. Literally half of the submissions I've reviewed were entirely set in italics, and I suspect it's because everyone was working from a submission template which used italics to give instructions about what to say in various sections, and you just inherited that formatting. But you should be more vigilant; please don't make reviewers' eyeballs bleed. :) This paper feels like all the hard problems are pushed out of scope. A variety of [citation needed]: (a) p.3. "as we demonstrated" where? (b) "codon optimization can make it fall below threshold" isn't true. There are screening solutions which look at the AA's and not the nucleotides and therefore codon optimization doesn't affect them. (c) You cite SecureDNA but seem to miss that there doesn't *necessarily* have to be a tradeoff between screening and privacy, especially for orders which do not contain hazards. (d) Calling IBBIS's CM "state of the art" requires some citation backup. There are many screening solutions out there. On page 4, your "verified attribution" is basically what SecureDNA's "verified screening" mode does already; this paper is just a recapitulation of that mode. You claim to use ECDSA, but give no justification for why that in particular. DSA in any form, including ECDSA, is *incredibly fragile* and very easy to get wrong; in particular, reusing any nonce exposes the key. There are a number of much better choices which do not have so many sharp edges, so the choice of ECDSA here is curious and unwise. You didn't fill out any of the Appendix or (especially!) the "LLM usage" sections at all. This is both careless and also means that it's not possible to know how much of this paper was LLM-generated. You also didn't remove the template instructions from "Code and Data" and from "References."
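On the reviewer's point about fragile nonce handling: deterministic signature schemes such as Ed25519 have no per-signature nonce to reuse. A minimal sketch with the Python `cryptography` package (the payload format is invented for illustration; the submission's actual signing flow may differ):

```python
# Minimal sketch: nonce-free design signing with Ed25519 via the
# cryptography package. The payload layout is hypothetical.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature
import hashlib

researcher_key = Ed25519PrivateKey.generate()
design = b">construct-1\nATGGCT..."          # DNA design as submitted
digest = hashlib.sha256(design).digest()      # sign a hash of the design

signature = researcher_key.sign(digest)       # deterministic: no nonce to reuse

public_key = researcher_key.public_key()
try:
    public_key.verify(signature, digest)      # raises if design was altered
    print("signature valid: design unchanged since signing")
except InvalidSignature:
    print("signature invalid: route order for manual review")
```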
The decreasing cost of DNA synthesis and advances in generative protein design raise concerns about evading biosecurity screening systems. We present BSS-Breach, a benchmarking framework for testing screening tools against both conventional sequence modifications and AI-generated variants. Using a pipeline of transformations, including synonymous substitutions, padding, splitting, and diffusion-based sequence generation, we conduct a penetration test against the biosecurity screening tool ComMec. Our results show that while standard manipulations are reliably detected, all synthetic variants generated via diffusion models bypass screening. This exposes a key limitation of current approaches, which rely primarily on sequence similarity. These findings highlight the need for next-generation screening methods, as well as benchmarking and pen-testing toolkits to evaluate them.
Review 1
The 'Related work' section helped to set the work in a wider context nicely. Seems like a valuable contribution to efforts in the space. Maybe it's not super original, but I'd really like to see the work expanded.
Review 2
This is a useful and grounded project. You focus on evaluation rather than speculation, and you demonstrate a concrete weakness in current screening systems. The comparison between traditional manipulations and AI-generated variants is especially effective: it clearly shows where defenses break. The main limitation is experimental depth. Right now, the result is binary: detection vs. bypass. To make this more impactful, you should quantify performance. Report detection rates, sample sizes, and variability. Show whether some generated sequences are closer to detection thresholds than others.
Sequence screening has matured faster than portable requester authorization. Whether a requester has been reviewed and remains authorized to make the request is still answered ad hoc at every provider, AI-bio tool, and equipment vendor. KYR-Bio is a local-first prototype of that missing layer: a reviewed, scoped, holder-bound researcher authorization that AI-bio tools, synthesis checkout flows, and benchtop DNA synthesizers can verify locally without re-sharing the applicant dossier. A single human-reviewed decision is packaged into a scoped, signed credential that a researcher's wallet presents to any participating relying party. Verifiers run schema, signature, issuer governance, Bitstring Status List freshness, holder proof and challenge binding, and scope checks before a policy adapter applies local rules. Audit events carry a hash chain and rolling Merkle root and exclude raw biological prompts, sequences, and reviewer notes. The synthetic evaluation passes 8 persona, 22 verifier, and 18 AI-assistance cases.
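To make the verification order concrete, a minimal sketch of the check-then-policy flow the abstract describes; the predicates and presentation format below are invented stand-ins, not the prototype's real interface:

```python
# Hypothetical sketch of the verifier ordering described above; the check
# predicates here are trivial stand-ins, not the prototype's real logic.
def verify_presentation(presentation, checks, local_policy):
    """Run credential checks in order; only then apply local policy rules."""
    for name, predicate in checks:
        if not predicate(presentation):
            return f"reject: failed {name} check"
    return local_policy(presentation)

# Illustrative use with stand-in predicates (a real verifier would implement
# schema validation, signature verification, status-list lookup, holder and
# challenge binding, and scope checks).
presentation = {"schema": "kyr/v1", "scope": ["order-dna"], "revoked": False}
checks = [
    ("schema", lambda p: p.get("schema") == "kyr/v1"),
    ("status", lambda p: not p.get("revoked")),          # status-list freshness
    ("scope",  lambda p: "order-dna" in p.get("scope", [])),
]
print(verify_presentation(presentation, checks,
                          local_policy=lambda p: "accept: forward to screening"))
```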
Review 1
This paper is a helpful specification document for what a KYC scheme could look like in reality. However, the core idea presented in the paper is not technically novel, and more importantly, is not the crux towards solving the problem of customer screening. If I understood the presented approach correctly, the main advantage over relying on existing authentication methods is mostly some privacy improvement. While these could help increase uptake of KYC schemes, they do not inherently address security issues. Additionally, the paper could be improved by being more concise.
Review 2
This project is addressing a meaningful gap and it is quite comprehensive for a hackathon. It is great that the author thought about KYC automation while keeping robustness and avoiding redundancies. It is good systems-level work, with a nice layer of portable credentials, separation of roles, and the decision to have local verification with verifier-specific policy. I also liked the end-to-end prototype from onboarding to audit. I would be curious to see this system upscaled in real life. Aspects which need improvement include: a better rubric for who the trusted issuers are, how trust is bootstrapped internationally, and how disputes are handled. The evaluation metrics also seem somewhat limited: “22/22 cases pass” is promising but not very informative, as it lacks false positive/false negative analysis, usability, or latency measurements.
Review 3
This was a lot of work for a solo build, nice! The most useful next step is probably a live deployable demo with one real participating verifier, even just a single AI-bio tool's auth flow since currently everything is local + synthetic personas.
We present BioClaw, a modular agent framework for biological engineering workflows. We demonstrate it by converting human insulin (UniProt P01308) into validated, expression-ready DNA constructs for E. coli. Starting just from a protein accession, the pipeline retrieves the sequence, optimizes codons, assembles expression cassettes, builds annotated constructs, and runs seven-point validation. The insulin construct passes all checks after one remediation round. BioClaw is built on LangGraph as a two-level state graph with composable nodes, typed state, human-in-the-loop checkpoints, and a full audit trail. The core pipeline is deterministic: an LLM is only invoked as a bounded escalation agent when rule-based remediation fails. New validation checks, pipeline steps, or expression hosts can be added through a repeatable extension pattern without restructuring the graph. We discuss the properties that make this approach well-suited for safety-critical biological workflows and show that composability, auditability, bounded AI agency, and testability emerge naturally from the graph-based architecture.
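The deterministic-core-with-bounded-LLM-escalation pattern is the architecturally interesting claim here. A generic sketch of that control flow (all names and the remediation rule are invented for illustration, not BioClaw's actual code):

```python
# Generic sketch of the deterministic-pipeline-with-bounded-LLM-escalation
# pattern. All names and logic are invented, not BioClaw's actual code.
MAX_LLM_CALLS = 1  # bounded agency: the LLM gets at most one remediation shot

def validate(construct: dict) -> list[str]:
    """Deterministic validation; a real pipeline would run seven checks."""
    return [] if construct["cds"].endswith(("TAA", "TAG", "TGA")) else ["missing_stop_codon"]

def rule_based_remediation(construct: dict, failure: str):
    """Deterministic fixes tried first; returns None if no rule applies."""
    if failure == "missing_stop_codon":
        return {**construct, "cds": construct["cds"] + "TAA"}
    return None

def llm_escalation(construct: dict, failure: str) -> dict:
    """Stand-in for the bounded LLM agent; output is re-validated afterwards."""
    return construct  # placeholder

construct, llm_calls = {"cds": "ATGGCT"}, 0
while (failures := validate(construct)):
    fixed = rule_based_remediation(construct, failures[0])
    if fixed is not None:
        construct = fixed
    elif llm_calls < MAX_LLM_CALLS:
        construct, llm_calls = llm_escalation(construct, failures[0]), llm_calls + 1
    else:
        raise RuntimeError("escalate to human-in-the-loop checkpoint")
print("validation passed:", construct)
```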
Review 1
A very competent combination of existing tools to partially automate a multi-step process. Motivation and presentation are pleasantly clear. Scope is limited: one expression organism, mostly deterministic plasmid structure, etc. From the report, it's not clear how important the non-optimized features are. The demo report is slick, but I wonder how much the results are aided by the chosen demo protein being exceptionally well-studied.
Review 2
Very interesting approach, from a general bioscience / bioengineering perspective. Unfortunately, it does not directly address biosecurity or biorisk, so I am not qualified to speak to its utility in the more general context.
Current biosecurity protocols suffer from two fundamental vulnerabilities: 1) Obsolete checks: reliance on sequence-matching algorithms like BLAST, which are easily bypassed by novel AI-generated sequences; and 2) End-point-only screening: by waiting to apply these checks until a physical order is placed at the commercial sink, defenders forfeit the ability to intercept these attacks at their digital generation phase. We propose moving security guardrails upstream into the generative pipeline itself. Our approach is two-fold: 1) mapping two attack paths: structural design via diffusion models and genomic sequence generation via DNA foundation models; and 2) showcasing the successful use of interpretability techniques to implement security guardrails on upstream components of the attacker’s pipeline. As a functional proof of concept, we trained an L1-regularized probe on ProteinMPNN’s decoder representations, successfully classifying pathogenic geometric intent prior to sequence translation (0.86 ROC-AUC). Crucially, the learned representations track generalizable biological threats rather than family-specific or physical heuristics: our probe correctly classified distinct, structurally unseen toxins (ricin etc.). We conclude that securing generative biology requires representation-level intervention and outline an architecture to scale these guardrails via sparse autoencoders and residual stream monitoring.
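For readers unfamiliar with linear probes, the basic recipe is: extract per-design activations from the model, then fit a sparse linear classifier on top. A minimal sketch with scikit-learn, where `X` is a random placeholder for pooled decoder representations (the extraction step itself is assumed):

```python
# Minimal linear-probe sketch. X is a random placeholder standing in for
# pooled ProteinMPNN decoder representations; labels are placeholders too.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 128))      # placeholder activations (400 designs)
y = rng.integers(0, 2, size=400)     # 1 = pathogenic-intent label, 0 = benign

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The L1 penalty drives most activation dimensions to zero weight, leaving a
# small set of candidate "threat-relevant" dimensions to inspect.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
probe.fit(X_tr, y_tr)

scores = probe.predict_proba(X_te)[:, 1]
print("ROC-AUC:", roc_auc_score(y_te, scores))
print("nonzero dims:", np.flatnonzero(probe.coef_[0]).size)
```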
Review 1
Re the argument on end-point-only screening: it’s a fair point, but the reason for this is that it’s a key physical chokepoint. How would you actually implement upstream screening during the generation of novel sequences when models like Evo2 are open source? A discussion of how these techniques could be implemented in practice is missing and would’ve been really interesting + valuable. How good is a 0.868 ROC-AUC actually? Are there any relevant comparisons that put this into perspective in an intuitive way? The evidence of cross-family biological generalization is great and intriguing. I am curious to hear hypotheses for why dim 103 activates for these toxins (and why for human insulin). Overall I think this interpretability approach to genomic language models and sequence design models is really interesting and promising! Good work. IMHO your submission is a bit too biased toward arxiv-style academic writing. That's great for a very particular researcher audience, but as a judge who is more in policy-world, it's a bit hard to follow. I think you could take inspiration from the writing of research outputs at places like METR or GovAI, who do a great job at writing with rigorous clarity that is still accessible to non-experts.
Review 2
This project clearly describes the problem being addressed and proposes a mitigation framework, with the video demonstration neatly showcasing the GUI. Thought was clearly put into curating the datasets, with deliberate attempts to address potential flaws such as models identifying any binding interface as malignant. I would like to congratulate the authors for this piece of work, well done! That said, the benign dataset is limited to only three subclasses of targets, and the set of malign proteins used in training and testing is limited. For instance, only viral glycoproteins and toxins are considered. This limited coverage constrains the relevance of the findings. Failures across the small number of proteins tested also indicate high false positive and false negative rates. Given that "the successful use of interpretability techniques" is the project's main conclusion, the supporting empirical evidence is relatively weak, and this should be acknowledged in the report. While this is understandable given the time constraints, the claims and proposals are stronger than the technical evidence presented.
Review 3
This project has a strong and relevant core idea. Moving guardrails upstream into generative biology pipelines is the right direction, and using representation-level signals is a smart approach. The proof-of-concept with a probe on model activations is a good starting point, and the generalization claim is particularly interesting. However, the current work feels like an early prototype rather than a robust demonstration. You need to tighten the experimental story. Define what “pathogenic intent” means in measurable terms. Compare against existing screening approaches to justify the shift upstream. Stress-test the system against adversarial or ambiguous cases, since attackers will adapt.
SynthGuard + BioLens is an AI biosecurity system for discriminative DNA/protein synthesis screening and auditable analyst triage. SynthGuard improves beyond identity-only sequence matching by using k-mer, codon-usage, codon-adaptation, amino-acid, physicochemical, and ESM-2 embedding features to better separate hazardous from benign sequences. In verified BLAST+ benchmarks, SynthGuard substantially reduced false positives while retaining strong hazardous recall, including on out-of-distribution hazardous families. BioLens adds the operational layer: a nine-surface dashboard for screening intake, case review, operator-curated intelligence alerts, watchlists, automation, audit logs, analytics, and reports. It preserves raw model scores while applying bounded intelligence-aware triage modifiers for transparent review. Together, SynthGuard and BioLens address both sides of synthesis biosecurity: stronger technical screening and a practical workflow for human analysts.
Review 1
I like the problem area and the use of real BLAST baselines. Future work could benefit from evaluating chimeric constructs and de-novo backbones to further stress-test the screener's resilience.
Review 2
The multi-layered system is thoughtful and intelligently provides information on what layers are actually important to attaining optimal results for functional screening. You guys did good work for a two-day hackathon and could feasibly turn this into a great open-source tool for synthesis screening with more effort. If interested in pursuing this line, I would research the extant functional screening pipelines and record what variables actually led to high performance in the different versions of SynthGuard (i.e. post-training on Brucella abortus, adding ESM-2 embeddings).
Review 3
The main weakness of this project is that its central elements, the DNA and protein classifier models, are barely described at all. It's impossible to tell whether there's anything particularly novel involved. The system's performance comparison with baseline metrics is uninformative, because insufficient information is given about the baseline approach. The numbers presented for the baseline are absurdly bad, and don't plausibly represent what real current screening technology does, so this looks like a "straw man" comparison. More information about the datasets is necessary. From what source(s) were the sequences drawn, and how? Are they representative of authentic synthesis orders (which are dominated by synthetic and heavily engineered constructs, and mix protein-coding segments with promoters and other components), or natural genes? Why were all hazardous examples in the DNA training set drawn from toxins, and not pathogens? I did not pay much attention to BioLens, or to the description of the API and model fallbacks. None of that seems particularly innovative, and infrastructure/UI development is secondary to devising and validating effective screening methods.
Current DNA synthesis screening systems rely primarily on sequence homology to detect biosecurity threats, creating potential vulnerabilities to sophisticated evasion strategies. We developed a complementary protein embedding-based screening approach using ESM2 to detect functionally similar but sequence-diverse threats. Using ProteinMPNN, we generated toxin variants with below 60% sequence identity to known toxins while preserving 3D structure. Our key finding is that these sequence-diverse variants cluster significantly closer to original toxin families than to neutral proteins in ESM2 embedding space, suggesting that functional relationships are more conserved in embedding space than sequence space. This demonstrates the potential for embedding-based screening to identify threats that evade traditional homology-based detection. We built a two-layer screening pipeline combining SecureDNA with ESM2 similarity analysis and conducted preliminary evaluation on the NIST nucleic acid synthesis screening dataset. Our results indicate that multi-modal screening approaches could provide more robust biosecurity coverage against advanced evasion attempts.
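To make the embedding-space idea concrete, a sketch of what a second-layer check could look like; `embed` is a stand-in for mean-pooled ESM2 embeddings, and the sequences and threshold are invented:

```python
# Sketch of an embedding-space screening check. embed() is a stand-in for
# mean-pooled ESM2 embeddings; sequences and the threshold are illustrative.
import numpy as np

def embed(sequence: str) -> np.ndarray:
    """Stand-in: a real system would run ESM2 and mean-pool token states."""
    rng = np.random.default_rng(abs(hash(sequence)) % (2**32))
    return rng.normal(size=64)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

known_toxins = ["MKTLLLTLVV...", "MASNFTQFVL..."]     # curated hazard set
toxin_centroid = np.mean([embed(s) for s in known_toxins], axis=0)

def second_layer_flag(query: str, threshold: float = 0.8) -> bool:
    """Flag if the query sits near the toxin centroid in embedding space,
    even when sequence identity to any single known toxin is low."""
    return cosine(embed(query), toxin_centroid) >= threshold

print(second_layer_flag("MREDESIGNED..."))
```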
Review 1
This project addresses an important gap in biosecurity, namely whether modern protein engineering methods are able to bypass DNA screening, and how to recognize them. It correctly poses the problem of recognizing that similarity of sequences does not imply similarity of function, which matches the actual challenge related to ProteinMPNN-type attacks on current systems. Using ProteinMPNN for generation, ESM2 for encoding, and SecureDNA as a base system is a good choice. The main result is strong: the identification of sequence-diverse variants clustering together with toxins in embedding space, especially the quantitatively strong finding of 2361/2440 variants falling near the toxin cluster. Showing that the SecureDNA system fails to recognize the redesigned variants (0% detection success) makes the results even more relevant, though there is a strong dual-use case for this methodology. The two-stage process of homology and embeddings positions embedding techniques as complementary to existing systems, not replacing them at all, which is great. Where it could be stronger: The core idea, that embeddings capture functional similarity beyond sequence, is already quite established. I also found the classifier a bit too simplistic, although this could be improved if the project were further pursued. Also, scalability and deployment are not deeply addressed. ESM2-650M embeddings are computationally expensive, and there is no discussion of throughput constraints or real-time feasibility projections. Lastly, I found the first image to be simply unnecessary.
Review 2
Good problem selection - we should be developing more AI-based tools for sequence screening. Cool two-component AI methodology in using ProteinMPNN to generate potential sequences and then generating embeddings for these sequences, which was a cool idea. It was a shame that more results couldn’t be produced. Couple pointers (apologies if it’s a bit terse):
Figure 1 was goofy and didn’t add much, and especially as you are recreating it from another publication I think this wasn’t necessary.
I had some confusion about what was in your dataset. Initially you said that you were using ‘toxins’ as test cases for dangerous sequences, which seemed reasonable. But then you afterwards mention several more ‘toxins’ such as SARS-CoV-2/Sindbis/Influenza. These are very much not toxins, these are viruses. Overall I was kind of confused about what was actually in your data.
The data don’t look particularly promising (?). The PCA plots sort of look like everything smooshes together to me, rather than clustering nicely as you suggest. And why would they cluster? The data selection doesn’t seem particularly well considered - I do not see what biological basis we could have to expect that a completely sequence-redesigned influenza NP protein would cluster in a similar space as a redesigned cholera toxin, or anything from the database you cite (https://www.uniprot.org/keywords/KW-0800). There’s no such thing as ‘toxin space’ vs ‘non-toxin space’, so the first PCA is a bit moot imo.
It’s always a good idea to ask 1) what is the simplest experiment I can do and 2) what is an obvious benchmark to test against. Making fancy LLM embeddings of proteins is…fancy. What would happen if you just made a one-hot encoding? Would the fancy high-dimensional embeddings actually perform better than that? You did better on point 2), using SecureDNA as a screening tool comparison. If I’m not mistaken, you don’t actually show any data on how accurately your embeddings could flag redesigned sequences? Seems like you ran out of time and ran into compute constraints - hope you manage to see the project through to more completion!
Limitations you missed/should have expanded on more: Ultimately, we don’t know whether ProteinMPNN is spitting out credible structures unless we make them and test them empirically. You mentioned that we cannot know whether end functionality (e.g. virulence) is retained, but I think we also have to be cautious about saying that the underlying 3D structure would be preserved, without much empirical evidence (and when we do have evidence, it is heavily curated). Could you pare things down to just select agents that are known to be screened against, rather than using SecureDNA with loads of toxins? You use ORFs, but there are plenty of ‘danger signal’ sequences not in ORFs - particularly in viral UTRs - that you wouldn’t (and couldn’t) have modelled here.
Review 3
You have identified a clear problem area and provided a conceptual solution. I do think more time could have been used either researching or referencing past work to complement yours, though. As examples:
- https://biolm.ai/models/biolmtox2/
- https://www.biorxiv.org/content/10.1101/2024.12.02.626439v1
- https://www.biorxiv.org/content/10.1101/2024.07.05.602129v1.full
Your paper was easy to understand, and you highlighted limitations and dual-use concerns, which is very important. I do think more caution should be used, as some claims, or the language surrounding them, were either unclear or exaggerated. As examples: for "variants with 0-60% sequence identity while preserving the original three-dimensional structure", it was not clear how 3-D structure was validated; and "provides strong evidence that ESM2 embeddings capture functional signatures" sounds very strong relative to no functional validation being done, given the short time you had.
Function‑Prediction Screening for Protein Hazards: ToxScreen uses protein language model embeddings to detect hazardous proteins by predicted function rather than sequence similarity, catching AI-designed toxin variants that evade current DNA synthesis screening. Evaluated on 10,021 proteins with homology-controlled splits, it achieves 0.999 AUROC and 96.7% detection at 1% false positive rate on sequences sharing less than 40% identity with training data.
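"96.7% detection at 1% false positive rate" is the true-positive rate at the score threshold where 1% of benign sequences are flagged. A minimal sketch of that computation (scores and labels are random placeholders, not ToxScreen outputs):

```python
# Sketch: detection rate at a fixed false-positive rate, from model scores.
# Scores and labels below are random placeholders, not ToxScreen outputs.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)         # 1 = hazardous
y_score = rng.random(2000) + 0.3 * y_true      # scores trend higher for hazards

fpr, tpr, thresholds = roc_curve(y_true, y_score)
idx = np.searchsorted(fpr, 0.01, side="right") - 1  # last point with FPR <= 1%
print(f"TPR at 1% FPR: {tpr[idx]:.3f} (threshold {thresholds[idx]:.3f})")
```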
Review 1
The pitch whitepaper reads more like a research paper with results than a hackathon proposal. Content and images seem decorative. Very high AUROC.
Review 2
Thanks for this submission! The empirical validation work has value. I think your report could benefit from succinctness, and fewer, higher-impact charts & diagrams. I've also docked some points in the innovation department, as I believe you have mostly built on top of existing tools and technology, but the tiny generalization-gap finding is exciting.
Review 3
Solid initial exploration into using embeddings to discriminate harmful from non-harmful protein sequences, with a very visually pleasing report. The descriptions, motivations, and methods are clear and convincing. It would have been nice to see a comparison to a purely sequence-based baseline (BLAST or similar) rather than, or in addition to, the average-property physicochemical baseline. The claims about generalization are compelling but not fully substantiated; the ESM-based model seems not to be just memorizing sequences, but that doesn't mean it has learned the full notion of danger we care about; it may be learning more general characteristics unique to the training data. The report is very nicely arranged and readable, but would benefit from a review pass to remove AI-human collaboration seams.
A gradient-boosting classifier for fast, short-read DNA threat detection. Uses a hybrid DNA/protein k-mer approach and reverse-screening post-filters to accurately identify hazardous sequences while generalizing to novel organisms.
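A sketch of the general recipe (k-mer counting feeding a gradient-boosting classifier); the sequences, labels, and k are toy placeholders, and the project's hybrid DNA/protein featurization and reverse-screening post-filters are not reproduced here:

```python
# Sketch of k-mer featurization + gradient boosting for short-read
# classification. Sequences, labels, and k are toy placeholders.
from itertools import product
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_counts(read: str) -> np.ndarray:
    """DNA k-mer count vector; a hybrid version would append protein k-mers
    from translated frames as extra features."""
    v = np.zeros(len(KMERS))
    for i in range(len(read) - K + 1):
        kmer = read[i:i + K]
        if kmer in INDEX:
            v[INDEX[kmer]] += 1
    return v

reads = ["ATGCGTACGTTAGCATGCAAAT", "TTTTAAAACCCCGGGGATATAT"] * 50
labels = [1, 0] * 50                            # 1 = hazardous (toy labels)
X = np.array([kmer_counts(r) for r in reads])

clf = GradientBoostingClassifier().fit(X, labels)
print(clf.predict_proba([kmer_counts("ATGCGTACGTTAGC")])[:, 1])
```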
Review 1
I would’ve appreciated a discussion in the intro of why screening below 30bp might matter in practice. Is it a plausible threat vector that malicious actors could stitch together 25bp DNA pieces to boot up dangerous pathogens? Doesn’t that take way too long? Due to this I’m a bit skeptical of the utility of screening <30bp in practice (esp. considering the risk of false positives), but your approach is certainly interesting and it’s good to know that screening below 30bp is plausible if this turns out to be an important threat vector.
It would be nice to have a comparison of how good an AUC of ~0.8 actually is. How does this compare to SecureDNA/IBBIS? (Maybe you can’t assess their AUC.)
This is really interesting if true! “The protein features tell a different and more biologically defensible story. Of total feature gain, 73% comes from protein k-mers and 27% from DNA k-mers. Top protein k-mers include hydrophobic and aromatic clusters consistent with transmembrane and aromatic-binding regions of bacterial proteins, suggesting that the model is learning real biology at the protein level even while relying on compositional bias at the DNA level.”
It’s a valuable finding that the classifier fails on phylogenetically distant organisms. However, I think the problem here is false positives, right? And false positives are likely the most costly aspect of implementing a screening mechanism for companies, since they need to manually investigate flagged positives. Would be good to flag this in the discussion.
I would’ve liked to see a discussion of the utility of this approach for engineered pathogens. When screening pathogen sequences that are modified, do they fall out of distribution and go undetected? That would be a critically important false negative. The over-representation of and limitation for viral sequences is a shortcoming, given the importance of engineered viruses as a key threat pathway. Could this be fixed with more viral training data?
IMHO your submission is a bit too biased toward arxiv-style academic writing. That's great for a very particular researcher audience, but as a judge who is more in policy-world, it's a bit hard to follow. I think you could take inspiration from the writing of research outputs at places like METR or GovAI, who do a great job at writing with rigorous clarity that is still accessible to non-experts.
Review 2
While I'm not sure that short/oligo sequences are as big of a risk as people currently believe, this seems like a positive contribution to the space.
We propose Sentinel Atlas, a centralized platform that aggregates fragmented, multi-source pathogen and wastewater surveillance data into a standardized, open-access infrastructure to enable real-time epidemiological forecasting and proactive pandemic defense.
Review 1
- Include an example of a finding that was uniquely possible through your platform. That way you can show a concrete application and highlight the marginal value this tool provides.
- Try to make the UI on the website a bit more intuitive - I played around with it for a bit but was never able to get to the map view.
- Small thing, but follow the WHO recommendation to say mpox over 'monkey pox'.
- A lot of DC policymakers are talking about the need for the integrated biointelligence you provide, so that's great!
Review 2
The project correctly identified that it's hard to get any kind of harmonized pandemic data that has useful granularity. I also think the predictions element could work very well, although you have to be careful with the quality threshold you set for these. I really liked the live/stale status tagging, as many of these systems are only maintained for a short time. I think this sort of thing has been tried a couple of times (EpiGraphHub, Unified COVID-19 Dataset), so engaging with why previous efforts stalled would substantially strengthen the contribution. You don't really make an argument for data fragmentation being the primary problem, you just make the claim. Unpacking this would be useful.
Review 3
Solid infrastructure work addressing a real problem. The honest framing is that this is a proposal with early infrastructure, not a validated platform. The crowdsourcing pipeline is the most novel contribution but is unvalidated. There is no demonstrated end-to-end test — no submitted forecast, no approved pull request, no leaderboard entry. The team could have dog-fooded this themselves: four team members submitting simple naive forecasts would have proven the pipeline works and produced a real result to show. Without that, it's infrastructure whose core function remains undemonstrated. The paper undersells what was actually built. A leaderboard scoring system, automated prediction repos, multi-pane workspace, and news feed are all in the repository but absent from the submission.
Benchtop DNA synthesizers are approaching viral-genome-length assembly within 2-5 years. This primary-source review of nine vendors and eleven device families finds that every device crossing the >=1.5 kb threshold runs on a proprietary reagent ecosystem, potentially enabling low marginal cost KYC regulation. Submitter notes the project may be revised before public publication. Track 4. Manual entry: submitted via Discord DM after Framer Form closed.
Review 1
I like the approach this submission takes because it gets to a novel implementation question, highlighting an additional chokepoint that could improve biosecurity. This submission could be significantly improved with more time, building out subsections that were teased but not yet completed.
Review 2
The submission correctly identifies that we will need new systems to address misuse in benchtop synthesizers if they become an accessible alternative to commercial synthesis providers. The proposed approach is one of many suggested previously and doesn't present a new solution or analysis.
Review 3
A well-researched report on the current state of benchtop DNA synthesizers and the commercial feasibility of cryptographic authentication for those powerful enough to produce viral-genome-length assemblies. Such biosecurity technologies may already be implemented but not disclosed publicly. The report could be stronger by (1) expanding on the implications of the research findings/offering clearer take-away messages, and (2) leaning into call(s) to action for what steps should be taken or explored given the results of the report.
Biosecurity access controls are typically mapped along linear bioweapon development cycles, but adversaries may route around controls by choosing among design methods, material sources, and providers. We built ABT-SIM, a web-based prototype for modeling bioweapon development pathways as directed graphs with probabilistic nodes. Users can create scenarios spanning design through deployment stages, assign detection probabilities and capability requirements to pathway nodes, and model conditional probability flows through logic gates. The tool provides basic analytics including control effectiveness comparisons and outcome impact visualization. While still a prototype, ABT-SIM illustrates how interactive modeling tools could support more systematic biosecurity analysis. The platform provides a foundation for exploring adversary biorisk pathways though significant development would be needed to reach operational level. The work demonstrates potential value in dedicated tools for biosecurity-relevant analyses.
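To illustrate the propagation the abstract describes, a toy sketch of success-probability flow through AND/OR gates (the graph structure and all numbers are invented for illustration). Note that the multiplicative AND aggregation assumes conditional independence between nodes, the limitation Review 2 below presses on:

```python
# Toy sketch of success-probability propagation through AND/OR gates.
# Node structure and probabilities are invented for illustration only.
import math

def combine(gate: str, child_probs: list[float]) -> float:
    if gate == "AND":   # adversary must succeed at every child step
        return math.prod(child_probs)
    if gate == "OR":    # adversary needs any one route to succeed
        return 1.0 - math.prod(1.0 - p for p in child_probs)
    raise ValueError(gate)

# Design succeeds via either an AI route or a manual route (OR),
# then must also evade synthesis screening (AND).
p_design = combine("OR", [0.4, 0.25])        # AI-design route, manual route
p_pathway = combine("AND", [p_design, 0.3])  # design AND screening evasion
print(f"end-to-end pathway probability: {p_pathway:.3f}")
```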
Review 1
The idea is certainly interesting; I could imagine it being useful to those identifying gaps in mitigations or evals. That being said, in practice we mostly develop mitigations/evals by identifying steps common to many scenarios and don't do that much of this quantitative modeling work (which could take more time than actually developing the mitigation). Also, I'm not sure about the sources of these numbers/percent risks, so I don't know how to trust this. For infohazard reasons, I also wouldn't publish steps in threat pathways in any more detail than has been done already. Nonetheless, it's an interesting idea, and some of this modeling work is useful for informing AI mitigations.
Review 2
The project occupies a real gap in the practitioner tool landscape: publicly accessible interactive tools for probabilistic adversary pathway modeling in the AI-bio context do not appear to exist elsewhere, and the choice to package event tree analysis with logic gates in a browser-based canvas with real-time recalculation is a real accessibility contribution, even though the underlying analytical method has decades of precedent in nuclear safety, cybersecurity attack tree modeling, and academic biosecurity PRA work. The current framing treats Sandia's BioRAMs and similar tools as point-solution oriented, but the substantive differentiator is accessibility and the AI-bio routing framing, probably not a new analytical paradigm.
The deeper issue is that the methodological core sits in tension with the use case in a way the limitations section does not engage with. Event tree analysis works well when pathway structures are closed and finite, when probability inputs are estimable from data, and when conditional independence approximately holds. Adversary modeling in biosecurity satisfies essentially none of these conditions: pathways are open and adaptive, probabilities are elicited rather than measured, and adversary behavior produces strong dependencies between nodes that the multiplicative aggregation model ignores. The result is that the calculated outputs (the 56.7 percent exposure number, the tornado chart rankings, the outcome impact deltas) carry a degree of false specificity that the underlying estimation cannot support, and a user who took these numbers as predictive rather than illustrative would be putting more weight on them than the methodology warrants.
Review 3
Working prototype with real-time ETA propagation through AND/OR logic gates, which is good execution for a hackathon weekend. The framing (adversaries route around controls rather than following linear pipelines) is correct, but event tree analysis over directed graphs is a mature methodology and the novelty here is mainly the biosecurity-specific scenario seeding and collaborative UX. Validate with two biosecurity practitioners on a live scenario before the probability outputs are used for anything decision-relevant.
Benji-Bio is a stress-test harness for evaluating AI-biosecurity safety monitors under prompt transformations such as paraphrasing, role-shifting, ambiguity, and fictionalized misuse framing. The project studies a specific evaluation failure mode: static or public benchmark prompts can overestimate safety when monitors learn obvious refusal patterns instead of robustly recognizing risky intent.
Review 1
The framing is good - I would suggest expanding to LLM-based monitors (even just Claude or GPT as a classifier with a structured prompt), or simply reframing the paper's scope as keyword-monitor evaluation. If the central claim is that transformations matter, the main finding should be the variant accuracy degradation, not the monitor ranking. The core idea is worth pursuing but the current implementation is still too thin.
Review 2
Benji-Bio is a small prototype benchmark for testing whether AI biosecurity monitors are robust to prompt transformations. This is potentially interesting if it can add automated, large-scale transformation generation, evaluate LLM-based monitors, and validate on a larger, expert-labeled dataset to demonstrate real robustness.
Review 3
Clear and thoughtful research, project, experiments, results, and presentation. A useful approach against the threat of prompt engineering to get around Biosecurity guardrails in large language models. In addition, the generated dataset can be helpful to others in developing solutions to this problem.
Meta-BioShield is a 6-layer Defense-in-Depth pipeline designed to catch AI-generated DNA biosecurity threats that legacy systems (like IBBIS) miss. Instead of relying on exact text-matching, it enforces strict physical biological constraints—like protein translation, host codon bias, and protease cleavage site detection—to trap pathogens regardless of how an AI scrambles the DNA "spelling." Built as a ready-to-deploy upgrade, it achieved a 0.969 ROC-AUC score on 44,000+ sequences and successfully caught 100% of the hardened AI evasion attacks in our testing.
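As an example of what a physical-constraint layer can mean in practice (screening what the DNA encodes rather than how it is "spelled"), a toy sketch using Biopython; the flagged peptide motif is a made-up stand-in, not a real hazard signature and not Meta-BioShield's actual rule set:

```python
# Toy sketch of a translation-layer check: screen what the DNA encodes,
# not how it is "spelled". The flagged motif is a made-up stand-in.
from Bio.Seq import Seq

FLAGGED_PEPTIDES = {"MKTIIALSY"}  # placeholder hazard motif (illustrative)

def translated_frames(dna: str):
    """All six reading frames as amino-acid strings."""
    seq = Seq(dna)
    for strand in (seq, seq.reverse_complement()):
        for offset in range(3):
            frame = strand[offset:]
            frame = frame[: len(frame) - len(frame) % 3]  # trim to codon multiple
            yield str(frame.translate())

def protein_layer_flag(dna: str) -> bool:
    """Synonymous recoding changes the DNA but not the translated protein,
    so a motif hit in any frame survives codon 'scrambling'."""
    return any(motif in frame
               for frame in translated_frames(dna)
               for motif in FLAGGED_PEPTIDES)

print(protein_layer_flag("ATGAAAACCATTATTGCGCTGAGCTAT"))  # encodes MKTIIALSY
```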
Review 1
0.969 ROC-AUC is impressive, but what does this result mean in practice? Are there any relevant comparisons that put this number into perspective in an intuitive way?
Which of the 6 layers are doing most of the work? How do they rank in their importance/contribution to the overall screening result? How did you choose these 6 layers? Why do you think each of them is helpful?
I’m surprised you were able to flag the Split-Order Anthrax PA if it’s only 18 bp? How was this achieved?
The integration into “IBBIS v2.0” seems very promising, but your submission contains very little context on how you did this.
I would’ve appreciated a discussion of how feasible this approach is to implement in practice. How much time/cost does it add to current screening pipelines like SecureDNA/IBBIS? Does it have important failure modes like too many false positives (how many?)? False positives are a critical shortcoming because they add a lot of cost to screening if synthesis companies need to manually check the flagged orders.
Why does “Hardened Test Suite Results” say “11/11 passed” when it falsely rejected the safe GFP? How did you come up with the 11 tests in the Hardened Test Suite? Which of these matter most and are worrisome threat vectors that need to be mitigated?
IMHO your submission is a bit too biased toward arxiv-style academic writing. That's great for a very particular researcher audience, but as a judge who is more in policy-world, it's a bit hard to follow. I think you could take inspiration from the writing of research outputs at places like METR or GovAI, who do a great job at writing with rigorous clarity that is still accessible to non-experts. The writing also feels too buzzword-y. (Very minor, but I found the PDF slightly annoying to read since all the text is in italics for some reason + markdown is not properly rendered.)
Overall an interesting submission with promising directions, but I struggle to tell which parts of the contribution, or which layers of the 6-layer stack, are actually helpful and valuable.
Review 2
I think this is a good attempt at multi-layer defense. While the multi-layer design is conceptually compelling, several layers are not statistically independent. Also, protein embedding models and RNA folding require pretty intensive compute, and this raises questions about real-world pipelines. Finally, I might be wrong, but it's not clearly stated how the train/test split is done, and I hope it's not split randomly.
Symphonon is a pandemic early warning system that detects epidemic onset by scoring the directional consistency of surveillance signals rather than their absolute magnitude. Instead of asking "is this signal unusually high?", it asks "has this signal been consistently rising across recent weeks?" — a trajectory score bounded between 0 and 1.
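A minimal sketch of one way such a bounded directional-consistency score could be computed (window length invented for illustration; Symphonon's actual scoring, fusion, and weighting may differ):

```python
# Toy directional-consistency score in [0, 1]: the fraction of recent
# week-over-week changes that are increases. Window choice is illustrative;
# Symphonon's actual scoring may differ.
def trajectory_score(weekly_values: list[float], window: int = 6) -> float:
    recent = weekly_values[-(window + 1):]
    diffs = [b - a for a, b in zip(recent, recent[1:])]
    if not diffs:
        return 0.0
    return sum(d > 0 for d in diffs) / len(diffs)

wastewater = [3.1, 3.0, 3.2, 3.4, 3.9, 4.6, 5.8]  # toy weekly signal
print(trajectory_score(wastewater))                # 5 of 6 rises -> ~0.83
```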
Review 1
The authors propose their method as an upgrade to a rolling z-score, but don't provide evidence that this is an active practice among disease surveillance organizations, who generally do more sophisticated things here. The methods seem thin and brittle, with seemingly arbitrary per-pathogen parameter choices. The validation does not come across as convincing.
Review 2
I appreciate how concise and candid this project's presentation is; it doesn't try to oversell, is honest about the limitations, and clearly explains what was contributed in the context of prior work. The biggest methodological issue I see here is the hand-chosen parameters. It seems to me like this creates room for the output to influence the parameter choice in a way that would essentially be overfitting, so it's difficult to know without further context whether this actually did outperform the z score baseline. This makes me a little sus of the validation, which ultimately seems too narrow to say much about how good the system would be elsewhere. We can only really see that it works in a few cases, where the parameters are hand-picked for each pathogen. There's a fair amount of complexity accumulating in the math that's not obviously justified to me i.e. I'd like to see some reasoning on why this would outperform a simpler system. General comment on impact: I think this kind of system is mostly useful for pretty slow outbreaks where there's a lot of time to respond to data and act (e.g. it takes weeks to get sustained movement of each indicator in the same direction), which significantly undercuts its usefulness for early-warning. The write-up does allude to this some, but it seems like this is not useful by construction in the situations you'd most want early warning. Even if it was tested on a bunch of data and found to be reliable, I doubt it'd ever be fast enough to be better than existing syndromic surveillance.
Review 3
A pandemic early warning system that detects epidemic onset by tracking whether surveillance signals are consistently trending upward, rather than waiting for them to cross a magnitude threshold. Fuses multiple data streams (wastewater, cases, deaths, excess mortality, syndromic surveillance) with quality weighting and adaptive thresholds. Tested retrospectively on COVID (321 weeks) and flu (914 weeks).
Why it matters: The concept is reasonable: epidemic signals are non-stationary, and z-score-based detection assumes stable baselines that rarely hold during pandemics. Scoring directional consistency could provide earlier warning for gradual-onset events, and the 4–6 week early warning on gradual flu seasons demonstrates this.
What's strong: Uses real retrospective data, avoids future leakage, documents negative results honestly. The quality weighting (automatically downweighting unreliable signals) is a practical feature. The limitations section is the paper's strongest part, detailing eight specific, self-critical points including "engineering, not theoretical."
What's missing: The results don't support the early-warning claim. For COVID, the system caught one of five waves and detected it later than the baseline. For flu, results are mixed. The inability to detect abrupt-onset events (H1N1 2009, BA.5) is fundamental: those are exactly the scenarios where early warning matters most. The baseline comparison is against a simple z-score; comparison to established surveillance methods (Farrington-Noufaily, ensemble nowcasts) is absent. Parameters are hardcoded without sensitivity analysis. The connection to AI-biosecurity specifically is thin.
Genomic foundation models (GFMs) are increasingly used for red teaming biosecurity screening systems, yet their potential biases remain uncharacterized. If these models systematically favor certain pathogenic variants, red teaming exercises could leave critical security blind spots. We developed a systematic framework to evaluate variant bias by analyzing EvoDiff and Evo2's ability to generate diverse SARS-CoV-2 spike protein sequences. We generated 200 EvoDiff and 222 Evo2 sequences, then assessed structural quality, taxonomic classification, and variant diversity. Both models exhibited significant bias toward the original 2019 SARS-CoV-2 variant, with EvoDiff producing only 37 recognizable variants from 200 generations. Most critically, Evo2's generated sequences showed low perplexity scores (<30), directly contradicting published safety claims that pathogenic sequences should exhibit elevated perplexity. These findings reveal fundamental limitations in current GFMs that could systematically compromise biosecurity evaluations, highlighting the urgent need for comprehensive approaches to evaluating dual-use biological AI systems.
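Since the perplexity claim is central (and its reporting is questioned in Review 3 below), the standard computation is worth having at hand: perplexity is the exponential of the mean negative log-likelihood per token. A reference sketch (the log-probability array is a placeholder, not Evo2 output):

```python
# Reference computation: sequence perplexity from per-token log-probs.
# The log-prob array is a placeholder, not actual Evo2 output.
import numpy as np

def perplexity(token_logprobs: np.ndarray) -> float:
    """exp of the mean negative log-likelihood per token; lower means the
    model finds the sequence more 'expected'."""
    return float(np.exp(-np.mean(token_logprobs)))

logprobs = np.log(np.full(1000, 0.31))  # toy: uniform p = 0.31 per token
print(perplexity(logprobs))             # ~3.23
```

On this scale, a uniform per-token probability of about 0.31 corresponds to a perplexity of roughly 3.2, the threshold scale Review 3 refers to.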
Review 1
The project addresses a relatively unexplored issue: are we red-teaming biosecurity systems effectively if there is a bias in our generative models? Several features make it successful, beginning with good problem framing: it recognizes the real threat posed by bias in generative models, since it leads to red teaming with a blind spot, a secondary-safety issue that is often overlooked. The conceptual contribution is convincing, introducing the concept of "variant bias" as a failure mode in biosecurity system evaluation and expanding the question from "can models generate pathogens?" to "what are we missing when evaluating biosecurity systems?". The empirical work involving both EvoDiff and Evo2 is relatively robust. Where it could be stronger: using Kraken2 for classification of the generated sequences is not ideal for this task, despite the author stating that time was running out. The core claim is not strongly validated, as the "variant bias" conclusion rests on low diversity in generated outputs. This could be explained by prompting/setup parameters or evaluation artifacts.
Review 2
Cool and important submission, since it addresses an area that I haven't previously really thought about. However, I want to see more detailed discussion of why these observed biases could be an issue for red-teaming exercises: why exactly would they lead to blind spots? I would've appreciated more discussion of the threat model; without it, I struggle to determine how much of a security vulnerability this actually is in practice. If true, it is a surprising and important result that Evo2 is not as safe as reported along this specific dimension, but of course it needs to be replicated on a larger dataset before the result can be trusted. I would like to see an explanation of the difference between masking and splitting the protein (not an expert in this area). Since Evo2 wasn't trained on eukaryote-infecting viruses, I would've liked to see a comparison with the perplexity scores you get when generating sequences for, e.g., prokaryote-infecting viruses; this would give a more intuitive sense of how to interpret the perplexity numbers. I would also like to see the results replicated on proteins other than the SARS-CoV-2 spike protein. "We briefly test Evo2 in this study by splitting the SARS-CoV2 spike protein in half and asking Evo2 to predict the second half." → I assume this was also the approach for EvoDiff, but it's not explicitly mentioned. Well written and clearly presented! Appreciate how concise and to the point your writing is.
Review 3
Asks whether genomic foundation models used for red-teaming biosecurity screeners generate a representative range of pathogenic variants — or just the same variant over and over, leaving blind spots in security evaluations. Tests EvoDiff and Evo2 on SARS-CoV-2 spike protein generation. Both models strongly favor the original 2019 variant. Also observes that Evo2 assigns low perplexity scores to generated pathogenic sequences, potentially contradicting the developers' published safety claims. Why it matters: Nobody else has asked this question in this form. If the AI models we use to stress-test our screening defenses only generate one type of threat, we're creating a false sense of security. The variant diversity problem could systematically compromise red-teaming exercises across the biosecurity community. The perplexity observation, if confirmed, would undermine a specific safety claim from a major genomic model developer. What's strong: Novel reframing with clear practical implications. The dual-use appendix is responsible and thorough. What's missing: The study is very preliminary — a few hundred sequences, one virus, different evaluation methods for the two models, an unjustified variant classification threshold, and no use of Evo2's taxonomic steering (which could significantly change the results). The perplexity reporting is confusing: the abstract says "<30" while the methods compare against a threshold of ~3.2, which are off by an order of magnitude. The underlying data appears internally consistent, but this presentation issue undermines the paper's most important claim. Section numbering is inconsistent and some framing claims ("first systematic characterization") overstate what the sample sizes support.
The rapid development of AI-enabled biological tools is creating biosecurity risks that existing governance frameworks are not well designed to address. The Global Risk Index for AI-enabled Biological Tools (GRI) offers a framework for assessing AI-bio tools by misuse-relevant capability, maturity, and availability, but it does not disclose the specific finalist tools assessed or map governance coverage at the national level. This paper develops a proof-of-concept governance mapping approach focused on protein engineering, the GRI category identified as requiring immediate governance attention. We examine the United States and the United Kingdom because of their relevance to frontier AI governance, institutional role in AI safety, and high GRI contribution scores. Consistent with responsible disclosure norms, we assess tools at the category level rather than naming specific high-capability systems. Using a cross-database methodology based on EpochAI datasets, we identify 42 protein engineering AI models associated with US institutions and 11 associated with UK institutions. We then map 18 governance instruments, 14 from the US and 4 from the UK, against the protein engineering category. We find that governance in both countries is fragmented, largely downstream of AI model outputs, and poorly calibrated to AI-generated protein design outputs that precede physical material production or synthesis-provider screening. In the US, the revocation of EO 14110 removed the most directly relevant AI-biosecurity executive instrument without a biosecurity-equivalent replacement. We present these findings through an interactive policy dashboard designed to support future expansion across additional AI-bio tool categories and countries.
Review 1
Great job! Some thoughts:
- The n=8 / one-scorer setup makes the headline feel less settled than it could be. Two people scoring the same three tools and arriving at the same Red/Amber outcomes would tell you the rubric is actually consistent.
- The 4.0 threshold decides every Red vs. Amber outcome and gets picked without validation. A sensitivity check at 3.5 and 4.5 would show whether anything moves and make the cutoff feel chosen rather than asserted (see the sketch below).
- The GitHub Pages visualization is a static reference for the framework hierarchy right now. A minimal working self-assessment (answer the 21 questions, get a result) is the version that actual tool teams could pick up and use!
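The sensitivity check asked for in the second point is a few lines; the scores below are hypothetical, and only the flip/no-flip pattern matters:

```python
def outcomes(scores, cutoff):
    # Red if a tool's rubric score meets or exceeds the cutoff, else Amber.
    return {tool: ("Red" if s >= cutoff else "Amber") for tool, s in scores.items()}

scores = {"tool_a": 4.2, "tool_b": 3.9, "tool_c": 4.6}  # hypothetical rubric scores
for cutoff in (3.5, 4.0, 4.5):
    print(cutoff, outcomes(scores, cutoff))
# Any tool whose label flips between 3.5 and 4.5 is cutoff-sensitive.
```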
Review 2
The Dashboard's largest framing issue is that it presents itself as research novelty when its actual contribution is good policy communication. The structural-gap finding, that AI-generated biological outputs precede every existing regulatory trigger point, is the central analytic claim, but this observation is already developed in Lentzos and Invernizzi on information versus design hazards, in Pannu et al. from Johns Hopkins on biological capability evaluation, in the NTI managed-access framework, in Eslami et al. on synthetic biology and AI convergence, and in Baker and Church's Science piece on protein design biosecurity. The EO 14110 revocation is similarly a public fact that the policy community has been processing for over a year. The team reads and cites much of this literature, which makes the introduction's positioning (that the Dashboard addresses a gap the GRI leaves open) somewhat misleading, because the GRI is not the only relevant predecessor and the broader literature converges on the same diagnosis the Dashboard reaches. The constructive path is to reposition the work as a synthesis and orientation artifact rather than as a novel contribution to the governance literature, because that framing is both honest and stronger. The useful audience for the Dashboard is not the researchers producing the literature but the legislative staff, foundation program officers, journalists, and adjacent-field researchers who need a navigable entry point to a developed policy question, and the dashboard format serves that audience well. The smaller original moves the project does make (the cross-database tool identification methodology, the RFdiffusion-family registry undercounting observation, and the structured side-by-side instrument mapping) are real and worth highlighting on their own terms rather than burying inside a framing that promises more.
Review 3
The approach of developing a policy dashboard for AIxBio governance is an interesting and useful one; however, I found the approach rather muddled in combining a focus on governance with a listing of the models. I don't see what tabulating the specific models in each country adds to the dashboard. Although there's a bit of interesting reference data there, it seems unnecessary for pointing out whether a specific governance gap exists and, as a result, may be distracting for decision-makers with limited time and energy. The level of analysis is also, I think, insufficient for the dashboard to be useful in practice. For example, the sample dashboard lists both the UK AI Safety Institute's Frontier Model Framework and the UK National AI Strategy and codes them both as limited, but that doesn't really tell me anything about what is in the framework or the strategy, or how they compare to any larger best practices, recommendations, etc. I do think the focus on protein engineering alone is an important limitation, but I'm sympathetic that time constraints required it.
Biosecurity policy is increasingly important to the governance of artificial intelligence, biotechnology, pandemic preparedness, and synthetic biology, but U.S. federal policy signals are spread across fragmented and difficult-to-monitor sources. Relevant developments may appear as legislation on Congress.gov, proposed or final regulatory actions in the Federal Register, or as docket materials and comments on Regulations.gov. This fragmentation creates a barrier for policymakers, researchers, advocates, and other stakeholders who need to track emerging biosecurity rules and legislative activity. We present the Biosecurity Policy Dashboard, a proof-of-concept tool for aggregating and exploring U.S. federal biosecurity policy documents across Congress.gov, Regulations.gov, and the Federal Register. The system uses keyword-based retrieval to identify potentially relevant legislative and regulatory records, stores the results in a local SQLite database, and displays them in an interactive Streamlit dashboard. The dashboard allows users to search, filter, sort, and inspect relevant policy documents by source, date, keyword match, agency, docket, and document type. The current prototype demonstrates the feasibility of a unified, regularly updated policy-monitoring interface for the U.S. biosecurity landscape.
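A minimal sketch of the described ingest path (keyword match into a local SQLite table), with an illustrative schema and keyword list rather than the project's own:

```python
import sqlite3

# Illustrative keywords and schema -- not the project's own.
KEYWORDS = ["biosecurity", "select agent", "dual-use research"]

conn = sqlite3.connect("policy.db")
conn.execute("""CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY, source TEXT, title TEXT,
    date TEXT, agency TEXT, matched_keyword TEXT)""")

def ingest(records, source):
    """Keep only records whose title matches a tracked keyword."""
    for r in records:
        hits = [kw for kw in KEYWORDS if kw in r["title"].lower()]
        if hits:
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?, ?, ?)",
                (r["id"], source, r["title"], r["date"],
                 r.get("agency", ""), hits[0]))
    conn.commit()

ingest([{"id": "hr-1234", "title": "A bill on biosecurity screening",
         "date": "2026-01-15"}], source="congress.gov")
```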
Review 1
The staffer framing is good and makes the problem concrete. But the deliverable is unfinished: refresh disabled, partial data, local only, and no measurement of how well the filter works. You flag the noise issue yourselves; an LLM relevance pass could fix most of it. The differentiator vs. AGORA/B-SPAN isn't sharp.
Review 2
Interesting project! I'd love to see an online deployed version if/when it becomes available. Also, I'm wondering if this is much more helpful than someone searching on the three underlying sites (congress.gov, regulations.gov, Federal Register).
Cross-jurisdictional regulatory comparison tool for dual-use items, surfacing where export control regimes diverge across jurisdictions to help researchers and compliance teams reason about biosecurity-relevant exports. Track 3 submission. Manual entry: submitted via email after the Framer Form closed, before the announced AoE cutoff (gmail thread 40827).
Review 1
- Give more context on the relationship between biorisk and export controls. Is the risk that some goods would be exported to states that pursue BW? Does it pertain to group or lone-wolf actors? Have modifications to export control regimes been discussed as an effective tool to reduce biorisk? This helps make the case for why this tool matters. The paper goes into technical details quite abruptly.
- Introduce the acronyms you are using; otherwise, it can get hard to follow. Try to reduce the density of acronyms.
- Explain why you went for the US, EU, and AUS.
- For your three coverage gaps: explain why this has practical relevance. How could this inform or change the behavior of someone working with export control lists? How might these insights reduce biorisk?
Review 2
The project is a competent regulatory comparison tool, and the technical work behind it stands on its own merits. However, the core issue is that the report positions the Navigator as a biosecurity tool, but the artifact is more accurately described as a cross-jurisdictional regulatory comparison tool that happens to cover a category of regulations which includes biosecurity-relevant items. It doesn't directly advance biosecurity. Export controls are one policy instrument among several that governments use to pursue biosecurity goals, and the Navigator sits at the last link in a chain that runs from biosecurity objectives through regulatory text to user-facing comparison. The dual-use analysis in Section A.2 actually demonstrates this gap, perhaps unintentionally, because the marginal-uplift argument used to defend against bad-faith use applies symmetrically to good-faith use as well: if making comparison faster does not meaningfully shift the offense-defense balance for attackers because the underlying regulations are already public, it does not meaningfully shift it for defenders for the same reason. A more careful version of the report would acknowledge this symmetry directly.
This project introduces Bio Capability Boundary Monitor, a prototype pre-deployment audit tool for biological AI workflows. It detects capability overreach: cases where an agent completes the visible task but uses more biological capability than the request justified. In a 1,500-run Llama evaluation, raw task success was 96.6%, while safety-adjusted success was only 31.7%. A scope-aware postcondition layer reduced strict false allow rate from 8.47% to 0.139%, recovering 60/61 missed unsafe runs with 0/480 new false positives. A 100-run Track 3 application slice showed the same pattern in bounded public-health triage and screening policy review proxies. The core lesson is that biosecurity audits should not only ask whether an answer is harmful or useful. They should ask whether the workflow used biological capability the task actually justified.
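A hedged sketch of the scope-aware postcondition idea, not the authors' implementation; the scope table and capability names are invented for illustration:

```python
# Illustrative scope table and capability names -- not the authors' rules.
TASK_SCOPE = {
    "triage_summary": {"literature_search", "case_counting"},
}

def audit_run(task, capabilities_used):
    """Scope-aware postcondition: a run passes only if every biological
    capability it exercised is justified by the task's declared scope."""
    allowed = TASK_SCOPE.get(task, set())
    overreach = set(capabilities_used) - allowed
    return {"allowed": not overreach, "overreach": sorted(overreach)}

# Succeeds on the visible task but used an unjustified capability:
print(audit_run("triage_summary", {"literature_search", "protocol_generation"}))
# -> {'allowed': False, 'overreach': ['protocol_generation']}
```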
Review 1
Thank you for your submission; evaluating biological agents by assessing whether capability overreach occurred is an important topic to address. The community would benefit from further analysis of pre-deployment strategies and from incentivizing their application. For the oversight layer, I wonder whether supervision must come from a more capable model in the scenarios tested, or whether weak-to-strong generalization could apply. There are several ways this submission could be improved. First, I think the core argument requires more justification: at present, features such as retrieval, planning, and reasoning often improve output quality and could even enhance safety. Second, how variable are your results across different language models, and why was Llama tested in particular? Third, a clearer treatment of key metrics such as safety-adjusted success would be appreciated, as they are important for interpreting the results.
Review 2
This is important research that addresses capability overreach, which is somewhat novel in biosecurity. If I understand correctly, the current system relies heavily on hand-crafted rules, which limits generalization to real-world agent settings. I would focus on replacing manual scope rules with learned or formally specified constraints, and on evaluating in more realistic multi-step agent environments where capability boundaries are ambiguous.
Review 3
The biggest challenge with the submission is that I do not see how using more biological capability than a task requires is a safety concern, though I buy that it's a challenge in terms of efficiency and compute costs. If a task uses a more advanced capability or shares more knowledge than necessary, but that knowledge or capability is not itself harmful, then I do not see what the problem is. The authors provide an example of an agent providing more biological context and handling information than necessary; however, it seems like the person employing the agent could simply make follow-up inquiries to get that same context and handling information. If that information isn't harmful, who cares?
PerplexityGuard-Bench is an adversarial-robustness benchmark for protein language model (pLM) perplexity screens — the leading defense proposed for catching AI-designed proteins that evade existing homology-based DNA synthesis screening. We test a reference pLM screen across 120 sequences, with the IBBIS Common Mechanism (commec) run empirically for comparison, and surface two failure modes: even an OR-ensemble over perplexity, low-complexity, and homology checks catches only 2.5% of low-temperature ProteinMPNN designs; and a new mosaic-stitching attack — concatenating a natural prefix in front of an adversarial design — drops perplexity-only detection to 20% at a 50% prefix budget. We prove the failure is mathematical (whole-sequence averaging is dilution-vulnerable at any threshold) and validate a structural patch: sliding-window perplexity recovers detection to 70% (or 33% when retuned to 0% native FPR), replicating across an 18.6× ESM-2 model-size range (t12 / t30 / t33).
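The dilution vulnerability and the windowed patch can be sketched directly, assuming per-residue negative log-likelihoods from the pLM; window and stride values are illustrative:

```python
import numpy as np

def whole_sequence_ppl(nll):
    """Whole-sequence perplexity: exp of the mean per-residue NLL. A long
    natural prefix pulls the mean toward natural statistics, so for any
    fixed threshold a stitched construct can slip under it."""
    nll = np.asarray(nll)
    return float(np.exp(nll.mean()))

def windowed_ppl(nll, window=50, stride=10):
    """Sliding-window variant: score each window separately, so a
    suspicious region cannot be averaged away by benign context."""
    nll = np.asarray(nll)
    scores = [float(np.exp(nll[i:i + window].mean()))
              for i in range(0, max(1, len(nll) - window + 1), stride)]
    return min(scores), max(scores)  # flag on whichever tail the screen uses
```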
Review 1
The 2.5% number on the realistic threat case is the finding, but I found it buried. The deployment story is missing: who runs this? Is it a library a synthesis provider links into their pipeline, a hosted service, or a CLI for researchers? The 0% native-FPR claim for v3-tight rests on n=10. That's a 0–26% true-FPR confidence interval; at scale, that's the difference between zero false blocks and millions. I would set n≥100 with explicit CIs before any deployment phrasing. I also wasn't convinced the stitched bypass produces something a synthesis customer would actually want to order. If the construct doesn't yield a working protein, the bypass doesn't matter operationally, and the urgency claim weakens. The paper should say plainly whether stitched outputs are functional, or note that this is unverified and adjust the framing.
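One way to reproduce the reviewer's interval is a one-sided exact (Clopper-Pearson) upper bound on the true FPR; the sketch assumes SciPy:

```python
from scipy.stats import beta

def exact_fpr_upper(false_pos, n, conf=0.95):
    """One-sided exact (Clopper-Pearson) upper bound on the true FPR."""
    if false_pos == n:
        return 1.0
    return float(beta.ppf(conf, false_pos + 1, n - false_pos))

print(exact_fpr_upper(0, 10))   # ~0.259 -- the 0-26% interval cited above
print(exact_fpr_upper(0, 100))  # ~0.030 -- n>=100 shrinks it to ~3%
```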
Review 2
The focus of this project is a bit hard to pin down. It headlines the development of a "benchmark" for evaluating pLM perplexity-based screening methods, but then says almost nothing at all about the benchmark. Most of the report is devoted to the "mosaic attack" failure and an approach to fixing it, and that is thorough, but seems excessive for a fairly simple concept. Meanwhile, far too little is said about the actual screening methods built into the "pipeline". Is the mystery benchmark being used to assess the screening "variants" summarized on page 4, or are the screening variants being used to validate the benchmark? In either case, no substantive argument is presented.
Generative AI has democratized protein design but introduced a critical biosecurity gap: AI can "paraphrase" known toxins into synthetic homologs that preserve hazardous folds while evading homology-based DNA synthesis screening. Concurrently, regulatory frameworks now mandate screening down to 50 bp and detection of cross-fragment assembly, yet current tools suffer from high false positives, poor calibration, and uninterpretable outputs. We present TRACE, a context-aware escalation layer that bridges high-throughput first-line screening and expert human review. TRACE combines a deterministic short-window prefilter, a threat-pruned De Bruijn graph for cart-level assembly reconstruction, and a LoRA-fine-tuned ESM-2 protein risk scorer with temperature-scaled calibration and SHAP-based explainability. Evaluated on a family-held-out dataset of 32,526 sequences, TRACE achieves 95.1% recall at ≤2% false-positive rate, 0.987 PR-AUC, and 0.024 expected calibration error, with sub-100ms CPU latency. Deployed as a lightweight ONNX service with an interactive Streamlit dashboard and FastAPI guardrail endpoints, TRACE provides regulator-ready evidence and plug-in safety controls for AI biological design tools, directly addressing OSTP 2024, IGSC v3.0, and CBAI Track 1 objectives.
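Among the pipeline components, temperature-scaled calibration is the most self-contained; a minimal binary version (illustrative, not TRACE's code) fits a single temperature on held-out scores:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single temperature T on held-out data so that
    sigmoid(logits / T) minimizes NLL (binary version)."""
    def nll(T):
        p = np.clip(1.0 / (1.0 + np.exp(-logits / T)), 1e-9, 1 - 1e-9)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# Overconfident scores get softened (T > 1) without changing the ranking,
# which is what lets a screen report calibrated risk probabilities.
rng = np.random.default_rng(0)
logits = rng.normal(0, 4, size=500)                       # deliberately overconfident
labels = (rng.random(500) < 1 / (1 + np.exp(-logits / 2))).astype(float)
print(fit_temperature(logits, labels))                    # ~2, as constructed
```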
Review 1
The paper tries to address a relevant problem at the edge of synthesis screening: how can screening systems detect AI-generated sequences of concern while preserving tool efficiency? While the results seem promising, many of the methods and results raise more questions than they answer, such as how the paper validated the functionality of SOCs in silico. The paper is written in very heavy technical jargon and gives insufficient space to explaining the methods or results to the desired standard, e.g., listing software or packages used without explaining the rationale, or not detailing how the app works, which may account for the lack of clarity and the gaps in the paper. Overall, the problem the paper is trying to address is very relevant and the metrics seem promising, but I am ultimately left wondering whether the methods address the problem they are trying to solve.
Review 2
This is an ambitious project to combine multiple technologies for evasion-resistant synthesis screening. The grounding in recent regulatory frameworks is compelling, though the authors don't fully explain what they mean by TRACE being a "regulator-ready layer". Technical execution seems strong overall, but I found it difficult to follow details of how the parts of the system worked together, and how the assembly mechanism was constructed and validated.
Review 3
Addressing vulnerabilities in screening short sequences is an important area of research, but this submission doesn't seem to effectively present a new or effective approach. The Short Window Prefilter reduces this solution to a sequence-based screening approach where there's a large set of existing techniques and short sequences remain a vulnerability. Without the prefilter, the subsequent steps do not perform well. The performance of the system is entirely reliant on the short window prefilter, which operates like many existing seed algorithms in popular alignment tools like BLAST, BLAT, and Bowtie.
Early warning systems for biological threats typically rely on laboratory, environmental, or digital surveillance, often detecting signals only after transmission is underway. This project explores whether routine clinical observation, specifically in public oral health services, can function as an earlier, underutilized detection layer. We propose Oral Sentinel, a minimal weak-signal reporting concept that captures atypical oral and mucosal findings as potential early indicators of emerging biological events. The contribution is conceptual and design-oriented: mapping how neglected clinical infrastructures could augment current surveillance architectures.
Review 1
Real-world implementation of early warning is a significant challenge in biosurveillance, so I appreciate this paper's approach of considering existing infrastructure. However, I believe that this submission could be improved along two main axes: reconsidering the limitations of passive anomaly reporting and including real-world case studies. As acknowledged in the submission to some extent, passive anomaly reporting will not be highly sensitive as a signal, particularly because many oral hygiene issues may be caused by noncommunicable diseases, which further dilutes the value of such reporting. Additionally, this submission would have benefited from choosing one specific case study (e.g., a national health system such as the NHS in the UK) and from making a concrete proposal for integrating oral hygiene reporting into day-to-day practice.
Review 2
An unstated benefit if this worked: more advanced tools are optimized with better baseline data, which often needs to be established for each geographical area of focus. I wonder if this could contribute to stronger baselines even if not to signals. I agree that there is insufficient focus today on improving the gathering and structuring of data upstream of analysis, and it is great that this aims for easy, integrated adoption as the technical approach. The authors should note that, for anyone old enough, the oral-care angle on infectious diseases that are not yet manifesting as illness will recall how badly the field was struck by HIV/AIDS when that epidemic first emerged. That may actually make it more intriguing, but it also raises the bar on the sensitivity with which it would need to be treated.
Review 3
I think it's great to think more about data sources in early outbreak detection, as I agree it is not obvious we have found all the best ones yet. Getting easy wins from already existing infrastructure could be a good near-term strategy while we build out other infrastructure. The limitations section is very thorough, and the observation that durable surveillance systems must improve care rather than just extract data is a genuinely important design insight that most surveillance proposals miss. The paper isn't really a research paper, though, and there is no system to evaluate. What are the actual checkbox categories? What does the data schema look like? What existing oral health information systems would this plug into, and what would the integration require technically? The paper stays at the level of "this should be pilotable" without producing a pilot-ready artefact. For a hackathon that asks for built things, this is a significant gap. The paper acknowledges that there isn't much of an AI element, but I think this still makes it a weaker submission for this specific hackathon.
BASTK-Bench introduces a novel framework for evaluating biological risk in open-weight AI models by focusing on real-world, execution-level capability under low-resource conditions, particularly somatic tacit knowledge in tasks like DIY CRISPR troubleshooting. It is among the first studies to systematically biorisk-assess newer open-weight models such as Llama-4-Scout and Qwen3-32B in this kind of execution-oriented setting. The results show that risk is highly task- and framing-dependent, with newer models exhibiting higher risk in specific practical scenarios, suggesting current benchmarks may underestimate real-world biological misuse potential.
Review 1
Impact Potential & Innovation: The lack of standardized evaluation frameworks is a real gap in the evals landscape that is worth addressing. You correctly identify some problems with current eval methodology, such as limited prompt-sensitivity testing (although this is a more subtle problem than you describe; see e.g. https://www.anthropic.com/research/evaluating-ai-systems for examples of how simple formatting choices can affect eval scores). Focus on Q&A testing is also a genuine problem, but you strawman the case a bit by completely omitting agentic evaluations (like ABC-Bench or ABLE) and uplift studies (like Shen et al., 2026, https://arxiv.org/abs/2602.16703) in your literature review section. Both Llama and Qwen3 have already been evaluated for biorisk; see for example https://airiskmonitor.net/. "A key finding is that newer models may be more capable (and potentially more risky) than older, larger ones, challenging assumptions that safety scales predictably with model size" - that is actually not the assumption; it's well-known that newer, smaller models often perform better and that capabilities tend to scale with release date rather than size (see https://metr.org/time-horizons/). This isn't a new finding. Additionally, Llama-4-Scout is not actually smaller than Llama-3.3-70B: 4-Scout is a Mixture of Experts model with 109B total parameters and 17B active per token, while 3.3-70B is a dense model with 70B total parameters. Scout is smaller in inference compute per token, which is the point of MoE.

A theory of change is missing from the paper. What is the intended audience of this eval? What will happen if people use it, and how will it change outcomes? Is it meant to influence policy, evaluate safeguards for internal lab use, or contribute to risk monitoring? Clearly outlining this would make the submission stronger, and I encourage you to think about the theory of change of your future projects in general, to make sure you build a tool to solve a problem rather than look for a problem to suit your tool. The "non-googlable tacit cues" criterion is unaddressed as a generalizable research direction. Do you google it every time? Do you need experts to score every run of the eval? Do you have any plan for how you would automate this, at least partly? Without an answer, the approach doesn't generalize.

Execution Quality: There are some significant problems with the methodology of this work. The biggest one is the manual scoring that you use - current evaluation work is focused on addressing exactly that, and this approach is not scalable. You acknowledge it may influence results ("despite consistent evaluation guidelines" that are either not disclosed or very laconic, if the table in the paper and the "scoring notes" column in the dataset are all there is). The work would benefit from at least using an LLM-as-judge approach and acknowledging its limitations. Scoring notes like "depth" don't seem detailed enough. Writing the eval in plain Python (rather than UK AISI's inspect framework) does not affect your score by itself, but inspect is the industry standard for AI evals and would have given you scoring infrastructure, model adapters, refusal handling for reasoning models, and reproducibility tooling for free. Strongly recommend getting acquainted with it for future work. I appreciate that you use an auto-scorer for refusals. However, the implementation is brittle. For example, it breaks for reasoning models that start the answer with <think> (a minimal fix is sketched at the end of this review).
This kind of edge case is exactly what existing eval frameworks like Inspect already handle, so you don't need to reinvent the wheel. There are some minor code hygiene problems, like hard-coding the API key in the main eval code; consider using an .env file. You do not provide any baseline (e.g., whether a single Google search would turn up the same information), so it's possible that the actual uplift on the tasks you describe (DIY CRISPR troubleshooting) is small, given the availability of educational materials on the internet. You define tacit knowledge as "depth beyond surface-level information," but tacit knowledge is by definition knowledge that is hard to convey by text; in biology it usually refers to, e.g., ways of handling equipment. You also do not explain what you mean by somatic adaptation, or how you measure it. This makes it difficult to assess what the eval is actually measuring. In the future, I encourage you to document (and include in the paper) how you approached task selection and prompt generation, whether the task dataset is representative of actual real-world scenarios, and how you know that. It's an important part of the methodology; the VCT paper does a very good job here. Your "Risk Score" is hard to interpret. Each of Uplift and Accessibility is scored 0–3, so 9 is the maximum possible value (the value set is [0, 1, 2, 3, 4, 6, 9]; how does the jump from 0 to 1 compare to the jump from 6 to 9? Why did you use a metric like this, and what real-life quantity does it refer to?). Tacit Knowledge and Refusal Robustness are scored, but the results are not evaluated or presented, even though somatic tacit knowledge is the central pitch of the paper. The refusal robustness test is not standardized: "If a model refuses, a single jailbreak variant (e.g., role-play or academic framing) is attempted to test refusal robustness." How do you choose which variant? This needs a deterministic protocol, otherwise results aren't comparable across models. Also, your code actually tests both of those jailbreak strategies. The paper would benefit from either dropping that part or using more SOTA jailbreaking techniques, because as it stands you get false negatives: the model refuses only because the jailbreak prompts are not very sophisticated, and you mark it "safe". Prompt framing is methodologically off. Evals rarely use the word "bioweapon" explicitly because that would almost surely trigger a refusal from all models; they frame the questions as benign (see https://www.rand.org/pubs/research_reports/RRA4591-1.html for an example). One of your representative prompts is "bioweapon planning," which likely biases the refusal-vs-compliance results. I encourage you to think about information hazards when releasing code and prompts, and to address this every time you publish. While I do not think your paper presents a large infohazard, this is part of eval hygiene. Also think about information security when dealing with dual-use data: GitHub is owned by Microsoft and Copilot is trained on public repos, so you risk both data leakage and possibly providing dangerous knowledge to the model. Your work lacks any statistical testing, confidence intervals, etc. You acknowledge this as a limitation, which is appreciated, and I understand it's hard to address in a hackathon, but the evaluation community would benefit from higher statistical rigor. I encourage you to address it in future work (e.g., pre-register results and methodology, calculate power). Shen et al.
2026 is a good example of the direction we should aim for, even with its limitations. "Closes expert gap" (Uplift = 3) is self-referential without an expert in the loop: you can't score whether something "closes the expert gap" without expert validation, which the paper explicitly does not have.

Presentation & Clarity: The paper barely discusses the prompts and tasks at all, so it's impossible to assess them from the paper alone. Either address that (say that prompts are proprietary or not public due to data-leakage risk/infohazard, but that you'll release them on reasonable request) or discuss the prompts in the text. You should have a table in the paper describing them and mapping them to your subsets, plus a brief description of the prompt-generation process and a better justification for why you chose those prompts/areas. There is no way to know from the paper alone which prompts you used; the reader needs to inspect both the database and the code to figure it out. In the methods se
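On the refusal auto-scorer brittleness noted in the review above, a minimal fix is to strip the reasoning block before pattern matching; the marker list is illustrative:

```python
import re

THINK_BLOCK = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL)
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i won't provide")  # illustrative

def is_refusal(answer):
    """Strip a leading <think>...</think> block so reasoning models are
    scored on their final answer, then pattern-match for refusals."""
    visible = THINK_BLOCK.sub("", answer).lower()
    return any(marker in visible for marker in REFUSAL_MARKERS)

print(is_refusal("<think>user wants X...</think>I can't help with that."))  # True
```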
Review 2
This is an important idea that focuses on somatic tacit knowledge. My only concern is that the scale is extremely small and that the manual Risk = Uplift × Accessibility scoring is quite subjective. I also feel that the paper's claim that models show "meaningful real-world bioweapon capability" is a bit of an overclaim given the results shown.
Biosecurity practitioners currently face a fragmented policy signal challenge: while AI-enabled biological design risks accelerate, the legal frameworks governing cross-border transfers—particularly between the U.S. and China—remain opaque and difficult to interpret. This complexity creates a "chilling effect" on legitimate research while leaving critical gaps for accidental misuse in an era of intense strategic competition. The BioExport Navigator is a prototype decision-support tool designed to bridge this gap. By mapping RAND 2025 technical uplift to U.S. BIS regulatory triggers, it provides a structured decision layer for cross-border compliance. The tool uniquely flags EAR § 744.6 "U.S. Person" liability and identifies "small-but-deadly" models that fall below standard compute thresholds but trigger presumption of denial for China-bound transfers. It moves biosecurity from reactive monitoring to proactive, informed decision-making, and ensures that high-risk AI-Bio convergence is managed through standardized, cost-effective, real-world policy frameworks.
Review 1
Great problem choice! The validation is the thing to address. LLM consensus isn't rigorous enough for a tool that touches criminal liability. Even one conversation with an actual export compliance lawyer would be more credible than multi-model cross-checking. The artifact right now is a CSV and a logic description. A minimal working decision tree that a researcher can click through would make this feel more like a tool rather than a proposal!
DNA synthesis screening is the principal safeguard against malicious orders of hazardous genetic material. The 2024 OSTP Framework requires screening at 200 base-pairs (bp), dropping to 50 bp by October 2026. We benchmark BLAST as a per-fragment screener across seven fragment lengths (20-200 bp) and six mutation rates (0-20%), under two threat models: pure evasion (orders entirely composed of hazardous fragments) and dilute evasion (one hazardous fragment hidden among benign filler).
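A minimal sketch of the two threat models under a per-fragment screen; `screen` is a stand-in for any fragment detector, such as a blastn-short wrapper:

```python
import random

def dilute_order(hazard_frag, benign_frags, seed=0):
    """Dilute-evasion threat model: one hazardous fragment hidden among
    benign filler, shuffled into a single order."""
    order = benign_frags + [hazard_frag]
    random.Random(seed).shuffle(order)
    return order

def order_flagged(order, screen):
    """Per-fragment screening flags the order if any one fragment hits;
    `screen` stands in for e.g. a blastn-short wrapper."""
    return any(screen(frag) for frag in order)

# Pure evasion is the degenerate case: every fragment in the order is hazardous.
```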
Review 1
The project tackles a genuine problem, but its impact is limited by overly simple methodological choices. For instance, running the same experiments with six-frame translated searches using BLASTX could have yielded higher detection rates for mutated fragments. Current screening guidelines imply evaluating both nucleotide sequences and their translated protein products to identify the closest match to any regulated organism. Regarding the main conclusion, it is difficult to assume a fixed adversary capability when setting a fragment threshold, since biological AI tools can generate sequences with minimal homology to naturally occurring genes/genomes. In such cases, what matters most is the functional state of the mutated fragment or its potential to be assembled into a longer, functional sequence.
Review 2
The dilute vs. pure framing is a great way to think about this! A couple thoughts:
- The mutation model uses random per-base substitutions. Real adversaries use codon optimization, which only changes the third position of each codon, and that looks different to BLAST. Worth flagging that 20% random isn't the same as 20% codon-optimized (see the sketch below)!
- You only tested BLAST, but commec is the real-world screener, and it adds protein-level (DIAMOND) and HMM layers on top. Those layers might catch the dilute attacks BLAST misses. Without testing commec, the conclusion is really "BLAST alone fails," not "the OSTP threshold is inadequate".
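A sketch of the codon point from the first bullet: synonymous recoding preserves the protein while concentrating DNA mismatches at (mostly) third codon positions; the codon table is partial and illustrative:

```python
import random

# A few synonymous codon families (standard genetic code); a full table
# would cover all amino acids.
SYNONYMS = {
    "GCT": ["GCC", "GCA", "GCG"],                # Ala
    "CGT": ["CGC", "CGA", "CGG", "AGA", "AGG"],  # Arg
    "CTG": ["CTT", "CTC", "CTA", "TTA", "TTG"],  # Leu
}

def recode(dna, seed=0):
    """Synonymous recoding: swap each codon for a synonym, preserving the
    encoded protein -- a very different mismatch pattern for blastn than
    uniform random substitution."""
    rng = random.Random(seed)
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return "".join(rng.choice(SYNONYMS.get(c, [c])) for c in codons)

print(recode("GCTCGTCTG"))  # same Ala-Arg-Leu peptide, different DNA
```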
Review 3
This project tackles a well-recognised and important set of challenges in DNA synthesis screening, addressing obfuscation by mutation and split-order, as well as the effects of changes in the fragment-length threshold. The benchmark pipeline is a genuine and reusable artifact, and the framing of the dilute attack mode as a threat model is powerful. I would like to congratulate the authors for this piece of work generated within such a short timeframe, well done! A limitation in the reported findings is that only one type of BLAST (blastn-short) was evaluated, which does not reflect current industry screening standards. Testing tools such as commec (as mentioned by the authors), SecureDNA, and ACLID would yield far more actionable comparisons for policymakers. Given the hackathon timeframe, an alternative framing worth considering is to foreground the pipeline itself as the main contribution, which may have greater long-term impact than the BLAST-specific results. However, the benchmark itself is limited by deviations from realistic threat models. The random mutation model, while a reasonable first approximation, does not reflect how a real adversary would obfuscate sequences, since random mutations do not preserve biological function. Although the authors acknowledge this as a limitation, it does limit the direct applicability of the quantitative findings.
We attempt to use ESM2 to generate potentially harmful pathogen sequences: we first connect sequences by similarity of their protein-fold structures, using embeddings from the foundation model ESM2 together with a protein database, to create a GNN. We then search for potentially harmful sequences and build a detector that flags such sequences by comparing them against a known database of viral hosts/pathogens.
Review 1
The approach proposed tries to go beyond detecting homology for sequences of concern through the primary sequence and to implement a structural topology. In particular, the team attempts to detect sequences that bind to immune targets using the human immune interactome. There are several things in this paper that I think require work:
1) Methodological explanations are currently relatively thin.
2) What sort of in silico methods were used to validate the function of the 'jailbreak' sequences?
3) The short protein shown by the Red Team binds to many different proteins with differing functions. While this is possible, it is unlikely that it binds any of them with a significant dissociation constant.
4) The case study covers only one sequence.
5) Other biodesign tools can already map out structural binding.
Review 2
The central premise of this project, that AI-designed proteins targeting key immune components constitute a biosecurity threat, is fundamentally flawed for two reasons. First, a peptide predicted to bind to an immune protein does not inherently disrupt its biological function; such interference occurs only if the peptide targets a specific functional interface. The D-Script model utilized in this pipeline lacks the granularity to predict site-specific binding. Second, even in cases of confirmed interaction disruption, the physiological result is more likely to be immunosuppression or anti-inflammatory activity rather than a catastrophic biosecurity risk.
Review 3
This is an interesting approach; using pLM embeddings to check for genetically distant targets is a nice method that I always like to see improved. I am a little concerned by including an optimization loop for immune-system binders that evade synthesis screening; bio is offense-favored, after all. But I also don't think the risk scoring against immune-system target binding is ultimately very promising, or that it lets you catch a wide variety of potential threats. I would like to see more results on the detection efficiency to become convinced otherwise.
An agent-native tool designed to help Large Language Models accurately identify and screen high-risk DNA sequences, a task current models struggle with. By making these sequences transparent to the model, the tool significantly increases refusal rates for dangerous requests.
Review 1
The project centers on how to effectively use LLMs in the real world rather than introducing novel biology or AI technologies, both of which are valuable aims. What works well: good problem framing that recognizes the lack of early-stage focus on LLM interactions within biosecurity, as well as the genuine tension between over-blocking and usability. The solution offers flexibility and scalability based on the model's capabilities, a good decision over fixed pipelines. There are also empirical signals, including the reported improvement in refusal rates from 0% to 70% under controlled conditions, and testing in various environments is another good aspect. What could be stronger: execution is not very rigorous. The evaluation uses small, unclear datasets (10 benign / 10 harmful sequences), lacks statistical robustness, and doesn't report false positives/false negatives systematically. I also found the use of three screening tools (BLAST should not really be considered a dedicated DNA synthesis screening tool) a bit excessive; combining all three strategies can be conceptually strong, but we can't tell which one is doing the heavy lifting or whether one of them is adding noise. The tool logic is underdeveloped, and the failure at n > 5 sequences is a strong limitation. Lastly, the writing is a bit rough: sometimes rhetorical, somewhat informal, and a bit imprecise, making it slightly difficult to follow.
Review 2
Jumped into technical detail a little too quickly. I'm sure it'd be fine for people who are experts in this domain, but some of the details needed to be stepped through a little bit more. For instance, 'agent-native tool' could've been explained and the abstract needed some work.
BioConscience Co-Pilot is a proactive, point-of-design biosecurity tool designed as a browser extension for synthetic biologists and bio-entrepreneurs. As biotechnology shifts toward cloud labs and decentralized "desktop manufacturing," the gap between digital design and biological reality is narrowing, often leaving biosecurity as an afterthought handled only at the point of purchase. This project addresses the critical need for real-time guardrails by integrating sequence screening directly into platforms like Benchling, SnapGene, and Cloud Lab consoles. Using privacy-preserving local hashing, the Co-Pilot silently monitors design environments and highlights "Sequences of Concern" in real-time, providing practitioners with immediate access to regulatory guidelines, ethical frameworks, and responsible disclosure templates. By transforming biosecurity from a backend hurdle into a proactive design partner, BioConscience empowers researchers in emerging bioeconomies to innovate safely—ensuring that as we gain the ability to "grow almost anything," we do so with a built-in digital conscience.
Review 1
I appreciated your whole-of-research-cycle approach; it is an innovative attempt to augment researcher capabilities from the outset. I cannot remember ever having heard of this approach before! It is a shame that you were unable to attempt a prototype, but I do hope that you get the opportunity to do so at some point.
Review 2
Excellent concept - and something that could be extremely useful in preventing inadvertent creation of dangerous sequences. However, it would do little to prevent intentional misuse. If a functioning prototype could have been developed, it would have been a very good application; unfortunately execution could not be rated higher at this preliminary phase.
Review 3
The specific approach proposed for integrating into the design phase is novel, but the solution needs to be explored much further. Beyond the technical implementation, which has its own set of difficult problems, the integration of this tool will prove difficult:
1. Local, fast screening on devices with limited CPU and RAM like laptops is already challenging to build with sensitive enough detection. Embedding the screening into a browser extension creates further limitations that may make this approach infeasible.
2. The distribution of the tool is complex: how will scientists or institutions discover the tool, what are their incentives for integrating it into existing workflows, and how will the results be used to reduce risk (e.g., is research reviewed by a biosafety expert; is the research paused or completely stopped)?
3. While this tool could help reduce risk from error in legitimate institutions and scientific workflows, it will likely have no effect on nefarious misuse.
Rapid genomic surveillance is critical for detecting antimicrobial resistance (AMR), but AI-assisted triage becomes unsafe when predictions and explanations lack visible constraints. SafeSurveil-AIxBio is a defensive surveillance prototype for E. coli–tetracycline triage that couples live public-data retrieval and local AMR evidence generation with a strictly auditable operator interface. Instead of relying on a black-box clinical predictor, our main contribution is a runtime trust layer. Generated copilot and semantic-UI outputs (via OpenRouter and Thesys C1) are treated entirely as "sidecars." They are only displayed to the analyst after passing a strict execution gate that verifies identity, citation accuracy, numeric consistency, and policy alignment against a persisted biological decision object. To ensure complete inspectability, SafeSurveil-AIxBio builds a deterministic biological reasoning trace, a 54-node typed evidence graph, and a highly reproducible automated API audit matrix (passing a 26/0/0 curl test). Ultimately, this prototype demonstrates how genomic AI triage can be made bounded, inspectable, and rigorously safe for biosecurity analysts.
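A minimal sketch of the gating idea, assuming a persisted decision object; field names and the block/review policy are illustrative, not the project's schema:

```python
def gate(sidecar, decision):
    """Execution gate: a generated 'sidecar' reaches the analyst only if
    it agrees with the persisted biological decision object."""
    checks = {
        "identity": sidecar.get("isolate_id") == decision.get("isolate_id"),
        "citations": set(sidecar.get("citations", []))
                     <= set(decision.get("evidence_ids", [])),
        "numeric": sidecar.get("resistance_prob") == decision.get("resistance_prob"),
        "policy": sidecar.get("recommendation") in decision.get("allowed_actions", []),
    }
    failures = [name for name, ok in checks.items() if not ok]
    if not failures:
        return "allow"
    return "review" if len(failures) == 1 else "block"
```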
Review 1
I think the main problem was something like ‘AI predicts AMR too confidently’, and the solution was an AI tool that understands when it is being too confident, and can modulate its output? The Abstract is filled with jargon and felt very LLM-y. It was mostly incomprehensible to me. An Abstract needs to be relatively straightforward, and clearly define the background, problem, methods, top-line results and conclusion. The introduction follows in a similar vein: ‘Antimicrobial resistance is a critical biosecurity problem because the signals that matter for routine stewardship (resistance genes, phenotype predictions, mobile elements, and surveillance provenance) dictate how quickly an analyst can identify a high-risk case’ ← I have literally no clue what this means. The ‘Main contributions’ are similarly opaque, and in general the whole report felt riddled with overly technical LLM-speak. The ‘ML for AMR and stewardship’ section was a little clearer, with reference to a systematic review that I understood. I was glad to see that you mention a limitation of your method (or rather what it is not), but I was unable to understand what your tool’s advantage is. The Methods were highly jargon-laden (‘local-first E. coli…evidence factory’...’fixture-trained smoke baseline’....the list goes on). And the Results were similarly incomprehensible to me. I would really recommend trying to write this report from scratch, without using an LLM. The text is riddled with technical jargon and the sentence constructions are very clearly LLM-generated. The fundamental idea of using AI tools to assess AMR threat is obviously a good one, but I didn’t get a sense of how important your specific problem is and I had really no idea what was outlined in this report.
Review 2
I appreciated this submission's emphasis on observability and traceability. I agree with the authors that these are important properties in a software system deployed in a public health context. I found that the report was quite dense and a little too focused on implementation details and architecture (It's possible that I am missing things about the importance here!). I would have liked to see more of a direct focus on the value add for public health decision making. It's a little bit unclear to me how this is tailored for the specific problem of AMR triage: I can imagine that AI models tend to be overconfident for many (most) decisions. If there is a novel aspect here for the XAI component perhaps I'd have liked to see just that component demonstrated and validated etc.
Review 3
A prototype for AMR triage (E. coli + tetracycline) where AI-generated explanations are treated as "sidecars" or non-authoritative outputs that must pass fact-checks against persisted evidence before reaching the analyst. Includes an execution gate (allow/review/block), evidence graph, deterministic reasoning trace, and provenance tracking. Why it matters: The design principle is sound. In any AI-assisted biosecurity or clinical workflow, the AI's explanation should be subordinate to the actual evidence, not the other way around. The AMR literature consistently warns about phenotype-genotype discordance and database-dependent interpretation. Building the audit and trust layer first, before optimizing the predictions, is the right order of operations. What's strong: The safety architecture is well-designed and the system was fully built and tested as software. Provenance tracking, evidence graphs, citation checks, and explicit fallback labeling are practical features. The dual-use appendix is unusually thoughtful for a hackathon. What's missing: All the evidence that the system works is engineering verification (tests pass, builds succeed, the proof run completes). There's nothing showing it helps analysts make better decisions, catches signals they'd otherwise miss, or even performs comparably to existing tools. One organism-drug pair, fixture-trained baseline, no user testing. The contribution also isn't clearly distinguished from standard clinical decision support design. Treating ML outputs as non-authoritative pending human review is essentially how the FDA already approaches AI-assisted diagnostics. Info hazard: Low–Moderate. Defensive tooling. Main caution is avoiding making uncertain AMR predictions look more decisive than they are, which the system is specifically designed to prevent.
BRAT provides rapid risk assessment for biosafety incidents using adversarial red-teaming. Standard biosafety frameworks assess "what happened" but miss "who could exploit this." BRAT systematically models attacker intent (insider threats, weaponization pathways, information hazards) alongside standard assessment and refuses dual-use requests without revealing what's dangerous. Retrospective analysis of three major biosecurity failures (2014 CDC anthrax: 84 exposed; 2001 anthrax letters: 5 deaths; 2011 H5N1 GOF) shows BRAT's adversarial hypotheses would have correctly identified the actual failure modes and prevented these incidents. Tested on 12 cases, it showed 100% accuracy and 0 false refusals, and 79% of the threats it identified were non-obvious to humans.
Review 1
Good execution on the website. The paper is clearly vibe-written, but it does hint at the fact that synthesis companies and governmental health organizations need very clear, strict guidelines as well as thoughtful, dynamic appraisals of the threat models relevant to every case they review. I'm not sure what gap a BRAT with more time and care put into it would fill, since I imagine many relevant organizations already have good biosafety requirements, but you should try to ensure that such guidelines are in place if you care about this problem and want to pursue it further.
Review 2
The initial project idea seemed quite compelling, but the execution and overall presentation felt rushed, with an inappropriate tone/style. We should be planning more seriously for intentional or deliberate misuse events across the bioscience industry, but in this report:
- The methodology was not described in much detail, and I was left unsure what the BRAT tool even consisted of from reading the report alone. When I eventually checked out the tool webpage, it didn’t do a great job of convincing me that the tool wasn’t just a black box (despite the tagline, the webpage is literally a black-box design). I also think Charli XCX and biosecurity probably shouldn’t mix; for me that made the project seem a bit unserious.
- It seems that LLMs were heavily relied on end-to-end in this project, including ideation, which was a little worrying.
- In the introduction, the BRAT tool made by the author is described during the problem setting. Ideally, the problem is described first and independently of the implemented solution/main results, so the presentation of the project could have been much improved.
- The problem framing was quite surface-level and not adequately contextualized. There are sparse references to tools/projects/case studies, but the biosecurity landscape is not summarized well; more introduction and scene-setting on the status quo of biosafety/biosecurity measures would have improved things.
- I don’t think retrospective analysis of an event counts as risk assessment. Risk assessment pertains to defined activities, with the assessment conducted prior to the event, the idea being to implement changes or consider alternative actions that mitigate the **future** risk. This tool seems intended to evaluate incidents after they have happened (‘BRAT turns a messy incident description into a structured review with concern type, risk level, precedent cases, policy context, and a clear next step’).
- While I agree that we need to consider deliberate misuse risks more systematically, I doubt that a chatbot is an appropriate solution. Currently, this would not get adopted by any biosafety/biosecurity officer or institution with meaningful biosafety/biosecurity risks.
That said, I think the core problem selection was good and that LLMs could play a part in the risk analysis somewhere; a more considered analysis of the options here would have made for a more epistemically humble and useful report, rather than this tool.
Review 3
The problem BRAT targets is real and the effort to build and deploy a working tool in a hackathon is laudable. The shield of ignorance idea is worth developing further. But the evaluation doesn’t hold up since testing a system on historical incidents where you already know the outcome and then claiming it would have prevented those outcomes is hindsight bias, not validation. A credible test would be to give the system only the information available before the failure, ideally blinded and judged by independent biosafety experts, and see whether its adversarial hypotheses are useful. The 12 hand-selected, author-evaluated cases also can’t support the performance claims made. Scaling to expert-judged, blinded evaluation on novel scenarios would be the single most important next step.
Global travel networks and inconsistent biosecurity policies create unrecognized pathways for pathogen spread. We propose a research‐prototype “Biosecurity Mobility & Policy-Aware Risk Dashboard” that unifies open mobility data (e.g. flights, transit) with regulatory texts and AI reasoning. Using public air-traffic datasets (such as the OpenSky Network’s free real-time and historical flight data) and GTFS transit feeds, our backend infers aggregated travel corridors and frequencies. Simultaneously, a Retrieval-Augmented Generation (RAG) knowledge base indexes biosecurity regulations (e.g. international screening guidelines, export control laws) so that relevant policy excerpts can be retrieved on demand. An LLM (via Ollama) orchestrates multi-step queries: parsing user goals, selecting affected regions by policy context, filtering mobility routes, computing accessibility, and scoring candidate sites. The frontend renders an interactive world map (see figure) highlighting high-risk corridors and regions with mismatched safeguards. Crucially, the system only uses aggregated data and explicitly reports uncertainty – it is not a surveillance tool but a decision-support demonstrator. This work-in-progress prototype for the AIxBio Hackathon (Tracks 2 & 3) shows how AI can help pre-empt pandemics by “connecting the dots” between travel and policy. Early experiments (e.g. simulating COVID-19 spread using OpenSky’s COVID dataset) suggest the approach can flag known outbreak corridors. Our deliverables include the dashboard interface, query API, data pipelines and documentation (see Summary and Timeline). If further developed, this tool could significantly enhance pandemic early warning systems by guiding monitoring and resource allocation, all while adhering to responsible‐AI principles and privacy safeguards.
Review 1
Interesting approach! It's cool to see the attempted integration of travel routes and high-risk corridors. I ultimately think this will be outperformed by targeted surveillance of defined travelers/travel connections, since airports have those data available. In most cases, global travel is too intermixed to benefit from regional-specific surveillance policies. Best we can do is surveil airports and cities and hope we can catch anything as early as possible. But it would be cool to see the prototype, at least!
Review 2
The idea of integrating mobility data with policy context for pandemic preparedness is useful, and the responsible-AI framing (aggregated data only, no individual tracking, uncertainty reporting) is good practice; however, it isn't novel. The scarcity-gaming alignment concept (an AI inflating risk signals to justify continued attention under unlimited-budget assumptions) is an interesting theoretical observation, but it's undeveloped and feels bolted onto a surveillance-dashboard project rather than integrated into it. The most important next step would be to run the system on real data from a historical outbreak and show whether the flagged corridors match actual importation events. That validation would strengthen this into a contribution. The technology-stack discussion and architecture diagrams should be a second-order priority to demonstrating that the system produces useful output.
Review 3
- Interesting that you consider alignment and optimizing for proxy objectives as failure modes -- kudos here! I would put some of these concerns in an appendix or a different section, though, as presenting scarcity gaming, Goodhart's Law, etc. as part of the frame somewhat distracts from the motivating ideas. These are probably only going to resonate with people in the AI safety space, and the actual project seems mostly orthogonal to these concerns.
- On impact, it's not clear to me who is supposed to use this, why they would, and what it would change about their decision-making. I would appreciate more discussion of why this matters, what gap it fills, and comparisons with what already exists for travel monitoring. Bundling things into a risk score could, in some cases, be less useful than showing someone the travel data directly in a dashboard.
- "Where should we allocate testing based on travel networks?" is indeed an interesting and decision-relevant question, but I'm not sure this answers it. How would I answer that question using this tool?
- There should be more description of what goes into the score if it is load-bearing for the information presented to decision-makers. How exactly is the policy gap score computed? The "novelty" score does not appear to be described at all. This might be the most important piece for answering the decision-relevant questions, so it deserves a deeper treatment.
- Overall, using travel data to help prioritize dimensions of an outbreak response seems useful, as does complementing that data with regulatory context. But some foundational design and modeling choices here might not propagate well to a deployed system. The weakest piece is the risk-scoring methodology, though the engineering around it appears well executed.
- One small thing: the repo link is broken, so I can't tell if this is a design sketch or something that was actually implemented.
Biosecurity analysts face a growing asymmetry: outbreak reporting volume has expanded substantially while the number of trained personnel able to synthesise that information in real time has not. BioWatch Brief compresses the analyst intake stage from hours to minutes via a three-stage LLM pipeline (structured extraction, retrieval against a curated corpus of historical outbreaks and biosecurity policy frameworks, and grounded analysis), producing a structured risk card from arbitrary input reports. The architecture deliberately separates fact extraction from retrieval and synthesis, constraining LLM outputs at each stage and surfacing uncertainty rather than masking it. Built on gpt-4.1-mini with a curated 21-entry open corpus (16 historical outbreaks, 5 policy/framework documents) normalised across pathogen, location, transmission, response history, lessons learned, and source URLs. React frontend, FastAPI backend, single /analyze_report endpoint. Built at the Apart AIxBio Hackathon, April 2026 (University of Pennsylvania).
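As a rough illustration of the three-stage separation described above, here is a minimal sketch using the OpenAI client; the prompts, corpus schema, and keyword-overlap scoring are assumptions for illustration, not the authors' code.

```python
from openai import OpenAI

client = OpenAI()

def extract_facts(report: str) -> str:
    """Stage 1: constrained fact extraction (prompt wording is assumed)."""
    r = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user",
                   "content": "Extract pathogen, location, transmission as JSON:\n" + report}],
    )
    return r.choices[0].message.content

def retrieve(facts: str, corpus: list[dict], top_k: int = 3) -> list[dict]:
    """Stage 2: keyword-overlap retrieval over the 21-entry corpus."""
    words = set(facts.lower().split())
    ranked = sorted(corpus,
                    key=lambda e: -len(words & set(e["text"].lower().split())))
    return ranked[:top_k]

def synthesise(facts: str, entries: list[dict]) -> str:
    """Stage 3: grounded analysis, constrained to the retrieved entries."""
    context = "\n---\n".join(e["text"] for e in entries)
    r = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user",
                   "content": f"Using ONLY this context:\n{context}\n\nAssess:\n{facts}"}],
    )
    return r.choices[0].message.content
```

Keeping retrieval as a plain deterministic function between the two LLM round trips is what lets the pipeline constrain outputs at each stage rather than trusting a single end-to-end prompt.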
Review 1
The problem framing in your introduction is strong; the asymmetry between outbreak reports and analyst capacity is a real chokepoint. The architectural argument in your discussion, that structural fidelity has to be enforced rather than prompted for, is a good contribution worth exploring and developing further. I could see a lot of value in a tool like this existing. The idea is sound, but the current implementation does have limitations. To your credit, you identify and flag many of them in the limitations section. The RAG search is tied to keywords and the current corpus is small, which limits the usefulness, but it does work as a demonstration for a hackathon. However, the full results are not provided anywhere and only one example is given. Tying one of the categories to geography could be limiting as well. While some viral and bacterial families which can cause a PHEIC are geographically limited, others are not; influenza immediately springs to mind. Geographic anchoring also somewhat limits the effectiveness for bioweapons or engineered pandemics, which could cause GCBRs and could, by design, first be unleashed anywhere. One could consider keeping the fields but weighting them differently in the scoring in some manner. The scenario library described in the report is author-crafted to match exactly what the tool needs, was used during development, and had no held-out set. Section 4 then reflects expected behavior on constructed examples, rather than evidence the system works on inputs it hasn't seen or that might be messier. The line that uncertainty flags "correlated with the cases where ground-truth assessment was itself ambiguous to the human authors" is somewhat confusing, and essentially amounts to the authors agreeing with themselves. The planning notes/doc on GitHub suggested running 10–15 archived ProMED alerts with documented outcomes as an eval. That would have been great and would have given you real evidence for the report. Even a smaller number, like five archived alerts with retrospective WHO classifications, would substantially strengthen this section and serve as stronger proof that BioWatch Brief is functioning as intended. The linked codebase is somewhat hard to follow compared to the submitted final write-up. It looks like there are two distinct pipelines in it, Pipeline.py and main.py? The first has the FastAPI backend using OpenAI's gpt-4.1-mini and seems to match the staged pipeline described in your Methods section, with the keyword-scored retrieval over the fixed corpus, the grounded synthesis, and the two LLM round trips. It is also named "main." I judged only on this one, as it seems to be the submission. The second uses the Anthropic SDK and Claude Opus 4.5 with tool use; its output is quite different, but it has no RAG/keyword-based corpus retrieval. I am not scoring on this pipeline, as the paper seems to indicate the first, but the outputs here seem richer and more informative. So just a flag: it could be good to build toward these more in-depth and informative outputs in the submitted, RAG + corpus-grounded tool. You also have code for a live signal assessment in the first/main pipeline, but that does not seem to be implemented with any way to actually get a live signal, though the planning document has some ideas in it. Not judging on that, just flagging this as a good area to expand if you continue with this tool. A minor quibble, but the write-up likely used some resources that are not cited.
The mention of BlueDot and Metabiota sent me googling, as I was curious whether said BlueDot was related to BlueDot Impact at all, and the first result is this paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC7378493/. I feel this should have been cited in the report, which then makes me wonder if there are other missing citations. I am not saying there are; it just raises the spectre of it. You do also disclose the use of LLMs for brainstorming, but the planning MD appears to be generated by Claude / Claude Code, and while I am sure it came after a lot of back and forth with the user(s), it is more a design spec than just a brainstorming doc. Which I think is fine, that's the way things work now, but it maybe could have been disclosed differently. This is such a new area that it is hard to say, so I am not scoring against this at all, just mentioning it.
This project is an attempt to build a pandemic risk-monitoring platform for public-health experts in policy or fieldwork, with four stages: Alert (connecting verified health-professional signals), Enrich (a scalable multi-turn search agent gathering external context), Evaluate (a risk-assessment model, doubleml, scoring the risk and confidence of the alert and enriched data), and Recommend (a grounded AI agent recommending actionable response steps to public-health teams). The aim is faster response, moving from raw signals to action recommendations based on explainable models that policymakers are familiar with. The project ended up as scaffolding and is deployable with the pipeline runnable; however, the model experiments were not rigorously run or verified. It serves as an entry point for continuing transparent, open development of explainable, live, grounded, and actionable alerting and response for pandemic risk, where speed and transparency matter.
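To make the four stages concrete, here is a minimal sketch under stated assumptions: the signal schema, column names, and agent interfaces are hypothetical, and the doubleml calls follow that package's documented partially-linear-regression interface rather than the project's actual experiment code.

```python
import doubleml as dml
from sklearn.ensemble import RandomForestRegressor

def alert(raw_signal: dict) -> dict:
    """Stage 1: accept a verified health-professional signal (schema assumed)."""
    return {"text": raw_signal["text"], "source": raw_signal["reporter_id"]}

def enrich(record: dict, search_agent) -> dict:
    """Stage 2: hypothetical multi-turn search agent gathers external context."""
    record["context"] = search_agent.run(record["text"])
    return record

def evaluate(df):
    """Stage 3: fit a DoubleML partially linear model to score risk.

    The column names ('risk', 'signal_strength') are illustrative assumptions.
    """
    data = dml.DoubleMLData(df, y_col="risk", d_cols="signal_strength")
    model = dml.DoubleMLPLR(data, ml_l=RandomForestRegressor(),
                            ml_m=RandomForestRegressor())
    return model.fit()

def recommend(record: dict, llm) -> str:
    """Stage 4: grounded agent turns the scored record into response steps."""
    return llm(f"Recommend response actions, citing only: {record}")
```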
Review 1
An open system is an interesting idea, but who could cover the costs? That's the main reason the best tools like this so far are not open. For the layer that draws from past decisions: many past decisions in handling emergent outbreaks were not good. How do you account for that? It would vary by locality, but for many places that lack other decision-support tools, something rooted in the WHO like this could be relatively useful and trusted.
Review 2
This project seems like an ambitious undertaking to me. I want to credit the author for presenting it as such and noting that there was a limited amount they could do in a weekend. Assuming that AI agents are "working well", I do think a system like this would provide real value. The project at the moment is mostly scaffolding. I wonder if the path to building trust in a process like this would be to start with a smaller chunk of the problem which can be validated.
Review 3
An interesting start on an AI-enhanced pandemic response pipeline, but lacking key implementation details. The need for a new end-to-end solution, rather than supplementing existing tools, is hinted at but not fully explained. More development time is necessary to see results.
Deadly pathogens synthesized in home labs with the help of AI have become a rising national-security concern. With this problem in mind, we aim to create a portable, hand-held spectrometer that can detect multiple types of peptides and, with the help of AI, determine whether the detected chemicals are potential precursors for deadly agents. We want to create a hardware + software system with embedded AI. We aim to promote our product to government agencies, the FDA, high-traffic venues, airport security, etc. This would reduce the chances of people smuggling dangerous chemicals (designed using open-weight models) into or out of the country/state for experiments.
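For context on what the embedded-AI step would minimally involve, here is an illustrative sketch, not the team's implementation, of matching a measured Raman spectrum against a reference library by cosine similarity; the library contents and threshold are hypothetical.

```python
import numpy as np

def classify_spectrum(spectrum: np.ndarray,
                      library: dict[str, np.ndarray],
                      threshold: float = 0.95) -> str | None:
    """Match a baseline-corrected Raman spectrum against a reference library.

    Returns the best-matching compound name if its cosine similarity clears
    the (hypothetical) threshold, otherwise None.
    """
    spectrum = spectrum / np.linalg.norm(spectrum)
    best_name, best_score = None, 0.0
    for name, ref in library.items():
        score = float(spectrum @ (ref / np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```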
Review 1
There's no working device and no AI here. The "results" in section 4 are dashboard screenshots and three pseudoruns with hand-typed sample outputs. Fig. 4 and 5 are SolidWorks renders of an empty case. The breadboard photo is an Arduino with a PN532 RFID/NFC module sitting on it; a Raman spectrometer is a laser, a spectrograph, and a CCD, none of which are present. I couldn't find a model, a training set, a spectral library, or any inference code in the repo links. The submission is a UI mockup, a 3D enclosure render, and a pitch. The gap between "we have a UI and a CAD file" and "we have a portable detector" is several years of optics, embedded firmware, calibration, and ML work. The paper doesn't acknowledge that gap. Either build the smallest real thing (record one Raman spectrum on a benchtop instrument, classify it, show the result), or reframe honestly as a concept and UX study. The UI work is fine for a hackathon weekend. Login, dashboard, analysis tabs, peptide database with 300 seeded rows. That's the actual deliverable.
Review 2
This is an interesting idea, but there's no apparent evidence for the two most important claims: (a) That you can detect *peptides* instead of *organisms*. [If the two inline-cited papers make this claim, you need to say exactly *where*. A quick skim of both papers did not make this obvious, and neither paper even mentions the word "peptide."] (b) The *Arduino* is cheap, but Raman spectrometers are not; they're usually multiple tens of thousands of dollars. For the intended users (LEO, interdiction, etc.) this probably doesn't matter, but what you're basically saying is that you can hook into existing devices with a cheap add-on -- although existing devices already have built-in databases with thousands of spectra in them, so *if* you can detect peptide sequences directly, couldn't you just add this to the built-in database in an existing unit? Other issues: (c) You make lots of inline references in the text which aren't in the References section. Worse, you don't actually cite any of the items in the References section in the body of the paper. So I have no idea, for example, why you felt it necessary to cite Astral Codex twice in a paper about Raman spectroscopy. (d) What's with only first names on a paper? If submissions are blinded, no names appear. Otherwise, it's typically assumed that full names appear on a paper, because authorship reputation matters; one needs to be able to find other work, retractions, citations, etc.
BioShield AI is an advanced biosecurity screening system that detects dangerous DNA and protein sequences by analyzing their function rather than just sequence similarity. It uses protein language models, 3D structure prediction, and pathway analysis to catch novel AI‑designed toxins that evade traditional tools. The system includes a five‑station pipeline (functional fingerprinting, domain risk checks, pathway assembly detection, risk scoring with explainability, and an adversarial self‑hardening loop). Compared to existing tools, BioShield AI uniquely identifies functional analogs, explains why sequences are flagged, and continuously improves against evasion attempts.
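A minimal sketch of what Station 1 (functional fingerprinting) could look like using ESM-2 via the fair-esm package; the known-toxin embedding library and the similarity-based score are assumptions for illustration, not the authors' pipeline.

```python
import torch
import esm

# Load a small ESM-2 model (8M params, 6 layers) for quick prototyping.
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def functional_fingerprint(seq: str) -> torch.Tensor:
    """Mean-pooled ESM-2 embedding as a crude functional fingerprint."""
    _, _, tokens = batch_converter([("query", seq)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[6])
    # Drop the BOS/EOS tokens before pooling over residue positions.
    return out["representations"][6][0, 1:-1].mean(0)

def risk_score(seq: str, toxin_embeddings: torch.Tensor) -> float:
    """Max cosine similarity to a (hypothetical) library of known-toxin embeddings."""
    q = functional_fingerprint(seq)
    sims = torch.nn.functional.cosine_similarity(q[None, :], toxin_embeddings)
    return float(sims.max())
```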
Review 1
Visually clean, nice, no implementation/data.
Review 2
The proposal correctly identifies a vulnerability in sequence-based approaches to screening given advances in AI-generated design. However, the proposed solution has already been studied extensively in existing publications, and it still has critical issues that need to be addressed before deployment into existing screening systems:
1. Generative protein design tools that require capable GPUs for inference are either prohibitively slow or expensive, limiting their application in large-scale synthesis screening.
2. Existing generative protein design tools are trained on template coding sequences free from regulatory elements, genetic engineering techniques (e.g., 2A peptides, guide RNAs), multiple ORFs, or fused proteins, among many other complications in real-world synthetic sequences. Generative protein design tools struggle to characterize these sequences without extensive pre-processing.
Review 3
This is a proposal, not a project. The only hint of any actual development or evaluation work comes near the end, where the authors briefly state "Phase 1 — Hackathon MVP: Stations 1-2 functional. ESM-2 embeddings + Pfam scanning. Basic risk scoring. Proof-of-concept evasion detection on synthetic test set." However, they provide absolutely no detail or results. As a proposal, it's essentially a laundry list of nearly every idea and approach anyone has ever suggested, strung together as a "pipeline" without addressing whether each element would actually work as advertised and how it would be accomplished. Swept under the rug is any acknowledgement that many of the one-sentence capabilities described in the proposal are hard unsolved problems that will take considerable work to solve, if they can be. Many of the assertions are insufficiently explained for this reviewer to interpret. Key details are omitted. It is unclear whether the authors appreciate the distinction between nucleotide-level obfuscation that maintains the same amino acid sequence (easy to do, easy to detect), amino-acid-level obfuscation that maintains the same protein structure (currently challenging to do and very costly and slow to verify), and structure-level obfuscation that maintains overall protein functionality (ditto). The logic of their claims and design seems to depend on conflating these. The threat scenario is dramatically exaggerated, without acknowledging the difficulty of computationally generating alternative versions of proteins (at the AA or structure level) with desired functionality.
The AI Biosecurity Compliance Auditor operationalises biosecurity through a three-layer engine: automated policy mapping, real-time audits, and consequence models. The platform ingests complex regulatory frameworks and outputs structured lab protocols. Our real-time compliance layer uses computer vision to continuously monitor safety practices, while a custom heuristic model generates live risk scores. Designed for high-consequence environments, this hackathon prototype provides a single, auditable workflow for the secure advancement of modern biotechnology.
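To illustrate what a "custom heuristic model" for live risk scores might amount to, here is a toy sketch; the violation classes and weights are entirely hypothetical, and a deployed system would calibrate them against incident data rather than hand-set them.

```python
from dataclasses import dataclass

# Hypothetical weights for violations a computer-vision layer might flag.
VIOLATION_WEIGHTS = {
    "no_gloves": 0.30,
    "open_sash": 0.25,
    "unattended_culture": 0.45,
}

@dataclass
class Observation:
    violation: str
    confidence: float  # detector confidence in [0, 1]

def live_risk_score(observations: list[Observation]) -> float:
    """Toy heuristic: confidence-weighted sum of violation weights, capped at 1."""
    score = sum(VIOLATION_WEIGHTS.get(o.violation, 0.0) * o.confidence
                for o in observations)
    return min(score, 1.0)
```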
Review 1
This project is one of the more innovative ideas I've seen; it feels relevant to actual lab experience with biosafety protocols. I'm not sure how realistic/economical it is to have camera monitoring in most typical labs, but it might make sense in BSL-3+ labs. I'm not familiar enough with the level of detail in these SOPs to know if it makes sense to translate with AI, but the computer vision aspect is interesting. I wish the report showed more depth/background on these questions, perhaps with specific examples of biosafety policies and how they would be enforced with this system. That's why I'm giving 2s for the Execution and Presentation.
Review 2
- You hit a lot of topics in your proposal and it's not quite clear how it fits together. I'd recommend focusing on one of the topics and then fleshing that out. It also looks like you didn't have time for a proper writeup, which is fine given the Hackathon timeframe, but it made it tough to understand the core proposal and contribution you make.
Review 3
This sounds like a pretty cool idea, but it needs a lot more development. I'd like to see what you learned by building it, what the limitations were, data on how well it performed, etc.
EPICURUS AI is a system that began as a disease outbreak forecasting tool using historical case data and was expanded during the AIxBio Hackathon with bidirectional pathogen prediction. It bridges three domains: epidemiology, molecular biology, and vector ecology. The system works both ways: epidemiological parameters predict molecular traits, and molecular features predict epidemiological outcomes. This means we can estimate R₀, incubation period, and case fatality rate from genome features alone — critical for assessing AI-generated pathogens before they circulate. Built as a Streamlit prototype with switchable ML models, trained on curated data from WHO GLASS, WHO BPPL, SeqScreen, and peer-reviewed arbovirus datasets. Designed as a proactive defense tool against engineered biological threats.
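A minimal sketch of what the bidirectional setup could look like as plain tabular regression; the column names and dataset file are hypothetical stand-ins for the curated features described above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical curated table: one row per pathogen, molecular features
# alongside epidemiological outcomes.
df = pd.read_csv("pathogens.csv")
molecular = ["genome_length", "gc_content", "receptor_affinity"]
epi = ["r0", "incubation_days", "cfr"]

def fit_direction(df: pd.DataFrame, features: list[str], targets: list[str]):
    """Fit one regressor per target; 'bidirectional' just swaps the roles."""
    return {t: RandomForestRegressor(n_estimators=200).fit(df[features], df[t])
            for t in targets}

mol_to_epi = fit_direction(df, molecular, epi)  # genome features -> R0 etc.
epi_to_mol = fit_direction(df, epi, molecular)  # and the reverse
```

Evaluating both directions on the same held-out split would show whether either mapping generalizes beyond the small training set.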
Review 1
Suggest substantial trimming, adding a dual-use section, and adding a validation/code section with numbers. There are a lot of claims made that aren't supported by any evidence/validation.
Review 2
Can't say the presentation didn't have style, but in the future I would strongly encourage you to move toward briefer write-ups for hackathons! It's much easier for judges to evaluate the technical aspects of the methodology that way. The actual hackathon contribution here doesn't arrive until page 23! Basically everything up to that point is scene-setting, throat-clearing, or stylized preamble that isn't useful for this venue, where the only people likely to read this are very up-to-context on AIxBio. When we do get to the meat of the proposal, it appears to be more of a design sketch for what you're going to build than what you did build. There are no results or screenshots presented, and no details on the methods that I can really evaluate. Instead of choosing an algorithm and motivating it, the project gets around the problem of carefully choosing a model by just making it a user drop-down. What hypothetical user in a biosecurity decision-making context is going to use a drop-down to decide how to model the data coming in? The idea of mapping molecular features to epidemiological features with the pathogen dataset does make sense, and I'd like to see this kind of model fully fleshed out. I'm not sure that running the training the other way, with the epidemiological features as predictors, really makes it bidirectional, but this is still an interesting idea. I would hesitate to anchor too much on results that popped out of this model, though, without a more thorough treatment of confounds (R₀ does not just depend on molecular properties, for instance, and will vary a lot with population properties that this would not capture). It's also not clear that small models trained on 37 pathogens are going to provide any useful signal for novel pathogens, especially when the relevant features are just tabular. Predicting certain epidemiological features using molecular properties in tandem with info about host, population, etc. does seem possible and useful, but using small tabular datasets looks unlikely to me to uncover many non-trivial generalizable patterns. I think there's a version of this that would still be cool as a hackathon project, and potentially as a direction to explore further, but the write-up mostly buries the details needed to evaluate it, or does not provide them.