Chasing Shadows:

Pitfalls in LLM Security Research

Jonathan Evertz*, Niklas Risse*, Nicolai Neuer, Andreas Müller, Philipp Normann, Gaetano Sapia, Srishti Gupta, David Pape, Soumya Shaw, Devansh Srivastav, Christian Wressnegger, Erwin Quiring, Thorsten Eisenhofer, Daniel Arp, and Lea Schönherr

* These authors contributed equally to this work.

Network and Distributed System Security Symposium (NDSS), 2026

Large language models (LLMs) are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has identified common pitfalls in traditional machine learning research, but these studies predate the advent of LLMs. In this paper, we identify nine common pitfalls that can compromise the validity of research involving LLMs. These pitfalls span the entire LLM pipeline, from data collection, pre-training, and fine-tuning to prompting and evaluation. We assess the prevalence of these pitfalls across all 72 peer-reviewed papers published at leading Security and Software Engineering venues between 2023 and 2024. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet, only 15.7% of the pitfalls present in these papers were explicitly discussed, suggesting that the majority remain unnoticed. To understand their practical impact, we further conduct four empirical case studies showing how individual pitfalls can mislead evaluation, inflate performance, or impair reproducibility. Based on our findings, we offer actionable guidelines to support the community in future studies.

How to cite

@inproceedings{evertz-26-chasing,
    title     = {Chasing Shadows: Pitfalls in LLM Security Research},
    author    = {Evertz, Jonathan and Risse, Niklas and Neuer, Nicolai and M{\"u}ller, Andreas and Normann, Philipp and Sapia, Gaetano and Gupta, Srishti and Pape, David and Shaw, Soumya and Srivastav, Devansh and Wressnegger, Christian and Quiring, Erwin and Eisenhofer, Thorsten and Arp, Daniel and Sch{\"o}nherr, Lea},
    booktitle = {Network and Distributed System Security Symposium (NDSS)},
    year      = {2026}
}

Overview

Typical LLM pipeline as considered in the literature, divided into its key stages. Each stage can introduce LLM-specific pitfalls that can distort evaluation, inflate reported performance, or undermine reproducibility. Colors indicate the prevalence of each pitfall, based on the results of our prevalence study.

Pipeline overview for stages and mapped pitfalls

Stage 1 — Data Collection and Labeling

LLM pre-training relies on large-scale Internet scraping and increasingly on LLM-as-a-judge for labeling, creating favorable conditions for Data Poisoning (P1) and LLM-generated Label Inaccuracy (P2).


Stage 2 — Pre-Training

Opaque pre-training datasets make overlap between training and evaluation data hard to rule out, raising the risk of Data Leakage (P3).


Stage 3 — Fine-tuning and Alignment

Fine-tuning models on synthetic LLM-generated data can degrade diversity, and shortcut learning can harm generalization, leading to Model Collapse (P4) and Spurious Correlations (P5).


Stage 4 — Prompt Engineering

Fixed context limits and model-specific prompt preferences cause truncation and sensitivity to prompt formats, leading to Context Truncation (P6) and Prompt Sensitivity (P7).


Stage 5 — Evaluation

Drawing general conclusions from a limited set of models invites over-broad claims (Surrogate Fallacy, P8), and using models without exact identifiers and access descriptions leads to poor reproducibility (Model Ambiguity, P9).


Guidelines & Recommendations

Suggestions for improvements are welcome. Please open an issue or pull request if you would like to propose changes.

Current version: 1.0.0

P1 — Data Poisoning

Description. A dataset used to train a model is collected from the Internet without strategies to verify the integrity and safety of the data.

Recommendation. Researchers should first assess whether data poisoning is relevant to their task and data modality. If data poisoning is both relevant and plausible — for example, when relying on proprietary models or large-scale scraped datasets where training data is not transparent — the risk should be explicitly acknowledged. While verifying the absence of poisoning would be ideal, such guarantees are often unrealistic at scale.

P2 — Label Inaccuracy

Description. LLMs are used to annotate data with certain labels via classification or LLM-as-a-judge procedures without further validation of label correctness.

Recommendation. Disclose when labels or judgments come from LLMs. The ideal mitigation is full manual verification. If scale makes that infeasible, conduct a manual audit of a statistically meaningful subset with multiple annotators, reporting inter-annotator agreement and confidence intervals. Less stringent safeguards may be acceptable when human-created labels are used for evaluation and LLM-generated labels appear only in pre-training or fine-tuning.
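As a lightweight starting point, inter-annotator agreement on an audited subset can be summarized with Cohen's kappa. The sketch below assumes two annotators and illustrative labels; real audits should use a statistically meaningful sample size and also report confidence intervals.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative re-labeling of an audited subset of LLM-generated labels.
a = ["vuln", "safe", "vuln", "safe", "vuln", "safe"]
b = ["vuln", "safe", "vuln", "vuln", "vuln", "safe"]
print(round(cohens_kappa(a, b), 3))  # -> 0.667
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance, which would call the LLM-generated labels into question.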

P3 — Data Leakage

Description. An LLM is trained or fine-tuned on data that would normally not be available in practice, or the training data is contaminated with potential test data.

Recommendation. Prefer models with known training sources and de-duplicate against evaluation sets where possible. For proprietary models, identify the training cutoff date and determine whether evaluation data (especially labels or answers) was publicly accessible beforehand. When exclusion cannot be ensured, probe for memorization (e.g., completion-style prompting, pre- vs. post-release comparisons) and discuss potential effects.

P4 — Model Collapse

Description. An LLM is trained on data that is generated by other language models, risking an amplification of bias and degradation of data quality.

Recommendation. Clearly report the proportion of synthetic vs. real data and analyze systematic differences. Treat iterative or chained training with extra care, monitoring for compounding effects and verifying performance on fresh human-origin test sets.

P5 — Spurious Correlations

Description. The LLM adapts to unrelated artifacts of the problem space instead of generalizing to the actual task.

Recommendation. Perform robustness testing via controlled perturbations to suspected features. Use attribution or interpretability methods to inspect what the model focuses on. Include ablations or counterfactuals to test whether performance depends on meaningful evidence rather than shortcuts.
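A minimal perturbation test might look like the following sketch. Here `predict` is a toy stand-in for the model under evaluation, and the suspected shortcut is a comment token; both are illustrative assumptions, not part of any real study.

```python
import re

# Toy stand-in for the model under test; a real study would query the LLM.
# This stub (unknowingly) keys off a dataset artifact: a comment marker.
def predict(code):
    return "vulnerable" if "/* fixme */" in code else "safe"

def strip_comments(code):
    # Semantics-preserving perturbation: remove C block comments.
    return re.sub(r"/\*.*?\*/", "", code, flags=re.S)

sample = "int f(int n) { /* fixme */ return n + 1; }"
before, after = predict(sample), predict(strip_comments(sample))
print(before, after)  # -> vulnerable safe: the prediction flips
```

If a semantics-preserving perturbation flips the prediction, the model is relying on the artifact rather than on meaningful evidence.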

P6 — Context Truncation

Description. The LLM's context window is too small for its intended task, so the input must be truncated.

Recommendation. Clearly state the model's context limit. Tokenize representative full inputs to check for overflow. If a switch to a larger context window is not feasible, report truncation frequency and analyze performance vs. input length.
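A simple overflow check could be sketched as follows. The whitespace-based `count_tokens` is only a rough proxy introduced for illustration; in practice it should be replaced with the target model's actual tokenizer, and the inputs and limits below are placeholders.

```python
def count_tokens(text):
    # Rough proxy: whitespace tokens. In practice, use the target model's
    # real tokenizer to get accurate counts.
    return len(text.split())

def truncation_report(inputs, context_limit, reserved_for_output=256):
    """Summarize how many inputs would overflow the usable context budget."""
    budget = context_limit - reserved_for_output
    lengths = [count_tokens(x) for x in inputs]
    return {
        "n": len(inputs),
        "budget": budget,
        "max_len": max(lengths),
        "overflow_rate": sum(l > budget for l in lengths) / len(inputs),
    }

docs = ["short input", "word " * 5000]  # toy stand-ins for real task inputs
print(truncation_report(docs, context_limit=4096))
```

Reporting `overflow_rate` alongside results makes it clear how often truncation could have affected the evaluation.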

P7 — Prompt Sensitivity

Description. The prompt used to instruct the language model is fixed across all models and experiments, or is not expressive enough for the given task. This allows prompt-dependent fluctuations to distort evaluation results.

Recommendation. Ideally, optimize prompts per model-task pair. If full optimization is infeasible, perform post-hoc prompt variation experiments to measure stability. Document prompt design decisions and justify fixed-prompt setups where used.
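A post-hoc stability check might be sketched as follows. The prompt templates and accuracy values are illustrative placeholders, and `evaluate` stands in for a real benchmark run of the model with a given template.

```python
import statistics

# Placeholder: in a real study, evaluate() would run the model on a held-out
# benchmark with the given prompt template and return task accuracy.
def evaluate(prompt_template):
    stub_scores = {
        "Classify the following code as vulnerable or safe:\n{input}": 0.71,
        "Is this code vulnerable? Answer yes or no.\n{input}": 0.64,
        "You are a security expert. Label this code:\n{input}": 0.69,
    }
    return stub_scores[prompt_template]

variants = [
    "Classify the following code as vulnerable or safe:\n{input}",
    "Is this code vulnerable? Answer yes or no.\n{input}",
    "You are a security expert. Label this code:\n{input}",
]
scores = [evaluate(p) for p in variants]
print(f"mean={statistics.mean(scores):.3f} spread={max(scores) - min(scores):.3f}")
```

A large spread across semantically equivalent prompts signals that reported numbers depend on the prompt choice and should not be attributed to the model alone.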

P8 — Surrogate Fallacy

Description. Findings from specific LLMs are inappropriately generalized to other, often larger and more capable models, or even to entire classes of language models, without sufficient empirical validation.

Recommendation. Scope claims to the specific evaluated models using precise identifiers. Broader claims require a diverse, representative evaluation set and explicit caveats regarding limits of generalization.

P9 — Model Ambiguity

Description. The model details provided are insufficient for precise identification, preventing reproducibility (e.g., missing model ID, snapshot, commit ID, or quantization level).

Recommendation. Provide precise model identifiers, access method (API vs. web), access dates, and any fine-tuning steps. For open models, include repository and commit hash. Explicitly acknowledge unreproducible components.
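One way to make this reporting systematic is to keep a small provenance record per evaluated model. The sketch below uses a plain dataclass; all field values are illustrative placeholders, not a real experimental setup.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ModelRecord:
    """Provenance record for one model used in an experiment.
    All values below are illustrative placeholders."""
    model_id: str                # exact identifier, incl. snapshot/version
    access: str                  # "api", "web", or "local"
    access_date: str             # ISO date the experiments were run
    revision: Optional[str]      # commit hash for open-weight models
    quantization: Optional[str]  # e.g., "4-bit", or None if full precision
    fine_tuned: bool             # whether any fine-tuning was applied

record = ModelRecord(
    model_id="example-model-2024-01-01",
    access="api",
    access_date="2025-06-01",
    revision=None,
    quantization=None,
    fine_tuned=False,
)
print(json.dumps(asdict(record), indent=2))
```

Emitting such records alongside results (e.g., in an artifact repository) makes the exact experimental configuration recoverable even after hosted models change.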

Disclaimer: These guidelines summarize common pitfalls observed in recent LLM security research. They are intended as practical support, not strict rules. Applicability may vary depending on task, dataset, and model configuration, and careful judgment is required on a case-by-case basis.