Chasing Shadows:

Pitfalls in LLM Security Research

Jonathan Evertz*, Niklas Risse*, Nicolai Neuer, Andreas Müller, Philipp Normann, Gaetano Sapia, Srishti Gupta, David Pape, Soumya Shaw, Devansh Srivastav, Christian Wressnegger, Erwin Quiring, Thorsten Eisenhofer, Daniel Arp, and Lea Schönherr

* These authors contributed equally to this work.

Network and Distributed System Security Symposium (NDSS), 2026

Large language models (LLMs) are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has identified common pitfalls in traditional machine learning research, but these studies predate the advent of LLMs. In this paper, we identify nine common pitfalls that can compromise the validity of research involving LLMs. These pitfalls span the entire LLM pipeline, from data collection, pre-training, and fine-tuning to prompting and evaluation. We assess the prevalence of these pitfalls across all 72 peer-reviewed papers published at leading Security and Software Engineering venues between 2023 and 2024. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet, only 15.7% of the pitfalls present in these papers were explicitly discussed, suggesting that the majority remain unnoticed. To understand their practical impact, we further conduct four empirical case studies showing how individual pitfalls can mislead evaluation, inflate performance, or impair reproducibility. Based on our findings, we offer actionable guidelines to support the community in future studies.

How to cite

@inproceedings{evertz-26-chasing,
    title     = {Chasing Shadows: Pitfalls in LLM Security Research},
    author    = {Evertz, Jonathan and Risse, Niklas and Neuer, Nicolai and M{\"u}ller, Andreas and Normann, Philipp and Sapia, Gaetano and Gupta, Srishti and Pape, David and Shaw, Soumya and Srivastav, Devansh and Wressnegger, Christian and Quiring, Erwin and Eisenhofer, Thorsten and Arp, Daniel and Sch{\"o}nherr, Lea},
    booktitle = {Network and Distributed System Security Symposium (NDSS)},
    year      = {2026}
}

Overview

Typical LLM pipeline as considered in the literature, divided into its key stages. Each stage can introduce LLM-specific pitfalls that can distort evaluation, inflate reported performance, or undermine reproducibility. Colors indicate the prevalence of each pitfall, based on the results of our prevalence study.

Pipeline overview for stages and mapped pitfalls

Stage 1 — Data Collection and Labeling

LLM pre-training relies on large-scale Internet scraping and increasingly on LLM-as-a-judge for labeling, creating favorable conditions for Data Poisoning (P1) and LLM-generated Label Inaccuracy (P2).


Stage 2 — Pre-Training

Opaque pre-training datasets make overlap between training and evaluation data hard to rule out, raising the risk of Data Leakage (P3).


Stage 3 — Fine-tuning and Alignment

Fine-tuning models on synthetic LLM-generated data can degrade diversity, and shortcut learning can harm generalization, leading to Model Collapse (P4) and Spurious Correlations (P5).


Stage 4 — Prompt Engineering

Fixed context limits and model-specific prompt preferences cause truncation and sensitivity to prompt formats, leading to Context Truncation (P6) and Prompt Sensitivity (P7).


Stage 5 — Evaluation

Drawing general conclusions from a limited set of models invites over-broad claims (Surrogate Fallacy, P8), and using models without exact identifiers and access descriptions leads to poor reproducibility (Model Ambiguity, P9).


Guidelines & Recommendations

Suggestions for improvements are welcome. Please open an issue or pull request if you would like to propose changes.

Current version: 1.0.0

P1 — Data Poisoning

Description. A dataset used to train a model is collected from the Internet without strategies to verify the integrity and safety of the data.

Recommendation. Researchers should first assess whether data poisoning is relevant to their task and data modality. If data poisoning is both relevant and plausible — for example, when relying on proprietary models or large-scale scraped datasets where training data is not transparent — the risk should be explicitly acknowledged. While verifying the absence of poisoning would be ideal, such guarantees are often unrealistic at scale.

P2 — Label Inaccuracy

Description. LLMs are used to annotate data with certain labels via classification or LLM-as-a-judge procedures without further validation of label correctness.

Recommendation. Disclose when labels or judgments come from LLMs. The ideal mitigation is full manual verification. If scale makes that infeasible, conduct a manual audit of a statistically meaningful subset with multiple annotators, reporting inter-annotator agreement and confidence intervals. Less stringent safeguards may be acceptable when human-created labels are used for evaluation and LLM-generated labels appear only in pre-training or fine-tuning.
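As a lightweight starting point, inter-annotator agreement on an audited subset can be summarized with Cohen's kappa. The sketch below assumes two annotators and illustrative labels; real audits should use a statistically meaningful sample size and also report confidence intervals.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative re-labeling of an audited subset of LLM-generated labels.
a = ["vuln", "safe", "vuln", "safe", "vuln", "safe"]
b = ["vuln", "safe", "vuln", "vuln", "vuln", "safe"]
print(round(cohens_kappa(a, b), 3))  # -> 0.667
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance, which would call the LLM-generated labels into question.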

P3 — Data Leakage

Description. An LLM is trained or fine-tuned on data that would normally not be available in practice, or the training data is contaminated with potential test data.

Recommendation. Prefer models with known training sources and de-duplicate against evaluation sets where possible. For proprietary models, identify the training cutoff date and determine whether evaluation data (especially labels or answers) was publicly accessible beforehand. When exclusion cannot be ensured, probe for memorization (e.g., completion-style prompting, pre- vs. post-release comparisons) and discuss potential effects.

P4 — Model Collapse

Description. An LLM is trained on data that is generated by other language models, risking an amplification of bias and degradation of data quality.

Recommendation. Clearly report the proportion of synthetic vs. real data and analyze systematic differences. Treat iterative or chained training with extra care, monitoring for compounding effects and verifying performance on fresh human-origin test sets.

P5 — Spurious Correlations

Description. The LLM adapts to unrelated artifacts of the problem space instead of generalizing to the actual task.

Recommendation. Perform robustness testing via controlled perturbations to suspected features. Use attribution or interpretability methods to inspect what the model focuses on. Include ablations or counterfactuals to test whether performance depends on meaningful evidence rather than shortcuts.
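A minimal perturbation test might look like the following sketch. Here `predict` is a toy stand-in for the model under evaluation, and the suspected shortcut is a comment token; both are illustrative assumptions, not part of any real study.

```python
import re

# Toy stand-in for the model under test; a real study would query the LLM.
# This stub (unknowingly) keys off a dataset artifact: a comment marker.
def predict(code):
    return "vulnerable" if "/* fixme */" in code else "safe"

def strip_comments(code):
    # Semantics-preserving perturbation: remove C block comments.
    return re.sub(r"/\*.*?\*/", "", code, flags=re.S)

sample = "int f(int n) { /* fixme */ return n + 1; }"
before, after = predict(sample), predict(strip_comments(sample))
print(before, after)  # -> vulnerable safe: the prediction flips
```

If a semantics-preserving perturbation flips the prediction, the model is relying on the artifact rather than on meaningful evidence.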

P6 — Context Truncation

Description. The LLM's context window is too small for its intended task, so the input must be truncated.

Recommendation. Clearly state the model's context limit. Tokenize representative full inputs to check for overflow. If a switch to a larger context window is not feasible, report truncation frequency and analyze performance vs. input length.
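A simple overflow check could be sketched as follows. The whitespace-based `count_tokens` is only a rough proxy introduced for illustration; in practice it should be replaced with the target model's actual tokenizer, and the inputs and limits below are placeholders.

```python
def count_tokens(text):
    # Rough proxy: whitespace tokens. In practice, use the target model's
    # real tokenizer to get accurate counts.
    return len(text.split())

def truncation_report(inputs, context_limit, reserved_for_output=256):
    """Summarize how many inputs would overflow the usable context budget."""
    budget = context_limit - reserved_for_output
    lengths = [count_tokens(x) for x in inputs]
    return {
        "n": len(inputs),
        "budget": budget,
        "max_len": max(lengths),
        "overflow_rate": sum(l > budget for l in lengths) / len(inputs),
    }

docs = ["short input", "word " * 5000]  # toy stand-ins for real task inputs
print(truncation_report(docs, context_limit=4096))
```

Reporting `overflow_rate` alongside results makes it clear how often truncation could have affected the evaluation.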

P7 — Prompt Sensitivity

Description. The prompt used to instruct the language model is fixed across all models and experiments, or is not expressive enough for the given task. This allows prompt-dependent fluctuations to distort evaluation results.

Recommendation. Ideally, optimize prompts per model-task pair. If full optimization is infeasible, perform post-hoc prompt variation experiments to measure stability. Document prompt design decisions and justify fixed-prompt setups where used.
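A post-hoc stability check might be sketched as follows. The prompt templates and accuracy values are illustrative placeholders, and `evaluate` stands in for a real benchmark run of the model with a given template.

```python
import statistics

# Placeholder: in a real study, evaluate() would run the model on a held-out
# benchmark with the given prompt template and return task accuracy.
def evaluate(prompt_template):
    stub_scores = {
        "Classify the following code as vulnerable or safe:\n{input}": 0.71,
        "Is this code vulnerable? Answer yes or no.\n{input}": 0.64,
        "You are a security expert. Label this code:\n{input}": 0.69,
    }
    return stub_scores[prompt_template]

variants = [
    "Classify the following code as vulnerable or safe:\n{input}",
    "Is this code vulnerable? Answer yes or no.\n{input}",
    "You are a security expert. Label this code:\n{input}",
]
scores = [evaluate(p) for p in variants]
print(f"mean={statistics.mean(scores):.3f} spread={max(scores) - min(scores):.3f}")
```

A large spread across semantically equivalent prompts signals that reported numbers depend on the prompt choice and should not be attributed to the model alone.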

P8 — Surrogate Fallacy

Description. Findings from specific LLMs are inappropriately generalized to other, often larger and more capable models, or even to entire classes of language models, without sufficient empirical validation.

Recommendation. Scope claims to the specific evaluated models using precise identifiers. Broader claims require a diverse, representative evaluation set and explicit caveats regarding limits of generalization.

P9 — Model Ambiguity

Description. The model details provided are insufficient for precise identification, preventing reproducibility (e.g., missing model ID, snapshot, commit ID, or quantization level).

Recommendation. Provide precise model identifiers, access method (API vs. web), access dates, and any fine-tuning steps. For open models, include repository and commit hash. Explicitly acknowledge unreproducible components.
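One way to make this reporting systematic is to keep a small provenance record per evaluated model. The sketch below uses a plain dataclass; all field values are illustrative placeholders, not a real experimental setup.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ModelRecord:
    """Provenance record for one model used in an experiment.
    All values below are illustrative placeholders."""
    model_id: str                # exact identifier, incl. snapshot/version
    access: str                  # "api", "web", or "local"
    access_date: str             # ISO date the experiments were run
    revision: Optional[str]      # commit hash for open-weight models
    quantization: Optional[str]  # e.g., "4-bit", or None if full precision
    fine_tuned: bool             # whether any fine-tuning was applied

record = ModelRecord(
    model_id="example-model-2024-01-01",
    access="api",
    access_date="2025-06-01",
    revision=None,
    quantization=None,
    fine_tuned=False,
)
print(json.dumps(asdict(record), indent=2))
```

Emitting such records alongside results (e.g., in an artifact repository) makes the exact experimental configuration recoverable even after hosted models change.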

Disclaimer: These guidelines summarize common pitfalls observed in recent LLM security research. They are intended as practical support, not strict rules. Applicability may vary depending on task, dataset, and model configuration, and careful judgment is required on a case-by-case basis.