
AI-generated code could imperil the software supply chain with ‘package hallucinations’—nonexistent dependencies ripe for attacks

A new wave of AI-assisted code generation is revealing a troubling vulnerability: the very dependencies that code relies on can be hallucinated by large language models (LLMs). This creates fertile ground for supply-chain attacks that inject malicious packages into legitimate software, potentially stealing data, planting backdoors, or enabling other forms of intrusion. The latest research demonstrates that AI-produced software can carry a surprising number of non-existent or misidentified package references, underscoring the urgency for developers and security teams to rethink how they generate, validate, and incorporate code produced by AI systems.

Understanding package hallucination and its role in software supply chains

Package hallucination refers to the phenomenon where an AI model outputs references to software packages, libraries, or dependencies that do not actually exist in the real world. In the context of AI-generated code, this becomes a critical risk because developers often rely on these model-suggested dependencies to accelerate work, avoid reinventing the wheel, or ensure compatibility with broader ecosystems. When a model generates a dependency that does not exist, downstream software can end up attempting to fetch and install a non-existent package, leading to installation failures, brittle builds, or, worse, the silent inclusion of a malicious stand-in under the guise of a legitimate-sounding dependency. The latter is a classic example of dependency confusion, also known as package confusion, where an attacker publishes a counterfeit package with the same name as a legitimate one but with a newer version or altered contents, thereby tricking automated systems or human developers into trusting and adopting the malicious component.
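
To make the verification concrete, here is a minimal Python sketch of the kind of existence check a developer could run before trusting an AI-suggested dependency. It queries PyPI’s public JSON API (https://pypi.org/pypi/<name>/json); the candidate package names are hypothetical stand-ins for whatever an assistant proposes, and a successful lookup only proves that the name resolves to a published package, not that the package is trustworthy.

    # Minimal sketch: check whether an AI-suggested dependency exists on PyPI
    # before adding it to a project. The names in `suggested` are hypothetical.
    import urllib.error
    import urllib.request

    def exists_on_pypi(package_name: str) -> bool:
        """Return True if the package is published on the public PyPI index."""
        url = f"https://pypi.org/pypi/{package_name}/json"
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.status == 200
        except urllib.error.HTTPError:
            return False  # a 404 here means the package does not exist

    suggested = ["requests", "totally-made-up-helper-lib"]  # hypothetical AI output
    for name in suggested:
        verdict = "exists" if exists_on_pypi(name) else "NOT FOUND: possible hallucination"
        print(f"{name}: {verdict}")

Note that a hit can also mean an attacker has already registered the hallucinated name, which is exactly the attack pattern discussed below, so an existence check is a floor, not a ceiling, for diligence.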

This risk is not merely theoretical. The study behind these findings used 16 widely used LLMs to generate a massive corpus of code samples and examined the reliability of their dependency references. In total, the researchers produced 576,000 code samples by running 30 distinct tests, 16 in Python and 14 in JavaScript. Within those samples, the researchers analyzed 2.23 million package references to determine how many pointed to real, existing libraries versus nonexistent ones. The results were stark: 440,445 references—about 19.7 percent—pointed to packages that did not exist. This means that roughly one in five dependencies suggested by AI-generated code could be invalid, dangerous, or misleading, depending on how the developer acts on the AI’s recommendations.

Another important dimension emerges when considering the distribution of these hallucinations across model types. Open-source models, such as those in the CodeLlama and DeepSeek families, showed a higher rate of hallucinations compared with commercial models. On average, open-source LLMs produced about 22 percent of their package references as hallucinations, while commercial models tended to exhibit considerably lower rates—just over 5 percent in the same broad assessment. This dichotomy raises critical questions about how model design, training data, and operational safeguards influence the trustworthiness of AI-generated code. It also points to the need for organizations to weigh the benefits of open-source, customizable AI systems against the elevated risk of non-existent dependencies that such systems may propagate.

The language of the generated code also matters for hallucination risk. Python-based outputs tended to produce fewer hallucinations than JavaScript-based outputs in the study, with Python hovering around 16 percent and JavaScript over 21 percent. This discrepancy is not accidental. JavaScript has a far larger package ecosystem—roughly ten times as many packages as Python—accompanied by a more intricate namespace landscape. As a result, the chances that an LLM will recall or correctly predict a precise package name rise and fall with ecosystem size and namespace complexity. In environments with dense and interconnected package landscapes, the model must recall numerous permutations of package names, versions, and metadata, increasing uncertainty in its internal predictions and, consequently, the likelihood of hallucinated references appearing in generated code.

In essence, package hallucination is not simply an isolated oddity; it is a systemic byproduct of how LLMs are trained, how they generalize from training data, and how they handle tasks that require precise, verifiable knowledge about external software components. The study’s findings illuminate a broader truth about AI-generated software: the very tool intended to accelerate development can inadvertently introduce a spectrum of risks that begin at the source—the code generation process itself.

The research team framed their work in the broader context of ongoing concerns about the trustworthiness of AI-generated outputs. In the software domain, trustworthiness translates into actionable guarantees: component provenance, verifiable dependencies, reproducible builds, and verifiable integrity. The presence of package hallucinations directly undermines these guarantees because it injects uncertainty into the earliest moments of code creation and package resolution. If developers adopt AI-generated code without rigorous verification, they can inadvertently introduce supply-chain vulnerabilities that propagate through downstream systems, potentially affecting end users, enterprises, and critical infrastructure.

The breadth of the study further emphasizes that the problem is not isolated to a single language, library ecosystem, or vendor. It spans multiple programming languages and a mixture of both open-source and commercial model families. When developers rely on AI to suggest dependencies, the risk compounds: non-existent packages may appear in generated code, be integrated into the project, and lead to downstream failures or, more perniciously, become vectors for exploitation if attackers misappropriate the hallucinated names to introduce malicious content. The reliability of AI-generated code thus hinges on robust validation practices that can distinguish real dependencies from non-existent ones before integration into a software project.

Beyond the immediate technical risk, these findings have implications for how teams design workflows around AI-assisted coding. The data shows that the hallucination phenomenon is not merely incidental; it is persistent and repeatable across multiple iterations. Researchers observed that a substantial portion of hallucinated package names recur across more than one query: 43 percent of package hallucinations recurred across all ten repeated queries of a given prompt, and 58 percent of hallucinated packages appeared in more than one of the ten iterations. This persistence makes the phenomenon especially dangerous for attackers and defenders alike: once a non-existent package name is known to be repeatedly suggested by an AI model, it becomes a predictable target. An attacker can publish a malicious package under that name and wait for developers or automated systems to fetch and install it.

This repeatability also underscores a risk vector for large organizations relying on AI-generated code across dozens or hundreds of projects. In large-scale environments, a relatively small set of hallucinated package names could appear repeatedly in code generated by AI assistants for many developers. If those names correspond to non-existent or intentionally malicious packages, the damage could accumulate quickly across teams and products. Attackers could attempt to take advantage of this repeatability by enumerating known hallucinated names, publishing malicious payloads under those names, and watching as AI-assisted development tools, package managers, and automated pipelines fetch and install the compromised components. The net effect is that a seemingly minor flaw in the model’s output—non-existent dependencies—could have outsized consequences in real-world software ecosystems.

Beyond the raw numbers, the study’s results reveal several notable patterns in how different models and programming environments interact with the problem. The researchers compared open-source LLMs with commercial variants, noting that the former’s larger rate of hallucinations might be tied to differences in parameter counts, training data scale, and the level of safety or policy tuning applied to the models. Commercial models, which often carry substantially higher parameter counts, are thought to benefit from broader training and optimization, but the researchers emphasize that the exact architecture and training details remain proprietary. The observation that commercial models generally exhibit fewer hallucinations than open-source ones persists despite the fact that the relationship between model size and hallucination rate is not straightforward in all cases; some open-source models did not show a clear correlation between size and hallucination rate, suggesting that other factors—data quality, training objectives, and post-training alignment—play critical roles.

In addition to model type, the study explored language-specific dynamics. The Python ecosystem, with its comparatively smaller and tighter package namespace, produced fewer hallucinations on average than JavaScript, which spans a much larger and more fragmented universe of packages. The larger, more complex ecosystem in JavaScript makes precise recall of package names more challenging for AI systems, increasing the probability that the model will generate a non-existent package reference. This insight highlights how the inherent structure of a programming language’s ecosystem can influence the dependable performance of AI-generated code. It also points to a broader principle: the risk profile of AI-assisted coding is context-sensitive, shaped by both the characteristics of the language being used and the design choices of the model and its training regime.

Placed in a broader security context, the findings reinforce the persistent vulnerability of AI-generated outputs. They also align with a wider discourse about the future of software development, in which AI-generated code is expected to become increasingly prevalent. A widely discussed forecast from a prominent technology executive suggests that a substantial share of code could be AI-generated within a relatively short window, underscoring the need for robust governance, verification, and security practices to accompany AI-enabled development. The study’s authors and commentators stress the importance of translating these insights into practical safeguards so that developers and organizations can reap the benefits of AI-assisted coding without exposing their software supply chains to new forms of risk.

In sum, the concept of package hallucination as observed in this study highlights a core tension in AI-assisted development: speed and convenience versus verifiability and security. AI can accelerate the generation of boilerplate, scaffolding, and routine code, but when that acceleration comes with a measurable rate of invalid dependencies, the cost-benefit calculus shifts toward stronger validation, provenance checks, and automated auditing. As the industry continues to embrace AI-assisted coding, the need to address the reliability of AI outputs, particularly in the realm of dependencies and package management, becomes not only a technical concern but a strategic imperative for software integrity and resilience.

The mechanics of dependency confusion and why it matters in AI-generated code

Dependency confusion, a tactic familiar to security practitioners, exploits how software projects resolve and fetch dependencies during build time. When an attacker publishes a malicious package under a name that matches a legitimate package but with a version or content that differs, systems that inadvertently prefer the highest version or the first-available match can end up pulling in the malicious code. In practice, this means an enterprise project can wind up installing a compromised component, even if the organization adheres to best practices for trusted sources and private registries, simply because the AI-generated code includes a reference to a name that matches a popular public package. Attackers have exploited this gap by piggybacking on the trust developers place in package names and the automatic resolution mechanisms of package managers.
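
The version-preference failure mode is easiest to see in a toy model. The sketch below is illustrative only: the package name, the contents of the private and public indexes, and the naive resolver are all hypothetical, and real package managers apply more elaborate, configurable resolution rules. It uses the third-party packaging library for version comparison.

    # Toy illustration of dependency confusion: a resolver that blindly prefers
    # the highest version lets a public, attacker-published package shadow an
    # internal one. All names and versions below are hypothetical.
    from packaging.version import Version  # pip install packaging

    internal_index = {"acme-billing-utils": Version("1.4.0")}  # private registry
    public_index = {"acme-billing-utils": Version("99.0.0")}   # attacker upload

    def naive_resolve(name: str) -> str:
        """Pick whichever index offers the higher version -- the unsafe behavior."""
        candidates = []
        if name in internal_index:
            candidates.append(("private index", internal_index[name]))
        if name in public_index:
            candidates.append(("public index", public_index[name]))
        source, version = max(candidates, key=lambda candidate: candidate[1])
        return f"{name}=={version} resolved from the {source}"

    print(naive_resolve("acme-billing-utils"))
    # -> the attacker-controlled 99.0.0 from the public index wins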

The relevance of this attack vector to AI-generated code lies in the way LLMs operate when they propose dependencies. When a model outputs a dependency name—the name of a library or package—it often lacks the real-world context that a human developer would bring to judgment about what constitutes a trustworthy or safe dependency. An attacker could exploit this by reserving the same name in a public registry with a malicious payload or by publishing a version that appears more recent or more feature-rich. If the model’s outputs align with these attacker-created artifacts, developers may accept and install the counterfeit package, especially if the package name is suggested by an AI system that users trust to optimize their workflow. The result can be a supply-chain compromise that begins at the earliest steps of code creation, propagating through builds, deployments, and downstream usage.

The historical dimension of dependency confusion adds gravity to the risk. The concept was demonstrated in a high-profile proof-of-concept several years ago, where counterfeit code was executed on real networks belonging to major companies, including high-visibility names in the technology sector. The lessons from that demonstration are instructive: once an attacker succeeds in introducing a counterfeit dependency into a project’s software supply chain, the malware can lie dormant, waiting for the appropriate trigger—such as a specific build or deployment—before executing. The combination of AI-generated code and dependency confusion can, in theory, magnify the impact of such attacks by increasing the frequency with which developers encounter questionable package recommendations and by expanding the pool of potential targets who might inadvertently install the malicious payload.

From a defender’s perspective, dependency confusion creates a framework for risk assessment and defense strategies that must adapt to AI-assisted coding realities. Traditional supply-chain defense mechanisms—like strict version pinning, robust provenance checks, and strict verification of dependency integrity—are only as effective as the rigor with which they are applied. If AI-generated code enters a project without adequate safeguards, the risk surface expands dramatically. A single hallucinated package reference can undermine the integrity of a wide range of downstream processes, from build systems to continuous integration pipelines. This means security teams must elevate their monitoring, add AI-output verification steps, and implement stronger controls around how dependencies are selected and validated in AI-assisted workflows.

The practical implications of dependency confusion in AI-assisted code extend to how teams structure their development pipelines. Teams may need to enforce multi-step verification for AI outputs, including manual review of generated code sections that reference external packages, automatic cross-checks against authoritative registries, and automated scans for non-existent or suspicious dependencies. Additionally, organizations might adopt policy-based gating that prevents the installation of any dependency unless it can be explicitly verified through a chain of evidence that demonstrates its provenance and integrity. In environments where AI-generated code is a routine input, these safeguards become central to maintaining software quality, reliability, and security.
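
One concrete building block for such a pipeline is a static pass that lists every third-party module an AI-generated snippet imports, so a reviewer or a downstream registry check can confirm each one is real. The sketch below is one possible approach, not a prescribed tool: it uses Python’s standard ast module and sys.stdlib_module_names (available in Python 3.10 and later), the embedded snippet and the suspicious module name are invented for illustration, and import names do not always match registry distribution names, so results still need mapping and review.

    # Hedged sketch: statically extract third-party imports from AI-generated code
    # so each one can be verified against a registry or allowlist before install.
    import ast
    import sys

    # Stand-in for an AI-generated snippet; the second import is invented.
    generated_code = "import requests\nfrom fastjsonvalidatorx import validate\n"

    def third_party_imports(source: str) -> set[str]:
        """Return top-level imported module names that are not in the standard library."""
        modules = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules.add(node.module.split(".")[0])
        # sys.stdlib_module_names requires Python 3.10+
        return {m for m in modules if m not in sys.stdlib_module_names}

    print(sorted(third_party_imports(generated_code)))  # candidates for verification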

One of the key takeaways from the study is that the hallucinated dependencies are not random one-off errors; they exhibit a pattern that is repeating and predictable. This repeatability makes them particularly valuable for malicious actors who wish to exploit a known, recurring set of non-existent package names to spread malware across many projects. The persistence of hallucinated names across multiple queries suggests that once a questionable dependency emerges in the model’s internal predictions, it can reappear again and again. For security professionals, this means that static and dynamic analysis tools, as well as human reviewers, should focus not only on current outputs but also on historical patterns of model behavior to anticipate where a model might produce recurring invalid dependencies.
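
If a team is re-running prompts anyway, the recurrence signal is cheap to compute. The following sketch assumes you already have a routine that extracts dependency names from each AI response (the per-run sets below are invented); it simply counts how often each name shows up across repeated runs of the same prompt, since names that recur are the ones most worth checking against the registry and the most attractive targets for squatting.

    # Sketch: flag dependency names that recur across repeated runs of one prompt.
    # The per-run name sets are hypothetical; in practice they would come from an
    # extraction step applied to each AI response.
    from collections import Counter

    runs = [
        {"requests", "flask", "fastjsonvalidatorx"},
        {"requests", "fastjsonvalidatorx"},
        {"requests", "flask"},
        {"requests", "fastjsonvalidatorx", "numpy"},
    ]

    counts = Counter(name for run in runs for name in run)
    recurring = {name: seen for name, seen in counts.items() if seen > 1}
    print(recurring)  # names seen in more than one run deserve extra scrutiny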

The intersection of AI-driven code generation and dependency confusion underscores the need for proactive, end-to-end security practices that embrace the realities of AI-assisted development. A comprehensive approach should include, among other measures, robust package naming hygiene in registries, stronger namespace separation to minimize the risk of name collisions, and explicit policy controls that require human verification for any non-existent or suspicious-sounding dependency. In practice, this means developers should be trained to scrutinize AI-proposed dependencies, ensure that every referenced package exists in the intended registry, and only install packages after cross-checking with authoritative sources and build-time provenance data. Such workflows, although potentially more time-consuming, are essential to preserving the integrity of modern software supply chains in an era when AI-born code is increasingly common.

Study highlights: large-scale testing, languages, and model types

The study’s design was extensive and methodical, enabling researchers to capture a wide range of scenarios that reflect real-world AI-assisted coding conditions. The team executed thirty distinct tests to explore how LLMs generate code and manage dependencies. Of these tests, sixteen focused on Python and fourteen on JavaScript, two of the most widely used programming languages in contemporary software development. Each test generated 19,200 code samples, giving a total of 576,000 code samples across all tests. The sheer volume of data provided a robust statistical foundation to analyze the frequency and distribution of dependency references, including the emergence of hallucinations.

In terms of content, the study examined 2.23 million package references embedded within those code samples. Out of these, 440,445 references—representing 19.7 percent of the total—pointed to packages that did not exist. Among these hallucinated references, 205,474 carried unique package names, showing that while many hallucinated names recur across different outputs, the pool of distinct invented names remains large. The implications of this finding are significant because it shows that both repetition and variation exist within the hallucination landscape. Attackers could exploit these patterns by focusing on non-existent package names that are repeatedly suggested by AI models, enabling them to push potentially malicious content into a broad set of projects that rely on AI-generated code.

Another important nuance revealed by the analysis concerns the rate at which hallucinations repeat across iterations. The researchers found that 43 percent of package hallucinations recurred across all ten repeated queries of a given prompt, and 58 percent of hallucinated packages appeared more than once across the ten iterations. This indicates that hallucinations are not simply random errors generated by chance in the model’s output; rather, they reflect a persistent phenomenon that endures across multiple interactions. Such persistence makes hallucinated package names even more valuable to malicious actors, who can leverage consistent patterns to reliably target a broad developer audience and insert malware via those recurring non-existent dependencies.

The study also sheds light on disparities across models and ecosystems. Open-source LLMs displayed a higher hallucination rate than commercial models on average, with the former showing around 22 percent package hallucinations compared to just over 5 percent for commercial models. The discrepancy invites a closer look at the factors behind hallucination rates, including model size, training data quality, and the calibration of safety and alignment objectives. The researchers noted that commercial models typically operate with substantially larger parameter counts, often estimated to be at least ten times larger than those of the open-source models they tested. However, they cautioned that the exact architectures and training specifics of these commercial variants remain proprietary. This complexity makes direct causal attribution difficult, but the overall trend points to a meaningful difference in demonstrated reliability between model families.

When considering language-specific outcomes, JavaScript consistently showed a higher rate of hallucinations than Python. The researchers attributed this partly to the relative abundance of packages in the JavaScript ecosystem and the complexity of its namespace management. The larger and more intricate landscape of JavaScript packages introduces more opportunities for confusion, misrecall, and misinterpretation by AI systems during dependency generation. Consequently, developers working primarily in JavaScript may encounter higher probabilities of hallucinated dependencies when relying on AI-generated code, compared with those working in Python, who benefit from a smaller, more tightly curated ecosystem with a less crowded namespace. This finding reinforces the idea that the risk profile for AI-assisted coding is not uniform across languages; it is shaped by ecosystem structure and the way package management is organized within each language’s tooling environment.

The researchers also explored the broader drivers behind these patterns, looking at factors such as model size, the diversity and scope of training data, instruction tuning, and safety safeguards that influence model behavior. They observed that differences between commercial and open-source models are not solely a function of parameter count; training data quality and the presence of instruction-following, safety, and alignment practices likely play a critical role in shaping the propensity for hallucinations. In particular, the bigger, more data-rich commercial models may leverage richer priors and more rigorous alignment protocols, which can help suppress erroneous outputs in the form of nonexistent dependencies. Yet, the authors emphasize that there is no simple linear relationship between size and reliability; the reality is more nuanced and depends on a combination of data sources and post-training processes that steer model outputs toward safer, more verifiable results.

One of the most consequential takeaways from the study is not just the existence of package hallucinations but the fact that these hallucinations can be persistent across different contexts and even across multiple iterations of prompting. This persistence implies that the risk is not merely a product of a single anomaly in a model’s output but a stable pattern that can be anticipated and addressed with appropriate safeguards. The practical implication is that organizations must invest in comprehensive validation pipelines that scrutinize AI-generated outputs for the presence of real, verifiable dependencies before those outputs are adopted into any build or deployment workflow. In the face of such persistence, ad hoc checks or a single manual review are unlikely to be sufficient; rather, a disciplined approach to dependency validation must be embedded into AI-assisted development processes.

Beyond these technical findings, the study’s results contribute to the broader narrative of AI reliability and trust in software development. The leading voices in the field have suggested that large-scale AI adoption in coding may reshape the landscape of software engineering within a few years. The notion that a large percentage of code could be AI-generated within a five-year horizon has been widely discussed by industry leaders. In this context, the study’s conclusions underscore a critical challenge: the need to align AI development with robust security measures that can prevent fresh vectors of compromise as AI becomes more deeply integrated into the software delivery lifecycle. The authors, and the security community at large, advocate for a combination of defensive tactics, policy controls, and developer education to mitigate the risks associated with dependency hallucinations and the broader phenomenon of AI-generated code that might inadvertently open doors to supply-chain exploitation.

Why model size, data, and safety tuning influence package hallucination rates

The observed differences in package hallucination rates across model families can be traced to several core factors: model size, training data breadth, fine-tuning strategies, instruction training, and safety-focused post-processing or alignment protocols. The researchers highlighted that large commercial models typically have significantly more parameters than their open-source counterparts, and this scale advantage is likely a contributor to their reduced propensity to hallucinate non-existent dependencies. The intuition is that larger models can capture more robust patterns in the training data, better differentiate between existing and non-existent dependencies, and maintain a heightened level of internal consistency when dealing with naming conventions across a broad set of libraries and packages.

However, the relationship is not unidirectional or purely size-driven. The training data used to create an LLM — the actual corpora, code repositories, package registries, and associated metadata — profoundly shapes how accurately and faithfully the model can recall dependencies. The study notes that the exact architecture and training specifics for commercial models remain proprietary but emphasizes that differences in data and alignment processes likely contribute to the observed performance gap. Fine-tuning for instruction following and safety alignment can either reduce or inadvertently increase hallucinations depending on how those processes interact with the model’s internal predictive mechanisms. In some instances, overly aggressive safety tuning could dampen certain outputs in ways that inadvertently cause the model to substitute non-existent dependencies that seem safer or more plausible within the model’s learned distributions.

Another dimension the researchers considered is the breadth and granularity of the model’s training objectives. Instruction tuning often emphasizes producing outputs that appear coherent and useful, but it may not always prioritize the factual accuracy of each component within AI-generated code, especially when the model is asked to generate dependencies spontaneously. If instruction-following prompts encourage the model to produce a complete, ready-to-run code snippet, the model may be incentivized to complete the surrounding context with plausible-sounding package names, even if those names do not correspond to actual packages. The safety tuning, designed to curb risky or harmful outputs, can also shape the model’s behavior in complex ways that either suppress or fail to suppress hallucinations, depending on the specific safety constraints and implementation details.

In addition to model size and training discipline, ecosystem characteristics play a critical role. The size and complexity of a programming language’s package ecosystem influence how reliably a model can recall precise package names. JavaScript’s ecosystem, being larger and more interconnected, introduces more opportunities for confusion and misremembered dependencies. The sheer volume of available packages increases the probability that a given package name a model generates may not correspond to an actual library in the registry. Conversely, Python’s ecosystem, while substantial, is more bounded and, in some contexts, more tightly curated, which can help the model achieve higher recall accuracy for package names while still encountering non-existent references in other contexts.

The study’s observations imply a broader pattern: reliability in AI-generated code does not stem solely from model scale or intelligence. It emerges from a synergy of how models are trained, how they are aligned to user objectives responsibly, how their outputs are validated, and how developers interact with the outputs. The presence of package hallucinations is a symptom of deeper design choices and ecosystem interactions that developers and organizations must consider when integrating AI into engineering workflows. The takeaway is not merely to chase bigger models or more data; it is to cultivate robust validation and governance practices that can counter the inherent risk of hallucinations in AI-generated code.

The researchers’ insights about language-specific differences have practical implications for teams choosing their AI-assisted tooling. If a project is primarily Python-based, teams might expect a lower baseline rate of hallucinations, but they should still be vigilant about the possibility of non-existent dependencies slipping into the code. If a project relies heavily on JavaScript, the risk is higher, and teams must implement stronger checks around dependency naming, signature validation, and provenance verification. Moreover, the results suggest that organizations should not assume that using commercial models automatically guarantees the most reliable output. Even with lower measured hallucination rates, AI-generated code should be treated as requiring ongoing verification and auditing, particularly around modular, third-party dependencies.

Taken together, these findings contribute to a nuanced understanding of the practical security implications of AI-assisted coding. They underscore that the deployment of AI tools in software development cannot be decoupled from the broader governance, risk management, and security practices that govern how dependencies are selected, validated, and deployed. They also highlight a critical area for future research and development: the creation of robust, automated methods to detect and mitigate package hallucinations as part of the standard CI/CD workflow. By embedding these checks into the build and deployment pipelines, organizations can reduce the risk that AI-generated outputs will introduce non-existent or malicious dependencies into production software.

Implications for software security, developers, and organizations

The emergence of package hallucination as a tangible security concern reshapes how developers and organizations think about AI-assisted coding. The potential for non-existent dependencies to appear in AI-generated code introduces a novel vector for supply-chain risk that must be addressed through comprehensive, layered defenses. Organizations must consider reinforcing their software supply chain with stronger governance around dependency management, including:

  • Proactive validation of AI-generated outputs: Implement automated verification to confirm that all dependencies referenced by AI-generated code actually exist in trusted package registries and have verifiable provenance.
  • Enhanced dependency provenance: Require strict traceability for all dependencies, including version histories, authorship, and integrity checks, to ensure a reliable chain of custody from source to build to deployment (a minimal integrity-check sketch follows this list).
  • Reproducible builds and lockfiles: Ensure builds are deterministic by pinning dependency versions and using lockfiles that capture the exact versions used in a given build, reducing the risk that a non-existent or malicious package is substituted during resolution.
  • Namespace hygiene and package naming policies: Apply namespace controls to reduce the risk of accidental or deliberate name collisions that could lead to dependency confusion. Develop governance around how names are selected and used in AI-generated code to minimize the creation and propagation of risky or ambiguous names.
  • Developer education and process integration: Train developers to review AI-generated dependency suggestions critically, verify their existence, and avoid auto-installation of packages suggested by AI without proper verification steps.
  • Secure pipelines and scanning tools: Integrate security scanning that specifically checks for non-existent or suspicious dependencies in AI-generated outputs, and incorporate continuous monitoring to flag unusual patterns or repeating hallucinations within dependency lists.
  • Language- and ecosystem-aware strategies: Recognize that JavaScript projects may require additional safeguards due to the ecosystem’s size and complexity, while Python projects can benefit from strong, centralized registries and validated dependency trees.
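
As a small illustration of the provenance point above, the sketch below compares a downloaded artifact’s SHA-256 digest against a hash recorded when the dependency was originally vetted. The file path and expected digest are placeholders, and real pipelines would typically lean on their package manager’s built-in hash checking (for example, hash-pinned lockfiles) rather than a hand-rolled script.

    # Minimal integrity-check sketch: verify a downloaded artifact against a
    # digest recorded at vetting time. Path and expected digest are placeholders.
    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    expected = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"  # placeholder
    artifact = Path("dist/some_dependency-1.2.3-py3-none-any.whl")  # placeholder

    if artifact.exists() and sha256_of(artifact) == expected:
        print("integrity check passed")
    else:
        print("integrity check FAILED; do not install")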

The study’s results also reinforce the broader lesson that AI-generated outputs require continuous, end-to-end controls. Even if a model’s outputs appear coherent and plausible, there is no guarantee that every component is real or safe. Security practices need to evolve alongside AI capabilities, ensuring that new code, especially when generated by AI, is subject to the same rigorous checks as hand-written code. This includes not just verifying the existence of dependencies but also assessing their provenance and integrity in a consistent, auditable manner.

For developers, the practical takeaway is that AI can be a powerful productivity booster, but it does not absolve engineers of due diligence. The presence of hallucinated dependencies means that every AI-generated code segment that references an external library must be treated as a potential risk until confirmed safe. This requires integrating AI outputs into a disciplined workflow that includes human oversight, robust test coverage, and infrastructure-level safeguards to prevent the accidental introduction of non-existent or malicious components into production systems.

Organizations should also acknowledge that the landscape of AI-generated code is evolving. Model improvements, better alignment techniques, and more sophisticated safety mechanisms may reduce hallucination rates over time. However, adoption pace must be matched with corresponding risk management investments and policy development. In the near term, security and engineering teams must anticipate that AI-generated code will carry non-trivial risks related to dependency hallucinations and supply-chain vulnerabilities, and they should design processes that specifically address these risks.

Practical steps to mitigate package hallucination risks in AI-assisted coding

To translate these insights into actionable practices, teams can implement a layered approach to mitigate package hallucination risks. The following steps summarize a practical, security-focused pathway:

  • Step 1: Implement AI-output verification at the source. Before any AI-generated code is added to a codebase, run an automated check to identify and validate all dependencies it references. If any dependency is non-existent or cannot be confidently verified, flag the output for human review and revise accordingly.
  • Step 2: Build robust provenance for dependencies. Maintain a comprehensive log of dependency sources, including the exact package registry, the version, the release date, and the cryptographic integrity hash. This provenance should be easily auditable and linked to each build artifact.
  • Step 3: Enforce strict dependency resolution policies. Use private registries or trusted mirrors for critical components, coupled with strict allowlists that restrict which dependencies can be resolved. Consider implementing negative lists to block known non-existent or dangerous packages.
  • Step 4: Emphasize reproducible builds and lockfiles. Ensure that every build uses a pinned set of dependency versions, captured in lockfiles, so that downstream environments build deterministically and are not susceptible to changes in public registries or non-existent packages appearing in future iterations.
  • Step 5: Adopt namespace governance and naming conventions. Establish clear naming rules for packages and internal dependencies, reducing ambiguity and the chance of conflict with public libraries. Enforce checks that AI-generated outputs do not propose names that collide with high-risk or ambiguous entries without validation.
  • Step 6: Integrate multi-layer verification within CI/CD pipelines. Incorporate tools that specifically scan for non-existent dependencies, verify package existence in official registries, and cross-reference with known-good baselines. Add a security gate that halts builds if suspicious dependency patterns are detected; a hedged sketch of such a gate appears after this list.
  • Step 7: Invest in human-in-the-loop validation. While automation is essential, human review remains a critical line of defense for AI-generated code, especially when dependencies are involved. Create review queues for AI-generated components with potential dependency concerns, and provide clear criteria for escalation and remediation.
  • Step 8: Monitor for repeatable hallucination patterns. If the AI system demonstrates repeated generation of certain non-existent package names, treat those patterns as high-priority risk signals. Develop targeted checks to catch these recurring artifacts and block or remediate them before integration into codebases.
  • Step 9: Align AI usage with labeling and policy requirements. Maintain transparency about when AI assistance is used in coding, and annotate AI-generated segments accordingly. This makes it easier to apply policy controls, conduct audits, and improve accountability across teams.
  • Step 10: Prepare for ecosystem-specific risk management. Customize risk mitigation strategies to the ecosystem in use. For JavaScript-heavy teams, deploy extra scrutiny for package naming and ecosystem breadth; for Python teams, emphasize tight version pinning and robust registry verification to minimize hallucination impact.
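
To make Step 6 more concrete, here is a hedged sketch of a dependency gate that could run in CI. Everything in it is an assumption for illustration: the requirements.txt path, the tiny allowlist, and the rule that every requirement must be pinned with == are stand-ins for whatever policy an organization actually adopts, and real requirement files (extras, environment markers, hashes) would need a fuller parser.

    # Hedged CI-gate sketch: fail the build when a requirement is unpinned or not
    # on a vetted allowlist. Paths, the allowlist, and the policy are hypothetical.
    import re
    import sys
    from pathlib import Path

    ALLOWLIST = {"requests", "flask", "packaging"}  # hypothetical vetted packages
    PINNED = re.compile(r"^(?P<name>[A-Za-z0-9_.\-]+)==(?P<version>[A-Za-z0-9_.\-]+)$")

    def check_requirements(path: str = "requirements.txt") -> int:
        problems = []
        for raw in Path(path).read_text().splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
            if not line:
                continue
            match = PINNED.match(line)
            if not match:
                problems.append(f"not pinned to an exact version: {line!r}")
            elif match["name"].lower() not in ALLOWLIST:
                problems.append(f"not on the vetted allowlist: {match['name']!r}")
        for problem in problems:
            print(f"DEPENDENCY GATE: {problem}")
        return 1 if problems else 0

    if __name__ == "__main__":
        sys.exit(check_requirements())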

This practical framework is intended to be adaptable and scalable, capable of evolving as AI systems advance and as new types of hallucinations or supply-chain risks emerge. It recognizes that the most effective defense against package hallucination is not a single tool or practice but a comprehensive, layered approach that encompasses people, processes, and technology. By embedding verification early in the AI-assisted workflow and reinforcing it with governance and monitoring, organizations can significantly reduce the likelihood that AI-generated code will introduce non-existent dependencies or create new attack surfaces in the software supply chain.

The broader context: AI-generated code, trust, and the path forward

As AI-generated code becomes more common, the software industry faces a pivotal question: how to balance the speed and scale benefits of AI assistance with the equally critical need for software integrity and security. The findings about package hallucinations contribute to a broader discourse about trust in AI-assisted development. If AI can generate code rapidly but also introduce a meaningful probability of non-existent or malicious dependencies, the burden of responsibility shifts toward developers and organizations to implement rigorous controls that can detect and remediate such risks before they affect users or systems.

Industry leaders are actively exploring how to integrate AI into development pipelines without compromising security. Some advocate for stronger policy and governance around AI usage in coding workflows, including explicit obligations to validate AI outputs and to document provenance for every dependency introduced through AI generation. Others emphasize the need for improved tooling that can automatically detect, flag, and block hallucinations, ensuring that builds are reproducible and auditable no matter how code is produced. The consensus is that AI is here to stay as a powerful helper, but its outputs must be treated as untrusted until proven trustworthy through rigorous checks and ongoing monitoring.

The implications extend beyond the technical realm into organizational culture and risk management. Security teams must collaborate closely with software engineers to embed AI risk controls into everyday practices. This means designing training programs that emphasize dependency validation, constructing governance models that require independent verification for AI-generated components, and investing in infrastructure that can support continuous scanning, provenance verification, and incident response in the event of a compromised dependency. It also means acknowledging the limits of AI: while these models can accelerate development, they do not inherently guarantee correctness or security, especially when they interact with complex, real-world software ecosystems.

In the larger arc of AI-enabled software development, the study’s findings highlight a critical gap between the ambition of AI-assisted coding and the practical requirements of secure software delivery. The fact that significant fractions of AI-generated dependencies are hallucinated calls for a recalibration of expectations: AI can be a powerful assistant, but the discipline of software engineering—careful design, rigorous verification, and proactive risk management—remains essential. The path forward is not to abandon AI in development but to elevate the standards by which we use AI, by building robust checks into the machine-assisted workflow, and by treating AI outputs with the same scrupulous scrutiny that human-produced code warrants.

The study’s authors and the broader security community stress that these insights should inform not only immediate defensive measures but also ongoing research and development in AI safety and reliability. As AI models continue to evolve, it is plausible that new safeguards could reduce hallucination rates or enable more precise recall of real-world dependencies. The challenge lies in ensuring that such improvements translate into tangible security gains in real-world software delivery, while preserving the creativity and productivity benefits that AI can unlock. The conversation thus extends beyond academia and industry conferences into the daily routines of software teams, who must translate these lessons into practical, sustainable practices that keep their software secure even as AI accelerates development.

In a landscape where AI-generated code is poised to become more prevalent, the lessons from this study provide a concrete roadmap for risk-aware adoption. By understanding the nature of package hallucinations, recognizing their recurrence patterns, and implementing comprehensive, multi-layered mitigations, organizations can boldly pursue the benefits of AI in software development while safeguarding the integrity of their supply chains. The goal is to enable developers to write faster, smarter code without inadvertently inviting new classes of vulnerability into production systems. As the ecosystem evolves, continuous learning, vigilant validation, and proactive governance will be the keystones of securely harnessing AI’s potential in software engineering.

Conclusion

The research presents a sobering verdict: AI-generated code frequently references non-existent dependencies, a problem that is not fleeting but persistent across multiple models, languages, and iterations. The implications for software supply chains are substantial, particularly because a significant portion of hallucinations repeats across many queries, creating repeatable attack surfaces that could be exploited at scale. Dependency confusion emerges as a central threat vector in this context, since counterfeit packages can masquerade as legitimate components and exploit the trust developers place in automated tooling. The study’s data shows that open-source models tend to produce more hallucinations than commercial counterparts, and JavaScript code is especially susceptible due to ecosystem size and complexity. Yet the picture is nuanced: while model size and training practices influence the rate of hallucinations, other factors such as data quality, alignment strategies, and ecosystem characteristics also play critical roles.

For developers and organizations, these findings translate into clear, actionable steps: validate AI outputs for real dependencies, enforce provenance and reproducibility, adopt strict namespace governance, and integrate layered defenses into CI/CD pipelines. The path forward involves embracing AI as a powerful ally while implementing rigorous safeguards that ensure the reliability and security of the software supply chain. As AI continues to reshape software development, the industry must translate these insights into robust practices and continuous improvements, ensuring that the benefits of AI-assisted coding do not come at the expense of security or trust. The ultimate objective is to build a future where AI-enabled development accelerates progress without compromising the integrity of the software that society increasingly relies upon.
