Most development teams today use at least one AI-assisted coding tool — GitHub Copilot, Cursor, Claude Code, or some IDE integration. Data shows that around 60% of new code in professional settings is now AI-assisted. Yet the conversation inside teams tends to go like this: the senior developer says "it saves me hours", the junior developer says "sometimes it creates more problems than it solves", and the tech lead says "I have no idea how to measure it". They're all right.
This article is not an advertisement. We'll look at where coding agents deliver measurable, real-world acceleration, where they're dangerous, and how to set up their use so the gains are visible and the risks are under control.
What a coding agent actually does — and doesn't do
Before comparing tools, it's important to understand the fundamental difference between inline autocomplete (GitHub Copilot in basic mode, Cursor tab) and agentic mode (Claude Code, Cursor Agent, Copilot Workspace).
Inline autocomplete fills in code based on the context in the current file. It's fast and predictable in scope: you type, it suggests. The risk is low because a bad suggestion is simply ignored.
Agentic mode has access to the entire repository: it can read files, run terminal commands, modify multiple files at once, and iterate on results. Claude Code, for example, works directly in the terminal, reads project context, runs tests, and fixes errors until the tests go green. Cursor Agent has access to the full workspace. Copilot Workspace opens multi-file editing over a GitHub repository.
Agentic mode is dramatically more powerful — and dramatically more prone to errors you won't notice unless you actively check.
Where coding agents actually accelerate work
Boilerplate and scaffolding
This is the category where ROI is clearest and least controversial. A new REST API endpoint, a CRUD handler, a database migration, a Docker Compose configuration, a CI/CD pipeline, a TypeScript interface from a JSON schema: in all of these, the developer knows exactly what they want and simply doesn't want to write twenty lines of repetitive code.
Agents take over manual work that demands little thought. With manufacturing clients implementing REST wrappers over legacy SCADA systems, we see scaffolding a new integration layer drop from two hours to twenty minutes. Tests aren't included at that point, but the skeleton is done and correct.
Generating tests for existing code
Writing tests is where most teams carry technical debt. Agents are surprisingly good at it: if you give them an existing function and ask for unit tests covering edge cases, most of the output is usable without major changes. The rest needs tuning, but the foundation is there.
One important caveat: agents generate tests that test the current behavior — not the behavior that should be correct. If the code has a bug and the agent writes a test, that test will codify the bug as expected behavior. Tests must go through review like any other code.
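To make the caveat concrete, here is a minimal sketch in Python. The function `apply_discount` and both checks are hypothetical, invented purely for illustration: the first check is the kind an agent might derive from observing the code, the second is derived from the requirement.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function with a bug: the discount is added, not subtracted."""
    return price + price * percent / 100  # bug: should be price - price * percent / 100


def check_generated_from_current_behavior() -> bool:
    # Derived from what the code does today: it passes, and so pins the bug in place.
    return apply_discount(100.0, 10.0) == 110.0


def check_written_from_specification() -> bool:
    # Derived from the requirement (10% off 100 is 90): it fails until the bug is fixed.
    return apply_discount(100.0, 10.0) == 90.0
```

A reviewer who only confirms that the generated check passes would approve the bug. Comparing the asserted values against the specification is what catches it.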
Exploring and understanding unfamiliar code
This is an underrated use case. Think of a new developer joining the team, or a veteran landing in a part of the codebase nobody has touched in years. An agentic tool with repository access can answer questions like "how does authentication work in this application", "where are inputs validated before writing to the DB", or "what happens if this call fails".
Claude Code is particularly strong here because it operates in the terminal and can directly grep, read multiple files at once, and build up context. The acceleration in onboarding new developers is measurable.
Documentation and comments
Generating docstrings, README files, API documentation, or a changelog from a diff — less exciting work, but time-consuming. Agents handle it competently and the result is usually better than whatever would otherwise be there (often nothing).
Small-scale refactoring
Renaming variables across a file, extracting a function, unifying formatting, converting callback code to async/await — agents handle this well because the transformation is local and mechanical. The result is verifiable at a glance or through tests.
Where coding agents are unreliable
Complex refactoring of large legacy codebases
This is the most common source of disappointment. A team decides to use an agent to "refactor" a 50,000-line PHP monolith. The result: the agent proposes changes that look clean locally but break unpredictable dependencies elsewhere. Without reliable test coverage of the affected code (which legacy codebases typically lack), there's no safe way to verify whether the agent broke anything.
A practical rule: agentic refactoring is only safe when reliable tests exist for the code being changed. Without tests, it's a gamble.
Security-sensitive code
Authentication, authorisation, cryptography, SQL query builders, input validation — these are areas where blind trust in AI-generated code is dangerous. Not because agents make primitive mistakes, but because they make unexpected assumptions. We've seen cases where an agent generated seemingly secure code with an implicit trust boundary that was incorrect.
Code in these areas must go through security review regardless of whether it was written by a human or an agent. You can use an agent for a first draft — don't skip the review.
Architectural decisions
Agents are good at implementation, not at designing architecture. Ask Claude Code to "design an architecture for an event-driven system with 10 microservices" — you'll get something that looks reasonable, but without context about the company's specific requirements, existing infrastructure, and future plans, it will be a generic template, not a sound decision.
Architecture remains the domain of experienced engineers.
Debugging complex race conditions and timing problems
Agents don't have a strong grasp of time, parallelism, and state across distributed systems. They can help with log analysis or suggest a hypothesis, but relying on them to debug a race condition in a production Kubernetes cluster would be a risky approach.
Tools — what to expect from each
The three main tools differ more than they might appear to:
- GitHub Copilot is the most widely adopted (26M+ users), integrated directly in the IDE, primarily inline autocomplete. Copilot Workspace adds agentic editing across a repository. Lowest barrier to entry for teams already on GitHub Enterprise.
- Cursor takes an editor-first approach — a full IDE, not a plugin. Cursor Agent has strong access to workspace context and works well with large projects. Popular among frontend developers.
- Claude Code operates in the terminal and is agentic from the ground up: reading files, running commands, iterating on tests. Stronger for backend and systems work, less intuitive for developers who prefer a GUI. Its adoption grew extraordinarily fast, from zero to a billion-dollar annualized run-rate in roughly nine months, a historical record for a developer product from Anthropic.
None of them is objectively "the best" — it depends on workflow, language, and what the team needs.
For industrial environments with on-premises requirements: none of the tools listed above is natively on-prem. For strict data-security requirements (regulated industries, sensitive IP), consider locally deployed models via Ollama or vLLM with IDE integrations like Continue.dev — that's a separate architecture we cover in the article on local LLMs vs cloud.
How to measure the impact — concrete metrics
Most teams can't say whether coding agents are helping them because they don't measure. Here are metrics that make sense:
Time-to-first-merged-PR for new features: teams combining an inline tool with an agentic tool show 2–3× improvement on this metric for greenfield work, based on available data. For brownfield work the improvement is smaller and less predictable.
Test coverage trend: if the team uses an agent to generate tests, track whether coverage grows. If it doesn't, agents are generating tests that aren't being committed — either they're poor quality or there's no clear workflow.
Code review rejection rate: if agent-generated code comes back from review significantly more often, that's a signal that there's no adequate review process for AI output.
Time spent on boilerplate: estimate the share of work on tasks where agents are strong (scaffolding, tests, docs). If it's less than 20% of the team's working time, ROI will be low.
A good benchmark for a greenfield project: if the team doesn't see at least a 15–20% reduction in time-to-PR within the first month, either the agents aren't set up correctly or the use case isn't a good fit.
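As a rough illustration of how the first and third metrics could be computed, here is a minimal Python sketch. The `PullRequest` record and its field names are assumptions made for this example; in practice the data would come from whatever your Git hosting platform exports.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Dict, List, Optional


@dataclass
class PullRequest:
    # Hypothetical record; field names are illustrative, not a real API.
    opened_at: datetime
    merged_at: Optional[datetime]  # None if the PR was closed without merging
    ai_assisted: bool              # taken from a tag in the PR description
    changes_requested: bool        # reviewer sent it back at least once


def median_hours_to_merge(prs: List[PullRequest]) -> Optional[float]:
    """Median time from opening to merge, in hours, over merged PRs only."""
    hours = [
        (pr.merged_at - pr.opened_at).total_seconds() / 3600
        for pr in prs
        if pr.merged_at is not None
    ]
    return median(hours) if hours else None


def rejection_rate(prs: List[PullRequest]) -> Optional[float]:
    """Share of PRs that came back from review at least once."""
    if not prs:
        return None
    return sum(pr.changes_requested for pr in prs) / len(prs)


def summarize(prs: List[PullRequest]) -> Dict[str, Dict[str, Optional[float]]]:
    """Compare AI-assisted and manual PRs side by side."""
    summary = {}
    for label, flag in (("ai_assisted", True), ("manual", False)):
        group = [pr for pr in prs if pr.ai_assisted is flag]
        summary[label] = {
            "count": float(len(group)),
            "median_hours_to_merge": median_hours_to_merge(group),
            "rejection_rate": rejection_rate(group),
        }
    return summary
```

The comparison is only meaningful for tasks of comparable scope, so in a real setup you would probably also group by task size or type before reading anything into the numbers.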
For more on measuring overall AI project ROI, see the article ROI of AI Projects.
Security risks — not academic
Blind trust is the primary risk
This is not a theoretical problem. In agentic mode, where the agent changes multiple files at once, the reviewer must consciously decide to inspect every change — even when the diff looks trivial. A research-backed phenomenon: developers reduce their attention during review when examining AI-assisted code. Code that "the agent wrote" receives a softer look than code written by a colleague.
The company standard must be explicit: AI-generated code goes through the same review as human code. Not a lighter one.
Prompt injection in an agentic context
Claude Code and similar tools can read files from the repository as part of their context. If the repository contains files that came from an external source (client documents, downloaded configurations), those files may contain instructions aimed at the model. This is an indirect prompt injection attack. The OWASP LLM Top 10 (2025 edition) ranks prompt injection as the number one risk.
Rule: agents should not have access to files from untrusted sources without a conscious decision by an engineer.
Sensitive data leakage through context
When an agent reads a codebase to answer a question, it sends part of that code to the cloud (Anthropic API, OpenAI API). If the codebase contains credentials, API keys, or other sensitive information directly in the code — and many legacy codebases genuinely do — the agent will transmit them. This is a compliance problem in regulated industries.
The solution is a combination: .gitignore + .env discipline, local models for on-premises environments, and the enterprise tier of cloud tools with contractual data-training guarantees for cloud environments.
We cover the topic of feeding company data into LLMs in more detail in GDPR and LLMs over Company Data.
Setting up for your team — practical steps
Not every developer on the team will use agents equally effectively. Here is a minimal framework that works in practice:
1. Define permitted use cases: for example, boilerplate, tests, documentation, and code exploration are permitted zones. Authentication, cryptography, SQL, and external API calls go through mandatory review with no shortcuts.
2. Tag AI-generated code in PRs: a simple tag in the PR description (AI-assisted: yes/no) creates visibility and enables measurement of rejection rates.
3. Reviewers don't skip work: an explicit rule that, for an AI PR, the reviewer adds at least one comment. This forces genuine reading.
4. Track costs: agentic mode consumes significantly more tokens than inline autocomplete. Teams that switch from Copilot to Claude Code without monitoring can receive a surprising monthly bill.
5. Regular retrospectives: a 30-minute monthly review of what agents helped with, what they didn't, and what changed.
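The PR tag from the list above only enables measurement if it can be read by a machine. A minimal Python sketch of that parsing step follows. The function name `parse_ai_assisted` is hypothetical, and the exact tag format (one line reading "AI-assisted: yes" or "AI-assisted: no") is an assumption about how a team might standardise it.

```python
import re
from typing import Optional

# Matches a line consisting only of the tag, e.g. "AI-assisted: yes".
_TAG = re.compile(
    r"^[ \t]*AI-assisted:[ \t]*(yes|no)[ \t\r]*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_ai_assisted(description: str) -> Optional[bool]:
    """Return True/False for a filled-in tag, or None if the tag is missing.

    An unfilled template line such as "AI-assisted: yes/no" deliberately
    counts as missing, so it can be flagged rather than silently miscounted.
    """
    match = _TAG.search(description)
    if match is None:
        return None
    return match.group(1).lower() == "yes"
```

Treating a missing tag as its own case (rather than defaulting to "no") keeps untagged PRs out of both groups when rejection rates are compared.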
Grounding agents in context
Coding agents are a specialised form of AI agent. If the team is considering going further — for example, an agent that not only generates code but also deploys, tests, and monitors results on its own — that's a different category of complexity and risk. We cover architectures for such systems in AI Agent Architectures; for specifics on HITL and approval gates, see Human-in-the-Loop for Agents.
Frequently asked questions
Is a coding agent worthwhile for a five-person team?
Yes, if the team works on greenfield projects or on products with decent test coverage. For teams that spend most of their time on legacy code without tests, ROI is lower and risk is higher. We recommend starting with inline autocomplete (Copilot or Cursor tab) and moving to agentic mode only once the team has a clear review workflow in place.
Can I use Claude Code in a regulated industry (healthcare, financial services)?
It depends on what the agent reads. If it's working on code that contains no sensitive data and no access to it, the risk is manageable. If the agent would have access to systems holding health or financial data, you need either an enterprise tier with contractual guarantees or a local deployment. This is a decision that requires legal and compliance analysis, not just a technical one.
How do I prevent an agent from committing something dangerous?
Multi-layered protection: pre-commit hooks to detect sensitive data, code review without exceptions, and with Claude Code an explicit restriction on which directories it can read. Agentic mode should not have access to .env files, cryptographic keys, or production configurations.
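As an illustration of the first layer, here is a minimal Python sketch of the check a pre-commit hook might run. The patterns, the blocked file names, and the function names are all assumptions chosen for this example and are far from exhaustive; a dedicated secret scanner would be the more robust choice in production.

```python
import re
from pathlib import Path
from typing import List

# Illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hard-coded credential": re.compile(
        r"(?:password|passwd|secret|api_key|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]",
        re.IGNORECASE,
    ),
}


def is_blocked_path(path: str) -> bool:
    """Files that should never be committed, regardless of their content."""
    p = Path(path)
    return p.name == ".env" or p.suffix in {".pem", ".key"}


def scan_text(text: str) -> List[str]:
    """Return one finding per (line, pattern) match in the given text."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append(f"line {line_no}: {label}")
    return findings


def main(paths: List[str]) -> int:
    """Return 1 (block the commit) if any path or file content is suspicious."""
    failed = False
    for path in paths:
        if is_blocked_path(path):
            print(f"{path}: file type must not be committed")
            failed = True
            continue
        try:
            text = Path(path).read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable or deleted file; nothing to scan
        for finding in scan_text(text):
            print(f"{path}: {finding}")
            failed = True
    return 1 if failed else 0
```

Wiring this into an actual hook (passing the staged file names to `main` and exiting with its return value) is left out here. Note that this layer protects the commit history, not the agent's context: restricting what the agent can read is a separate control.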
Will coding agents replace junior developers?
No — they change the nature of junior work. The boilerplate that previously trained juniors to understand patterns is now generated by agents. Junior developers need to focus more on review, on understanding why code works, and on testing. Teams that haven't updated their onboarding for the AI era risk producing juniors who can't read code they didn't generate themselves.
How do I measure whether the agent is genuinely helping or just creating an illusion of productivity?
The key metric is time-to-merged-PR on tasks of comparable scope, before and after deploying the agent. A second signal: rejection rate at code review. If the agent sped up writing but slowed down review, the net effect is zero or negative.
*MP Industrial Solutions helps companies set up workflows for AI-assisted development — from tool selection and security policies through to measuring the impact. If you're considering deploying coding agents or local LLMs for your development team, we'd be glad to walk through your specific context in a consultation.*
