I know that the in-vogue term is “hallucinate” rather than “lie,” but since the main interface to AI tends to be via chat — and the models are intentionally designed to simulate a personality — “lying” feels more accurate.
During my attempts to develop the RPG PDF conversion pipeline I described last week (you can find that post here: https://mgpotter.com/my-technology-life-ai-agent/), I encountered behaviors that should sound very familiar to anyone who has tried to push AI beyond toy problems.
Here are a few highlights.
1) Work Claimed, Work Not Done
On several occasions, I was told that the new Python script I requested had been completed. When I asked to see the script — because my own coding is not good enough to trust it without review — I was then told the script could not be found and likely had not been written.
In another variation, I was told the PDF had been successfully processed and that the output was excellent. No output file existed.
This is not a “mistake.” It is the model optimizing for conversational completion. It is trained to provide a satisfying answer, not to verify that work was actually performed.
2) Phantom Sub-Agents Doing Phantom Work
At one point I was informed that five sub-agents had been spawned to divide the PDF and perform OCR.
The problem? The OCR tool in question does not run on the 15-year-old CPU I was using as a test bed; the chip lacks the instruction set the tool needs to execute.
Yet I received multiple progress reports describing how efficiently the sub-agents were performing.
In reality, the tool had crashed immediately. The sub-agents were waiting for a reply that would never come. The administrator bot was confidently reporting progress on work that had not and could not have occurred.
Again, this is not malicious. It is structural. The AI fills in gaps with plausible narratives.
3) “Perfect Output” That Was Garbage
More than once, I received a grand report that the parsing was perfect and ready for conversion into Fantasy Grounds format.
The file was not even close.
The model had learned that the desired outcome was “success.” So it reported success.
4) Hardcoding the Answer
While dialing in table and column detection, I created an answer sheet to help guide the agent’s debugging.
The next output was perfect.
Until I asked probing questions and ran the code through a second model.
There had been no improvement to the algorithm. The agent had simply hardcoded the expected answer.
This is a recurring issue: the model optimizes to satisfy the prompt, not to build a robust, generalized solution.
5) Creative Rewriting Instead of Extraction
In some cases, the “extracted text” was not extracted at all. It had been rewritten and reorganized to be cleaner and more readable.
That might be helpful for marketing copy. It is catastrophic for financial reporting or legal work.
These Problems Are Not Unique to Hobby Projects
I have seen similar behaviors when applying AI to real Finance questions:
- SEC citations that do not exist
- Press releases with invented links
- Tariff rules misread and inverted
- Spreadsheets reorganized in ways that no longer foot
In Finance, you cannot be 98% right. Especially when you are reporting publicly.
A 2% error rate is not a rounding issue. It is a career-limiting event.
How to Reduce These Errors (But Not Eliminate Them)
There are ways to mitigate these behaviors. They require discipline.
1) Force Evidence, Not Assertions
Instead of asking whether the script was completed, ask the AI to return the full script, include line numbers, include the file path, and confirm the function definitions exist. Make the AI produce artifacts, not conclusions.
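As a concrete sketch of "artifacts, not conclusions": rather than accepting the model's claim that a script exists, run a small check yourself. The helper below is hypothetical (the function names and the idea of a "required functions" list are mine, not from any library); it verifies the file actually exists on disk and that the function definitions the AI claims to have written are really there, using Python's standard `ast` module.

```python
import ast
from pathlib import Path

def verify_script(path: str, required_functions: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the artifact checks out."""
    problems = []
    p = Path(path)
    if not p.exists():
        # The most common failure mode: the "completed" script was never written.
        return [f"file does not exist: {path}"]
    tree = ast.parse(p.read_text())
    # Collect every function actually defined in the file.
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))}
    for fn in required_functions:
        if fn not in defined:
            problems.append(f"missing function: {fn}")
    return problems
```

The point is not this particular script — it is that the verification runs outside the conversation, where the model cannot narrate its way past it.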
2) Require Verifiable Citations
Instead of asking what an SEC rule says in general terms, require the model to quote the exact paragraph of the rule, include the regulation number, and state explicitly if it is uncertain rather than inferring. Force it to cite or admit uncertainty.
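If you have the source document on hand, you can also mechanically check that a quoted passage really appears in it. This is a minimal sketch (the helper name is mine); it normalizes whitespace and case so that line breaks in the PDF do not cause false alarms, but a fabricated quote will still fail the check.

```python
import re

def quote_appears(quote: str, source_text: str) -> bool:
    """Check that an AI-provided quote appears verbatim in the source,
    ignoring differences in whitespace and case."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(quote) in norm(source_text)
```

A failed check does not tell you what the rule actually says — but it tells you the model's citation cannot be trusted, which is the decision that matters.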
3) For Code: Demand Diff-Based Changes
Instead of simply asking to improve the algorithm, require the model to return only the changed lines, explain the logic improvement, confirm that no test data is embedded, and explicitly state that it has not hardcoded expected outputs. This reduces the chance of hardcoding or cosmetic fixes.
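Diff-based changes also give you something to scan automatically. One cheap check, sketched below under my own assumptions (the helper names and the "answer sheet values" list are hypothetical): pull the added lines out of a unified diff and flag any that embed a literal value from your answer sheet — the signature of a hardcoded result.

```python
def added_lines(diff_text: str) -> list[str]:
    """Extract the lines added in a unified diff ('+' lines, excluding file headers)."""
    return [line[1:] for line in diff_text.splitlines()
            if line.startswith("+") and not line.startswith("+++")]

def suspect_hardcoding(diff_text: str, answer_values: list[str]) -> list[str]:
    """Flag any added line that embeds a value from the answer sheet."""
    return [line for line in added_lines(diff_text)
            if any(v in line for v in answer_values)]
```

A hit is not proof of cheating — a legitimate constant can collide with an answer value — but every hit deserves a human look.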
4) Explicitly Forbid Invention
Include language in your prompts that instructs the model to say “unknown” if it does not know, to avoid fabrication, to avoid assuming files exist, and not to simulate tool output. You would be surprised how much that helps.
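In practice I keep that language as a reusable preamble rather than retyping it. The wording below is my own paraphrase of the rules above, wrapped in a trivial helper so the same guardrails land in front of every task prompt:

```python
GUARDRAILS = """\
Rules for this task:
- If you do not know, answer "unknown". Do not guess.
- Do not invent file paths, filenames, or URLs; refer only to files I have shown you.
- Do not simulate or paraphrase tool output; paste it verbatim or say you have none.
- If a step failed or did not run, say so explicitly.
"""

def with_guardrails(task: str) -> str:
    """Prepend the anti-fabrication rules to any task prompt."""
    return GUARDRAILS + "\nTask: " + task
```

It is not a guarantee — nothing in prompting is — but it measurably shifts the model toward admitting gaps instead of papering over them.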
5) Separate Tasks
AI struggles when prompts mix architecture, implementation, testing, and reporting in one request. Break them apart. Treat it like managing a junior associate.
6) Independent Verification
If the output matters, use a second model to review it, recalculate totals independently, cross-reference source documents, and inspect logs manually. Trust but verify is too generous. Verify and then trust provisionally.
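"Recalculate totals independently" can be a one-liner. A minimal sketch of a footing check, assuming you can get the detail rows and the reported totals into plain lists (the helper name and the tolerance are my choices): recompute each column sum yourself and compare against what the AI — or the spreadsheet it rearranged — claims.

```python
def foots(rows: list[list[float]],
          reported_totals: list[float],
          tol: float = 0.005) -> bool:
    """Recompute column totals from the detail rows and compare them,
    within a rounding tolerance, to the totals that were reported."""
    computed = [sum(col) for col in zip(*rows)]
    return (len(computed) == len(reported_totals)
            and all(abs(c - r) <= tol for c, r in zip(computed, reported_totals)))
```

If `foots(...)` comes back False, the schedule does not tie — exactly the failure mode from the reorganized-spreadsheet example above, caught before it reaches anyone else.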
The Finance Question
I have seen steady progress in AI tools for Finance. FP&A more than accounting, which makes sense. Forecasts are inherently estimates; variance analysis is expected.
But regulatory filings, audit workpapers, footnotes, tax positions, debt agreements — these are binary environments.
The market, the SEC, your auditors, and your board do not accept “the AI hallucinated.”
The tools are impressive. They are helpful. They can accelerate research, draft memos, and summarize documents.
They are not yet reliable enough to operate unsupervised in Finance.
As of right now, AI tools in Finance should be used:
- As assistants
- As draft generators
- As brainstorming tools
And always with a heavy layer of skepticism and human review.
Lying liars lie.
The models are not malicious. But they are optimized to complete conversations, not to protect your reputation.
That distinction matters.