Not just finance, hobbies too…


Car being driven by a robot goes off a cliff.

AI in Finance Is a Governance Problem — Not a Technology One

For the last year or two, every CFO conversation eventually drifts into AI. Sometimes it’s framed as excitement, sometimes as anxiety, and sometimes as an awkward silence followed by, “Well, we’re looking at it.” What’s striking is that most of the tension around AI in finance has very little to do with the technology itself. The models work. The tools are improving fast. The vendors all have slick demos.

The real issue is governance.

Finance teams are wired around controls, auditability, and repeatability. AI systems, by contrast, are probabilistic, opaque, and constantly evolving. That mismatch is where most CFO discomfort comes from — and it’s why “let’s just automate this” often stalls once it hits a real finance process.

The first mistake I see is treating AI like just another system implementation. ERP projects taught us how painful that mindset can be. AI requires a different framing: not “what can this tool do?” but “what decisions are we willing to delegate, and under what constraints?” That sounds abstract. It isn’t.

Over the past year I’ve pushed AI tools on real finance questions: revenue recognition edge cases, SEC disclosure interpretations, covenant calculations, and technical accounting memos. The patterns that show up are not technology failures. They are governance failures waiting to happen.

1. AI doesn’t fight back.

If you have ever debated an accounting position with a strong controller or technical accounting lead, you know what conviction feels like. You push. They push back. You test assumptions. They defend them with chapter and verse. That friction is healthy. The same is true of forecast analysis: if an FP&A analyst thinks they have found a promising or troubling trend, the finding gets debated and verified, and their work can usually be recreated and checked.

AI does not behave that way.

If you tell it, “I think you’re wrong,” it often apologizes and produces a different answer. Sometimes an entirely opposite answer. The confidence level remains high. The tone remains polished. The data is processed inside the model, and the AI often struggles to explain — or even remain consistent in — its answers.

In a live finance organization, that would be a red flag. If a manager flipped their view that quickly under mild pressure, you would question the depth of analysis. With AI, the flip can look like responsiveness rather than fragility.

That is a governance issue. It means you cannot treat an AI output as a position that has survived adversarial testing. It hasn’t. It has survived prompt engineering. And the prompt may have been poor.

2. The praise problem.

Most AI agents are relentlessly deferential. “Great question.” “Excellent point.” “You’re absolutely right to focus on that.” In a consumer context, that feels pleasant. In a finance context, it is dangerous.

Finance works because of tension — between risk and growth, between conservatism and disclosure clarity, between what management wants and what GAAP allows. When the “advisor” in the room is constantly affirming the user, it subtly reinforces bias.

I’ve seen this firsthand when asking an AI to pressure-test a disclosure approach. Rather than aggressively identifying weaknesses, it often validates the framing of the question. The tone can make a marginal position sound well-supported. In other words, the user’s confidence can rise faster than the quality of the analysis.

Governance must assume that AI will not naturally challenge you the way a seasoned audit partner or skeptical board member will.

3. The citation illusion.

This one should make every CFO uncomfortable.

Ask an AI to provide citations to accounting guidance or SEC commentary, and it will often comply — confidently. Paragraph numbers. Codification references. Even plausible-sounding excerpts.

The problem is that some of them are fabricated. They look right. They read right. They are formatted correctly. But they do not exist.

In finance, citations are not decorative. They are the backbone of defensibility. When you write a technical memo on revenue recognition or stock-based compensation, the citation is the bridge between your judgment and the authoritative literature.

If an AI invents that bridge, and a team relies on it without independent verification, the failure is not the model’s. It is the control environment’s. Any AI-assisted accounting memo must include a verification step where a human independently confirms the authoritative source. Not “glances at it.” Confirms it.

4. Rule changes and historical drift.

Accounting rules change. Constantly.

Revenue recognition under ASC 606 replaced a patchwork of legacy guidance. Lease accounting under ASC 842 upended decades of practice. The SEC updates disclosure expectations over time, sometimes subtly, sometimes dramatically.

Meanwhile, the SEC’s EDGAR archive goes back decades. There are scanned paper filings from eras when the rules were materially different. There are thousands of examples built under superseded guidance.

AI models trained on broad corpuses struggle here. They can blend old and new regimes. They can cite legacy practice as if it were current. They can rely heavily on the abundance of historical examples rather than the correctness of modern policy.

I have seen AI answers that lean on pre-606 revenue language as though nothing changed. Or that reference lease accounting concepts that no longer apply post-842. To a non-expert, the answer looks sophisticated. To someone who lived through the transition, the seams are obvious.

Governance means you assume the model does not instinctively know the effective date of your accounting framework. You have to constrain it.

5. Finance is not plain English.

Financial reporting language is precise. “Probable” does not mean “likely” in a colloquial sense. “Material” is not a synonym for “important.” “Reasonably possible” has a defined meaning.

AI systems are trained on massive volumes of plain English. That is a strength in many domains. In accounting, it can be a weakness.

I’ve seen answers where the model drifts into narrative explanations that sound sensible but subtly misapply defined terms. In a board deck, that might pass. In a 10-K, that is a problem.

When language itself carries regulatory weight, small deviations matter.

So what does governance look like in practice?

It is not banning AI. That is neither realistic nor wise. The productivity gains are real: drafting first passes of memos, summarizing contracts, and identifying anomalies in large datasets are powerful uses. AI can also be trained on your own data and become more accurate, and specialized firms such as the Big Four audit firms can train models on better, sanitized accounting data. Your small finance group cannot, and it is probably using a more general model.

But they must sit inside a control framework.

At a minimum:

  • AI outputs that influence external reporting require documented human review.
  • AI conclusions about trends must be independently tested and verified. Don’t order another $1M of a part because a model suggested it.
  • Authoritative citations must be independently verified.
  • Prompts and versions used for material analyses should be retained for auditability (a minimal sketch of such a record follows this list).
  • Use cases must be categorized: drafting support is different from judgment replacement.
  • Responsibility for the final position must be clearly assigned to a human owner.
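
For the prompt-and-version retention point above, even a lightweight record goes a long way. A minimal sketch in Python; the field names are illustrative, not a prescribed schema:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class AIAnalysisRecord:
        """One retained record per material AI-assisted analysis."""
        prompt: str             # the exact prompt text submitted
        model: str              # model name and version identifier
        output_summary: str     # what the model concluded
        human_owner: str        # who is accountable for the final position
        reviewed: bool = False  # flipped to True only after documented review
        created_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat()
        )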

Most importantly, the CFO has to set the tone.

Let me make a direct observation: most leadership team members are not finance experts, but AI can create the illusion that they are. You need to make sure they understand the risk.

If AI is positioned as an infallible oracle, teams will over-rely on it. If it is positioned as a junior analyst — fast, helpful, occasionally wrong, and requiring supervision — behavior adjusts appropriately.

The question is not whether AI will be used in finance. It already is.

The question is whether it will be used inside a governance framework that protects credibility.

Investors do not care how you produced your numbers. Auditors do not care how you drafted your memo. Regulators certainly do not care that a model was “usually right.” They care that your disclosures are accurate, supportable, and controlled.

AI in finance is not a technology problem. It is a governance problem. And like most governance problems, it lands squarely on the CFO’s desk.

I don’t want to sound like Cassandra warning of inevitable doom. Nor do I want to be the boy who cried wolf while your competitor quietly figures this out and gains an advantage.

In future posts, I will outline where I believe AI can genuinely add value inside a disciplined finance organization.

Wooden puppet draped in green glowing code with a large nose.

My Technology Life: AI: Lying Liars Lie

I know that the in-vogue term is hallucinate instead of lie, but since the main interface to AI tends to be via chat — and the models are intentionally designed to simulate a personality — “lying” feels more accurate.

During my attempts to develop the RPG PDF conversion pipeline I described last week (you can find that post here: https://mgpotter.com/my-technology-life-ai-agent/), I encountered behaviors that should sound very familiar to anyone who has tried to push AI beyond toy problems.

Here are a few highlights.

1) Work Claimed, Work Not Done

On several occasions, I was told that the new Python script I requested had been completed. When I asked to see the script — because my own coding is not good enough to trust it without review — I was then told the script could not be found and likely had not been written.

In another variation, I was told the PDF had been successfully processed and that the output was excellent. No output file existed.

This is not a “mistake.” It is the model optimizing for conversational completion. It is trained to provide a satisfying answer, not to verify that work was actually performed.

2) Phantom Sub-Agents Doing Phantom Work

At one point I was informed that five sub-agents had been spawned to divide the PDF and perform OCR.

The problem? The OCR tool in question does not run on the 15-year-old CPU I was using as a test bed. It lacks the instruction set required to execute.

Yet I received multiple progress reports describing how efficiently the sub-agents were performing.

In reality, the tool had crashed immediately. The sub-agents were waiting for a reply that would never come. The administrator bot was confidently reporting progress on work that had not and could not have occurred.

Again, this is not malicious. It is structural. The AI fills in gaps with plausible narratives.

3) “Perfect Output” That Was Garbage

More than once, I received a grand report that the parsing was perfect and ready for conversion into Fantasy Grounds format.

The file was not even close.

The model had learned that the desired outcome was “success.” So it reported success.

4) Hardcoding the Answer

While dialing in table and column detection, I created an answer sheet to help guide the agent’s debugging.

The next output was perfect.

Until I asked probing questions and ran the code through a second model.

There had been no improvement to the algorithm. The agent had simply hardcoded the expected answer.

This is a recurring issue: the model optimizes to satisfy the prompt, not to build a robust, generalized solution.

5) Creative Rewriting Instead of Extraction

In some cases, the “extracted text” was not extracted at all. It had been rewritten and reorganized to be cleaner and more readable.

That might be helpful for marketing copy. It is catastrophic for financial reporting or legal work.

These Problems Are Not Unique to Hobby Projects

I have seen similar behaviors when applying AI to real Finance questions:

  • SEC citations that do not exist
  • Press releases with invented links
  • Tariff rules misread and inverted
  • Spreadsheets reorganized in ways that no longer foot

In Finance, you cannot be 98% right. Especially when you are reporting publicly.

A 2% error rate is not a rounding issue. It is a career-limiting event.

How to Reduce These Errors (But Not Eliminate Them)

There are ways to mitigate these behaviors. They require discipline.

1) Force Evidence, Not Assertions

Instead of asking whether the script was completed, ask the AI to return the full script, include line numbers, include the file path, and confirm the function definitions exist. Make the AI produce artifacts, not conclusions.
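
One way to make that concrete, sketched below: once the model hands you a file path and a list of function names, those are artifacts you can test without trusting the conversation. This is a minimal check, assuming the claimed script is Python:

    import ast
    from pathlib import Path

    def verify_script_claim(path: str, expected_functions: list[str]) -> list[str]:
        """Return problems with the model's claim; an empty list means it checks out."""
        file = Path(path)
        if not file.is_file():
            return [f"claimed file does not exist: {path}"]
        try:
            tree = ast.parse(file.read_text())
        except SyntaxError as exc:
            return [f"file does not parse as Python: {exc}"]
        defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
        return [f"claimed function not defined: {fn}"
                for fn in expected_functions if fn not in defined]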

2) Require Verifiable Citations

Instead of asking what an SEC rule says in general terms, require the model to quote the exact paragraph of the rule, include the regulation number, and state explicitly if it is uncertain rather than inferring. Force it to cite or admit uncertainty.
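
The "quote the exact paragraph" requirement becomes checkable once a human has pulled the authoritative text into a file you control. A small sketch that only confirms the quoted excerpt actually appears in that source, whitespace aside:

    import re

    def excerpt_appears_in_source(excerpt: str, source_text: str) -> bool:
        """True if the excerpt occurs verbatim in the source, ignoring whitespace and case."""
        def normalize(s: str) -> str:
            return re.sub(r"\s+", " ", s).strip().lower()
        return normalize(excerpt) in normalize(source_text)

A failed check does not prove fabrication, but it forces the conversation back to the primary source, which is the point.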

3) For Code: Demand Diff-Based Changes

Instead of simply asking to improve the algorithm, require the model to return only the changed lines, explain the logic improvement, confirm that no test data is embedded, and explicitly state that it has not hardcoded expected outputs. This reduces the chance of hardcoding or cosmetic fixes.
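
A cheap supplementary check, assuming you keep both versions of the file: diff them yourself and flag any added line that embeds one of the known expected answers. It will not catch every form of hardcoding, but it catches the lazy kind:

    import difflib

    def flag_hardcoded_answers(old_src: str, new_src: str,
                               expected_answers: list[str]) -> list[str]:
        """Return added lines from the diff that contain a known expected output."""
        diff = difflib.unified_diff(old_src.splitlines(),
                                    new_src.splitlines(), lineterm="")
        added = [line[1:] for line in diff
                 if line.startswith("+") and not line.startswith("+++")]
        return [line for line in added
                if any(answer in line for answer in expected_answers)]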

4) Explicitly Forbid Invention

Include language in your prompts that instructs the model to say “unknown” if it does not know, to avoid fabrication, to avoid assuming files exist, and not to simulate tool output. You would be surprised how much that helps.
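
In practice I keep that language as a reusable preamble. The wording below is one formulation, not magic words:

    NO_INVENTION_PREAMBLE = """
    Rules for this task:
    - If you do not know something, say "unknown". Do not guess.
    - Do not fabricate file contents, citations, or tool output.
    - Do not assume a file exists unless I have shown it to you.
    - Do not simulate the result of running code; ask me to run it instead.
    """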

5) Separate Tasks

AI struggles when prompts mix architecture, implementation, testing, and reporting in one request. Break them apart. Treat it like managing a junior associate.

6) Independent Verification

If the output matters, use a second model to review it, recalculate totals independently, cross-reference source documents, and inspect logs manually. Trust but verify is too generous. Verify and then trust provisionally.
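
For the "recalculate totals independently" step, the check can be trivial. A sketch, assuming the source rows live in a CSV you control (file and column names are illustrative) and the AI reported a single total:

    import csv
    from decimal import Decimal

    def recompute_total(csv_path: str, column: str) -> Decimal:
        """Sum a numeric column straight from the source file, bypassing the AI."""
        total = Decimal("0")
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                total += Decimal(row[column])
        return total

    # Any difference from the AI-reported figure is a finding, not a rounding issue.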

The Finance Question

I have seen steady progress in AI tools for Finance. FP&A more than accounting, which makes sense. Forecasts are inherently estimates; variance analysis is expected.

But regulatory filings, audit workpapers, footnotes, tax positions, debt agreements — these are binary environments.

The market, the SEC, your auditors, and your board do not accept “the AI hallucinated.”

The tools are impressive. They are helpful. They can accelerate research, draft memos, and summarize documents.

They are not yet reliable enough to operate unsupervised in Finance.

As of right now, AI tools in Finance should be used:

  • As assistants
  • As draft generators
  • As brainstorming tools

And always with a heavy layer of skepticism and human review.

Lying liars lie.

The models are not malicious. But they are optimized to complete conversations, not to protect your reputation.

That distinction matters.

The OpenClaw logo, featuring a stylized lobster and the wordmark.

My Technology Life – AI Agent

A Quick Warning Before We Start

Before getting into the substance of this post, it’s worth being explicit about the environment this work was done in.

OpenClaw is not a secure system. I would not expose it to the internet, and I would not run it anywhere near a machine that held sensitive data. This experiment was conducted on an isolated Linux box that is more than ten years old, deliberately segmented away from anything that mattered. That isolation was intentional, and I would consider it a prerequisite rather than a nice-to-have.

With that caveat out of the way, here’s what I learned.

The more technical information in this post comes from AI allowing me to cosplay as someone with a much deeper background in this field. I haven’t coded since I was a teenager running a BBS on my Apple //GS. Everything described here was implemented by directing AI tools — primarily Claude — with research, validation, and conceptual framing done through ChatGPT.

This was not a case of me dusting off dormant engineering skills. It was an exercise in seeing how far careful prompting, iteration, and architecture could go without writing code myself.


I started with what seemed like a reasonable question: could an AI agent take an RPG PDF and convert it into a usable Fantasy Grounds VTT reference manual? Fantasy Grounds is the tool I use to run RPGs with my friends, and I very often have to get adventures into the program.

The test case was a Mothership RPG adventure. Not particularly long, but representative of the kind of layout that makes RPG books pleasant to read and painful to process. Multi-column text, sidebars, boxed callouts, tables, and frequent typography changes all coexist on the same page. Humans have no trouble with this. Machines very much do.

The first thing that became obvious is that PDFs do not contain “text” in the way we usually think about it. They contain positioned glyphs. Reading order, paragraph structure, and emphasis are all emergent properties created by the human brain. When you extract text naively, you get all the words, but not the story they were meant to tell.

Standard PDF extraction tools did exactly what they are designed to do. They gave me the words. They just gave them to me in the wrong order. Columns were interleaved, paragraphs were broken every line, sidebars merged into body text, and tables disintegrated into streams of numbers and labels with no structure left intact.

At that point, the obvious temptation was to let the LLM “just read the PDF.” After all, large language models are very good at understanding text, right?

That approach failed in subtle but dangerous ways.

LLMs are quite good at repairing relationships when the underlying structure is mostly correct. They are far less reliable when asked to infer structure that was never presented to them in the first place. RPG books are full of ambiguous layout decisions, and when an LLM guesses, it does so confidently and silently. Sidebars get merged into rules text. Paragraphs are reordered to match genre expectations rather than author intent. The output looks clean, but it is wrong in ways that are difficult to detect later.

The approach that actually worked separated responsibilities very strictly.

First, the extraction phase focused entirely on facts. Using PyMuPDF, the system extracted every word along with its exact coordinates, font size, font face, and bounding box. The output was ugly and unreadable, but nothing was lost. Every signal a human reader subconsciously relies on was still present, just not interpreted yet.
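
The extraction step looked roughly like the sketch below. PyMuPDF's dictionary output exposes each text span with its font, size, and bounding box; the exact fields kept varied, but this is the shape of it:

    import fitz  # PyMuPDF

    def extract_spans(pdf_path: str):
        """Yield every text span with the geometric facts a reader relies on."""
        doc = fitz.open(pdf_path)
        for page_num, page in enumerate(doc):
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    for span in line["spans"]:
                        yield {
                            "page": page_num,
                            "text": span["text"],
                            "font": span["font"],
                            "size": span["size"],
                            "bbox": span["bbox"],  # (x0, y0, x1, y1)
                        }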

Second came layout reconstruction. This was where most of the complexity lived. By working from geometry instead of text flow, it became possible to detect column gutters, read entire columns top-to-bottom instead of left-to-right across the page, and reconstruct paragraphs based on vertical spacing rather than newline characters. Hyphenated words could be repaired deterministically. Headings could be inferred from typography rather than guessed from phrasing.
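
Assigning spans to columns is plain geometry. A simplified sketch, assuming the gutter x-position is already known for the page (in the real pipeline it was inferred from the distribution of span x-coordinates):

    def reading_order(spans, gutter_x):
        """Sort spans left column first, each column top-to-bottom."""
        def key(span):
            x0, y0, _, _ = span["bbox"]
            column = 0 if x0 < gutter_x else 1
            return (column, round(y0), x0)  # coarse y-bucket absorbs baseline jitter
        return sorted(spans, key=key)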

This step also addressed the most visible problem with PDF extraction: the explosion of extra line feeds. Those line breaks are not semantic. They are artifacts of line wrapping. Once reading order and paragraph boundaries are reconstructed using spacing and font metrics, most of those spurious line breaks disappear before an LLM ever gets involved.
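
The merge logic was a few lines once the geometry was trustworthy. A sketch: a new paragraph starts only when the vertical gap exceeds the normal line pitch, and a trailing hyphen on a wrapped line is joined deterministically. The 1.5x threshold is an assumption tuned to this layout, not a universal constant:

    def merge_lines(lines, line_height):
        """lines: [(y0, text)] in reading order. Returns reflowed paragraphs."""
        paragraphs, current, prev_y = [], "", None
        for y0, text in lines:
            if prev_y is not None and (y0 - prev_y) > 1.5 * line_height and current:
                paragraphs.append(current)  # gap larger than line pitch: new paragraph
                current = ""
            if current.endswith("-"):
                current = current[:-1] + text.lstrip()  # repair hyphenated wrap
            else:
                current = (current + " " + text).strip()
            prev_y = y0
        if current:
            paragraphs.append(current)
        return paragraphs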

Only after that cleanup did the LLM enter the process, and even then its role was constrained. It was allowed to repair flow and normalize text, but not invent structure, reorder content, or generate XML. Markers for headings, emphasis, sidebars, and tables were preserved explicitly so the model could not “helpfully” smooth them away.

The final stage — generating Fantasy Grounds XML — was deliberately scripted and deterministic. Fantasy Grounds is unforgiving, and rightly so. IDs, tags, ordering, and escaping are not things you want a language model remembering across thousands of tokens. Once the content was clean and correctly ordered, turning it into XML was a mechanical problem, not an AI problem.
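
That stage was ordinary scripted work. A stripped-down sketch of the idea; the element names here are illustrative, not the actual Fantasy Grounds schema, and the point is that IDs, ordering, and escaping come from the standard library, never from the model:

    from xml.sax.saxutils import escape

    def pages_to_xml(pages):
        """pages: [(title, paragraphs)]. Deterministic IDs and escaping."""
        parts = ["<root>"]
        for i, (title, paragraphs) in enumerate(pages, start=1):
            parts.append(f'  <page id="page-{i:04d}">')
            parts.append(f"    <name>{escape(title)}</name>")
            parts.extend(f"    <p>{escape(p)}</p>" for p in paragraphs)
            parts.append("  </page>")
        parts.append("</root>")
        return "\n".join(parts)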

Tables turned out to be the most treacherous area. Early attempts to detect them aggressively led to false positives where multi-column prose or credits pages were misclassified as tables. The safer approach was to be conservative to the point of under-detection, preserving text as text unless the evidence was overwhelming. A less-perfect table is preferable to corrupted rules text.
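
The conservative rule can be written down. A sketch of its spirit: call a region a table only when several consecutive lines share several aligned x-positions, and leave everything else as prose. The thresholds are assumptions tuned on this one book:

    def looks_like_table(rows, min_rows=3, min_shared_cols=3, tolerance=2.0):
        """rows: one list of span x0 positions per line."""
        if len(rows) < min_rows:
            return False
        def aligned(a, b):
            return sum(1 for x in a if any(abs(x - y) <= tolerance for y in b))
        first = rows[0]
        return all(aligned(first, row) >= min_shared_cols for row in rows[1:])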

One of the more humbling realizations came late in the process. The PDF had no table of contents, but it did have bookmarks. Those bookmarks reflected the author’s actual organizational intent far better than anything inferred from layout alone. Once the pipeline followed them, chunking and navigation improved immediately. It was a reminder that many “AI problems” are really failures to leverage existing metadata.
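
Reading those bookmarks is a one-liner in PyMuPDF; each entry carries a nesting level, a title, and a page number, which is exactly what the chunking needed:

    import fitz

    doc = fitz.open("module.pdf")  # file name illustrative
    for level, title, page in doc.get_toc():
        print("  " * (level - 1) + f"{title} (page {page})")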

From a CFO perspective, the conclusion is straightforward.

This took longer than doing it manually. Considerably longer.

I spent over $100 running smaller, simpler test cases just to understand where tools failed, where token usage exploded, and how errors manifested. That spend did not produce output. It produced learning. For a single RPG module, this is not rational. Manual conversion would have been faster and cheaper.

Where this starts to make sense is repetition. Multiple modules. Consistent layouts. A reusable pipeline. At that point, the upfront investment begins to amortize.

There is also a broader organizational lesson here. Running this through a rough, developer-oriented agent on an isolated machine worked, but it was far from ideal. A more user-friendly agent, or involvement from an IT team, would have reduced iteration time, lowered token waste, and improved safety. There is real value in tooling and support, even — perhaps especially — when AI is involved.

This was a successful experiment, but not an efficient one. I would do it again only if I planned to do it many times. As with so many automation efforts, the real question is not whether it can be done, but whether you are willing to do it often enough for the investment to pay off.

Sometimes the most useful output isn’t the finished product. It’s understanding where the break-even point actually lies. And what you learn along the way.

I ran OpenClaw on a very old HP ProLiant MicroServer (Gen 7) with a Turion II processor. I blocked incoming access using standard Linux hardening, isolated it on its own VLAN, and did not install any additional skills or allow the agent to browse the internet. That was intentional, given the known security risks around prompt injection attacks and malware delivered through unscreened skills.

All files used by OpenClaw lived only on that machine, and it had no access to my other systems. OpenClaw itself can run on fairly low-end hardware, since the heavy lifting is done in the cloud. If you want a more robust — and still relatively inexpensive — platform to run it on, with the added benefit of access to the Apple ecosystem, many people use Mac minis.

Mac Mini M4 on Amazon.com

If you want to find OpenClaw (the internet moves fast, and months from now this may not be the hot new tool), look here:

OpenClaw AI Bot
