The $100K Consulting Question Our AI Agent Asked for Free

We left our AI agent running during a standup, one of those fifteen-minute morning calls where Li and I argue about which onboarding bug to fix before the next customer demo at 11am. By the time we'd wrapped up, it had a question waiting for us in the terminal:

“What percentage of Li’s current engineering time is actually building the taste layer versus the plumbing around it?”

Li stared at the screen for a while and then said something I wasn't expecting: “I haven't felt goosebumps like this since the first time I used GPT-4.” He wasn't being dramatic, either. He messaged me about it again that evening, and when I saw him the next morning he was still turning it over.

No one at our startup could have produced that question. We're definitely not paying a hundred grand for a consultant in a suit to come poke around our Notion workspace for a few weeks, and even if we were, they'd never get this deep into what we actually care about. They'd get our org chart and our monthly revenue. They would not get the three months of call transcripts where customers keep telling us the feel of our agent is what sold them, or the Slack thread where Li and I debated for two days about whether to let clients filter candidates by gender, or frankly any of the fourteen separate conversations we've had trying to pin down what “taste” even means when you're building an AI recruiter.

The people who actually hold all that context (us, the founders) are too deep in the day to day to sit with it properly. Maybe if we'd locked ourselves in a room on a Sunday night and forced ourselves to just sit there and think for several hours, one of us would've eventually arrived at that question. But we're not doing that, because there's always another onboarding to prep or another bug in the candidate ranking pipeline. No meditation needed, no ayahuasca. Five minutes of compute, outputted in the command line.

The markdown graveyard

Six months before that standup, my agent couldn't do anything remotely like this. It could summarize my Gmail and nag me about calendar events, which is about what you'd expect from a Claude Code subscription with Notion and Gmail MCP servers plugged in. Anything that required actual strategic judgment? You'd get a confident-sounding paragraph built entirely on hallucinated context. It once told me our biggest competitor was a company that didn't exist.

The root cause, I eventually figured out, was how I was storing context. Everything lived in flat markdown, in a single file called memory.md. Append-only. No structure whatsoever. Within about a week, that file was a complete disaster. A half-formed note I'd dictated about some Slack bug sitting right next to the pricing structure we'd hammered out with our first real paying customer. The agent treated both with equal importance because there was literally nothing to tell them apart.

Li had been building his own agent setup in parallel (he's our CTO, builds everything) and had crashed into the exact same wall. “I'd connect Claude to Gmail, Slack, Notion, Calendar, and then it just gets confused because it doesn't know where to pull relevant context,” he told me during one of our standups. “The minute you stop hand-feeding it, it starts hallucinating and becomes completely useless.”

So I went looking at what other people had tried. The most sophisticated open-source approach I could find was OpenClaw, which uses a two-tier system: important information gets persisted into a long-term store, and everything else has time decay built in where recent stuff is weighted higher and older context gradually fades. Clearly better than dumping everything into one markdown file. But after studying it for a while, I realized the issue was more fundamental: there's zero conceptual relationship between those two tiers. A critical pricing decision from two months back and a throwaway Slack message from yesterday? Same structural treatment. The only difference is the timestamp.

What I actually wanted, and this took me embarrassingly long to articulate even to myself, was an ontology. Some kind of map of what we should actually care about as a company. But I specifically did not want to write that map by hand, because I knew two things: first, that I'd get it wrong (I always think I know what matters until I don't), and second, that the whole value of the system would come from it figuring out what matters by watching what we actually spend our time talking about and deciding.

Databases, not files

I ended up using Notion, and not for any particularly clever reason. It's what I was already paying for, and it has the two properties that actually matter here: databases with columns you can create programmatically, and pages that both humans and agents can read through MCP.

Here's what I did. I created four completely empty databases. I'm talking blank: a title column and a created-at timestamp. That's it. Zero properties, zero views, zero templates. Then I told the agent, in its system prompt, that it had permission to create whatever properties, taxonomies, and relationships it needed inside those databases.

The hierarchy ended up looking like this:

Themes at the top. Strategic questions the company keeps orbiting around. Things like “headhunting versus inbound applicants” or “should we charge placement fees or subscriptions.” I want to be clear: I didn't type a single one of these in. The agent derived them by reading my email threads, the transcripts from every customer call I had in Granola, and weeks of Slack history. When it spots something relevant to an existing theme (say, a newsletter mentioning that a competitor just switched to subscription pricing) it appends a timestamped signal and regenerates a synthesis paragraph at the top of the theme page. So each theme ends up as a living argument that gets smarter over time, not a sticky note that yellows.

Decisions one level down. I actually built this database first, for a reason that might sound strange: I needed the agent to remember its own decisions. In a flat markdown setup, there's no versioning. The agent literally couldn't tell the difference between a choice it made last Tuesday and something that had been overridden on Thursday. But once I had that working, I realized the bigger win was capturing our decisions, the founders' decisions. One of our customers asked whether our agent could exclusively search for female engineers. That's not a yes-or-no question. We spent real time debating it, and we reached a conclusion that had actual reasoning behind it. Normally that conclusion lives and dies in a Slack thread. Now it's a Notion record with the context, the logic, and a relation back to the themes it connects to.

Then Ideas and Tasks below that, where things actually get built and shipped. Context percolates upward through the hierarchy. Concrete work cascades down.

The part that was hardest to commit to: within each of those four databases, the agent does everything. It invents properties. It creates status taxonomies. It builds relationships between records. A few weeks in, it decided the Themes database needed a “Signal Count” property, created it, and then went back through every single existing theme to fill it in retroactively. The shape of the data is genuinely different from month to month.

I designed the architecture. The agent furnishes the rooms.

Why this produces questions, not just answers

Most of the agent setups I've seen (and most of the ones I built before landing on this) are optimized for doing stuff. Summarize that email. Draft a reply. Pull the action items out of this call transcript and put them in Notion. Grade the agent on throughput: how many tasks did it knock out today? That's genuinely useful, and I still run dozens of those pipelines every morning when my daily brief gets assembled. But there's a ceiling on it. An agent that only does things will never flag the thing you should actually be worried about.

What I wasn't expecting from this architecture, honestly, is that it would start generating strategic insight unprompted. But that's what happens when you have an agent continuously funneling raw context into structured databases that have real relationships between them. It builds up a picture of the company that neither Li nor I hold individually at any given moment. I know what my customer calls were about this week. Li knows which pull requests are stuck. Neither of us is tracking how both of those things connect to the competitive signals from last month, the pricing decisions we made in January, and the recurring theme about onboarding quality that keeps surfacing across seemingly unrelated conversations. The agent sees all of that simultaneously, because it's the one writing it all down.

So I built a thing I called the Reflect skill. The instruction is almost comically simple: take everything that was appended to the context databases in the last week, reason over it against everything you know about the founders and the company, and produce one question, shortest possible wording, about what you believe is the single most important thing we should be thinking about right now.

That question about taste versus plumbing hit both of us so hard because of what the agent had been observing from two completely different angles. On one side: Li's Notion board was filling up with cards for data vendor integrations and infrastructure migrations. Necessary work, but not the stuff that makes a customer say “holy shit” during a demo. On the other side: I'd had maybe eight customer conversations that month where the thing they kept coming back to was how the quality and personality of the agent experience is what made them choose us over competitors with slicker landing pages. The agent saw the gap between where our engineering time was going and what every customer kept saying actually mattered. Nobody told it to compare those two things. It just did, because the data was there in the structured context and the relationships between the databases made it possible.

Most off-the-shelf AI products are genuinely terrible at this kind of synthesis. Summarization? They're fine at that. But stitching together threads from completely different domains (engineering velocity, customer sentiment, strategic positioning) and producing a pointed question? That requires context that compounds over time instead of evaporating. That's what this architecture actually gives you: every email processed, every call transcribed, every decision recorded makes the agent's next question sharper than the last one.

How to build your own

This runs on Claude Code with Notion and a handful of MCP servers, but the pattern doesn't require our specific stack. It works with Obsidian, Supabase, Airtable, really anything that gives you database-like tables where an agent can create columns and properties on the fly.

You need three things.

First, create your context hierarchy. Four blank databases, four levels of abstraction. I used Themes, Decisions, Ideas, and Tasks, but you should call them whatever maps to how you actually think about your company. The specific labels matter way less than the fact that they represent genuinely different altitudes of information, from strategic questions that persist for months all the way down to concrete work items that get shipped this sprint.

Start every database completely blank. Title field, created date, nothing else. Then (and this is the part that feels wrong at first) give the agent explicit written permission in your system prompt to create any properties, status fields, tags, and relationships it decides it needs. You will feel the urge to pre-design the schema. Resist it.

Second, connect your context sources. The agent needs a continuous feed of raw material without you having to think about it. Our inputs are Granola for call transcripts, Gmail for email, Slack for internal chatter, and Notion itself for the databases it's already writing to, all connected through MCP servers. Your sources will obviously look different. The non-negotiable part is that it should be automatic. If you have to remember to copy-paste a call summary for the agent to process it, you'll stop doing it within a week. I certainly would.

Third, build a Reflect skill. Easiest piece to implement and by far the highest-leverage one. The prompt I use is roughly:

Look at everything that was appended to the context databases this week. Consider it against the full picture of the company: all active themes, recent decisions, open questions you've identified. Produce a single question, as few words as you can manage, about what you believe is the most important thing this business should be thinking about right now.

One question only. Not a summary, not a daily digest, not a bullet list of “areas to consider.” A question, because questions force the agent to commit to a specific point of view rather than hedging across a dozen safe observations. And they force you to engage with it as a thought, not skim it as a report.

Why cap it at one? Because if you ask for ten questions you'll get a document. You'll read the first three, skim the rest, and file it somewhere you'll never look again. A single question is a provocation, the kind of thing you end up arguing about in the shower twenty minutes later. The constraint is what makes it valuable. And the best questions are confronting. They put words to the semi-formed worry that's been rattling around in the back of your head for weeks, the one you keep shoving aside because there's always a more pressing fire to deal with first.

If you're a founder, I guarantee there's a question you need to be asked that you're currently suppressing. This is how to bring that question to the surface.