Building an AI Agent That Actually Understands Linear Time

By Xiang Li · November 10, 2025

Most people assume LLMs can handle basic things like time.

"Tomorrow."
"Today."
"Next week."

Simple, right?

Wrong.

Our AI recruiter almost booked a candidate into a meeting that literally couldn't happen. We fixed it by giving the agent a manager.

A real one. Just also an LLM.

01

The tiny bug that would have wasted everyone's time

Here's what actually happened.

Sunday in Slack:

Hiring manager: "Book the interview, but I'm no longer available tomorrow."

Monday in email:

Candidate: "I've booked today!"

Most vanilla LLM agents would say:

"Great! Interview confirmed."

But:

  • "Tomorrow" from Sunday
  • = "Today" on Monday
  • = Same day

The hiring manager is not available.

The candidate thinks the interview's on.

The hiring manager thinks it's off.

Your calendar thinks everything's fine.

That's how you get no-shows, awkward apologies, and "Sorry, something came up" nonsense.

Our agent made that exact mistake.

At first.

02

LLMs are surprisingly bad at time

This isn't just "AI being dumb."

There's actual research showing how bad this is.

The ChronoSense benchmark ("ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events", 2025) tests pure temporal reasoning: questions like "Did Event A happen before Event B?" with abstract events (no world-knowledge crutches). On that benchmark:

  • GPT-4o, the best model they tested, only scores 0.78 accuracy on basic before/after questions
  • That's wrong 22% of the time on questions a normal human would get nearly 100% right
  • Smaller open models do far worse; many hover near random guessing

You can read the paper here: https://arxiv.org/abs/2501.03040

It gets uglier when you ask for date math:

  • "This started in 1853 and lasted 24 years, when did it end?"
  • "If something lasted 3 years and ended in 2021, when did it start?"

Without forcing "let's think step by step", even strong models behave like they're basically guessing.
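
For contrast, here's the deterministic version of both questions. The arithmetic is trivial once it happens outside token space, which is exactly the part models fumble:

```python
# The date math from the two questions above, done deterministically.
start_year, duration_years = 1853, 24
print(start_year + duration_years)  # 1877: started 1853, lasted 24 years

end_year, duration_years = 2021, 3
print(end_year - duration_years)    # 2018: lasted 3 years, ended 2021
```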

Why?

Big picture:

  • LLMs don't have a built-in sense of time. They only see token order, not event order, so "before" and "after" aren't native concepts.
  • The training corpus is temporally chaotic. "Tomorrow" in a 2014 blog post is still "tomorrow" in 2025 training data. Nothing is anchored.
  • They lean on memorization, not true temporal logic. When you strip away real event names and just use abstract events, performance falls off a cliff.

So when your agent sees "tomorrow" in one message and "today" in another, days apart?

It doesn't really know what that means in linear time.

It's just playing autocomplete with vibes.
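
The way out is to stop hoping the model intuits this and instead anchor every relative expression to the timestamp of the message that contains it, so the comparison happens in absolute dates. A minimal sketch (the resolve helper is illustrative, not our production code):

```python
from datetime import date, timedelta

# Anchor a relative expression to the day the message was sent.
OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def resolve(expression: str, said_on: date) -> date:
    return said_on + timedelta(days=OFFSETS[expression])

# Hiring manager, Sunday Nov 9: "I'm no longer available tomorrow."
blocked = resolve("tomorrow", said_on=date(2025, 11, 9))

# Candidate, Monday Nov 10: "I've booked today!"
booked = resolve("today", said_on=date(2025, 11, 10))

print(blocked == booked)  # True: both resolve to Monday, Nov 10. Conflict.
```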

03

Our first recruiter failed this exact test

Version one of our recruiter behaved exactly like a vanilla LLM:

  1. Candidate books "today" via Calendly
  2. Agent sees "Event booked"
  3. Agent thinks: "Success!"
  4. It happily confirms the interview

No sense that:

  • The hiring manager had explicitly said they weren't available "tomorrow"
  • "Tomorrow" had now become "today"
  • The same day was actually blocked

Humans do this adjustment without thinking.

LLMs do not.

So instead of trying to make the agent "smarter" with more prompts, we did something else:

We gave it a manager.

04

The pattern: agent + judge

The agent still does the work.

But it doesn't get the last word.

For any important action (like scheduling interviews), we now run this loop:

  1. Agent drafts a plan
    • "Confirm the interview for today"
    • "Move candidate to INTERVIEW_SCHEDULED"
  2. LLM-as-judge reviews the plan
    • It doesn't execute anything
    • It just asks: "Does this make sense?"
  3. The judge has exactly one job:
    • "Does this plan make sense in linear time?"
    • Did someone say they're not available on that date?
    • Are we double-booking?
    • Did the meaning of "tomorrow" change between messages?
  4. If it fails, the judge sends feedback as the agent's manager
    • "You're about to confirm an interview on a day the hiring manager said they're not available. Check the dates."
    • "You shouldn't move this candidate to INTERVIEW_SCHEDULED yet; they still need to reschedule."
  5. Agent revises the plan with that feedback
    • Re-checks dates
    • Adjusts the candidate's stage
    • Changes the message
  6. Only then does it execute.

The important mental shift:

The agent's output is a proposal, not the truth.

We always route that proposal through a manager whose only concern is time.
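
Concretely, the whole pattern is a few lines of orchestration. Here's a minimal sketch, with the two LLM calls abstracted as callables and the prompts illustrative rather than our production prompts:

```python
from typing import Callable

# (system_prompt, user_prompt) -> model completion
LLM = Callable[[str, str], str]

def run_with_judge(context: str, agent: LLM, judge: LLM,
                   execute: Callable[[str], None],
                   max_revisions: int = 3) -> None:
    """Route the agent's proposal through a time-focused judge before acting."""
    feedback = ""
    for _ in range(max_revisions):
        # 1. The agent drafts a plan. Its output is a proposal, not the truth.
        plan = agent(
            "You are a recruiting agent. Propose the next actions.",
            context + (f"\n\nManager feedback on your last draft: {feedback}"
                       if feedback else ""),
        )
        # 2-3. The judge reviews the plan. It never executes anything; its one
        # job is: does this plan make sense in linear time?
        verdict = judge(
            "You are the agent's manager. Reply APPROVE only if the plan makes "
            "sense in linear time: stated availability, dates, double-bookings, "
            "and relative dates ('tomorrow') whose meaning may have shifted "
            "between messages. Otherwise, explain what to fix.",
            f"Conversation so far:\n{context}\n\nProposed plan:\n{plan}",
        )
        if verdict.strip().startswith("APPROVE"):
            execute(plan)        # 6. Only an approved plan runs.
            return
        feedback = verdict       # 4-5. Feed the manager's note back; redraft.
    raise RuntimeError("Plan never passed the time check; escalate to a human.")
```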

05

What the agent's reasoning actually looks like

Here's a (redacted) snippet from the real interaction after we added the judge.

First, the agent adapts its plan after the manager flags the issue:

assistant

11/10/2025, 4:06:48 PM

Looking at this situation with the manager's guidance:

1. Manager Direction: Inform the candidate to reschedule since the hiring manager said they're not available tomorrow (and this booking appears to be today/tomorrow)

2. Meeting Link Issue: Need to flag to the client that there's a problem with the meeting link in calendar invites

3. Current Status: Candidate has booked but needs to reschedule

Next steps:

- Email the candidate asking them to reschedule using the same link, mentioning the hiring manager's availability has changed

- Message the hiring manager about both the scheduling conflict and the meeting link issue

- Keep stage as INTERVIEW_SCHEDULED since technically booked but needs rescheduling

The guardrail then corrects the stage logic:

assistant

11/10/2025, 4:07:09 PM

I see the guardrail flagged that I should keep the stage as SCHEDULING_INTERVIEW since we don't have full confirmation of a completed booking (and the candidate needs to reschedule anyway). Let me adjust:

Finally, once the judge is happy, the agent executes the corrected plan:

assistant

11/10/2025, 4:07:15 PM

Perfect! The action plan has been validated. I'm:

1. Informing the candidate that they need to reschedule due to the hiring manager's availability change, and providing the Calendly link again

2. Updating the hiring manager about the scheduling conflict and flagging the meeting link visibility issue

The candidate remains at SCHEDULING_INTERVIEW stage since we're still working to get the interview properly scheduled.

And the email that actually goes out looks like:

user

11/10/2025, 4:07:40 PM

✅ Action Executed: EMAIL_CANDIDATE

Content:

"Hi there,

Thanks for booking so quickly. Unfortunately, the hiring manager's availability has changed and they're no longer available today.

Could you please use the same Calendly link to reschedule for another day?

We'll also look into the meeting link issue you mentioned to make sure everything works smoothly for the rescheduled call.

Best,

The Superposition team"

No human had to jump in and untangle the dates.

The "manager" was just another LLM running a very narrow playbook:

"Only let this through if the timeline makes sense."
06

Why we didn't just "prompt harder"

There are two naive reactions to this:

  • "Just tell the LLM: be careful with dates."
  • "Use a bigger model. The next one will fix it."

I don't buy either.

ChronoSense shows that even very strong models still fail temporal reasoning at non-trivial rates. GPT-4o still gets basic before/after questions wrong around 22% of the time on abstract tasks.

If your recruiter messes up 1 in 5 schedule-related decisions, that's not "slightly annoying."

That's catastrophic.

We treat LLMs like junior engineers:

  • The first draft is never the final decision
  • It's a suggestion
  • It needs review against hard constraints

In this case, the constraint is simple:

The calendar must make sense.

You don't fix that with vibes.

You fix it with explicit supervision.

07

The guardrail in one sentence

We gave the agent a boss whose only job is to ask:

"Does this plan make sense in linear time?"

That's it.

The judge doesn't care about tone, phrasing, or niceties. It only cares about:

  • Days
  • Dates
  • Availability
  • Consistency between messages

If something doesn't add up, it blocks the action and sends feedback down to the agent in natural language, just like a human manager would:

"You're confirming an interview on a day the hiring manager said they're not available. Double-check the date and reschedule."

The agent then uses that as context and tries again.
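
For the curious, the entire "manager" can be as small as one narrowly scoped prompt plus a checkable verdict format. An illustrative version (not our production prompt):

```python
# Illustrative judge system prompt: the whole "manager" is one narrow
# instruction plus a machine-checkable verdict the loop can branch on.
JUDGE_SYSTEM_PROMPT = """\
You review a recruiting agent's proposed plan. You have exactly one concern:
does the plan make sense in linear time?

Check only:
- Days and dates: resolve every relative date ("today", "tomorrow") against
  the timestamp of the message that used it.
- Availability: did anyone say they are unavailable on an affected date?
- Double-bookings, and stage changes that assume an unconfirmed booking.

Ignore tone, phrasing, and niceties.

Respond with JSON: {"approve": true} or
{"approve": false, "feedback": "<what to fix, phrased as a manager would>"}
"""
```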

No new tools.

No custom calendar engine.

Just better supervision.

08

Why this matters for AI recruiting

Most teams face a binary choice:

  • Ship the naive LLM implementation and accept a bunch of errors
  • Insist everything gets flagged for human review, which removes most of the point of using an LLM in the first place

We took a more sophisticated approach.

We treat the naive LLM output like you might treat a junior employee's work. It's a draft, not the final answer. But instead of having a human manager review every decision, we have another agent act as the manager.

This manager agent checks against a much narrower set of criteria. It doesn't need to understand tone, context, or nuance. It only needs to answer one question: does this plan make sense in linear time?

That's why when our recruiter sees "tomorrow" in one message and "today" in another, it doesn't just process the words. It keeps iterating until time makes sense.

If your agent is going to take real actions that cost a real-world resource (in our case, actual minutes spent interviewing), a human in the loop is only a local maximum.

The global maximum is an LLM-as-judge that lets your agent iterate until it gets it right.

Learn how Superposition works.
See how we can help you hire your founding engineer.