I fed our entire codebase to Gemini 3.1 Pro (2M context). Here's what happened.

What it's like to dump 800,000 tokens of production code into an AI and ask it to find problems. Genuinely useful, occasionally wrong.

Alex Kim

Frontend engineer at a Series B startup. Previously Google. Obsessed with performance and developer experience.

Gemini 3.1 Pro has a 2 million token context window. Our entire frontend codebase (340 files, ~800,000 tokens) fits inside it and uses only about 40% of the window.
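For a rough sanity check before uploading, the budget arithmetic can be sketched as below. The ratio of 4 characters per token is an assumption (a common rule of thumb, not Gemini's actual tokenizer), and `estimateTokens` and `fitsInContext` are hypothetical helper names.

```typescript
// Rough token estimate from a character count.
// ASSUMPTION: ~4 characters per token; real tokenizers vary, especially on code.
function estimateTokens(charCount: number, charsPerToken: number = 4): number {
  return Math.ceil(charCount / charsPerToken);
}

// Checks whether the estimated input plus a reserve for the model's reply
// stays within the context limit.
function fitsInContext(
  totalChars: number,
  contextLimit: number,
  reserveForReply: number
): boolean {
  return estimateTokens(totalChars) + reserveForReply <= contextLimit;
}

// Example: ~3.2M characters is roughly 800,000 tokens under the assumption above.
const estimated = estimateTokens(3200000);
const fits = fitsInContext(3200000, 2000000, 50000);
```

This is only a heuristic. For an exact figure, count tokens with whatever token-counting facility your provider exposes.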

I spent a day testing what this actually enables. Short version: it's more useful than I expected and more limited than the marketing suggests.

What I tested

Our frontend is a Next.js application that's been in development for two years. It has the usual problems of any two-year-old codebase: inconsistent patterns, deprecated approaches mixed with newer ones, technical debt we know about and technical debt we don't.

I uploaded the entire codebase (zipped) and ran four experiments.

The experiments

"Find all the technical debt"

Prompt: "This is our complete frontend codebase. Identify the top 10 technical debt items by severity. Focus on things that will cause real problems — performance, maintainability, security — not stylistic preferences."

What I got: A genuinely useful list. The top item it flagged was something I knew about (we were still using a deprecated React pattern for context that causes unnecessary re-renders). The second was something I didn't know about — we had three different implementations of the same data fetching pattern across different parts of the app, and two of them had subtle race conditions under specific network conditions.

Items 3-7 were real issues of varying severity. Items 8-10 were real but minor.

None of the ten were hallucinated. I checked each one against the code. That surprised me.
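For illustration only: one common form of a data-fetching race is a slow early response overwriting a newer one. The sketch below is not the code from our app, and it may not match the exact race Gemini found. `createLatestOnly` is a hypothetical helper showing a "latest request wins" guard.

```typescript
// Tracks which request is the most recent, so stale responses can be ignored.
// Hypothetical helper, not taken from any library.
function createLatestOnly(): { begin: () => () => boolean } {
  let latestId = 0;
  return {
    // Call begin() when a request starts. It returns a function that reports
    // whether this request is still the most recent one.
    begin(): () => boolean {
      latestId += 1;
      const myId = latestId;
      return () => myId === latestId;
    },
  };
}

// Intended use inside an async handler (shown as comments, since the
// fetch call and state setter depend on the app):
//
//   const isCurrent = guard.begin();
//   const data = await loadResults(query);   // hypothetical fetcher
//   if (isCurrent()) setResults(data);        // skip stale responses

const guard = createLatestOnly();
const firstIsCurrent = guard.begin();  // request A starts
const secondIsCurrent = guard.begin(); // request B starts before A finishes
```

Here request A would be discarded if it resolved late, because `firstIsCurrent()` is false once B has started. Cancelling the older request outright is another option. This guard only ignores its result.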

"Find inconsistent patterns"

Prompt: "Map all the different patterns used for the same problems: data fetching, error handling, form validation, authentication checks. Tell me which pattern is best for each and what would need to change."

This was the most useful thing I ran. It found 4 different patterns for handling loading states, 3 for error boundaries, and 2 for authentication guards (one of which was subtly broken in a way that only matters on slow connections).

The recommendation for which to standardize on was good — it picked the patterns that were already most consistently used in the newest parts of the codebase.
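To make the auth guard finding concrete, here is one plausible shape of a "slow connection" bug. It is an assumption for illustration, not the actual defect: a guard that treats "session still loading" the same as "logged out" (or "logged in") misbehaves whenever the session check is slow. Modelling the loading state explicitly avoids that. `AuthState`, `GuardDecision`, and `decideGuard` are hypothetical names.

```typescript
// Three explicit states, so "still loading" is never confused with a real answer.
type AuthState =
  | { status: "loading" }
  | { status: "authenticated"; userId: string }
  | { status: "unauthenticated" };

type GuardDecision = "wait" | "render" | "redirect";

// Decides what a protected route should do for a given auth state.
function decideGuard(state: AuthState): GuardDecision {
  switch (state.status) {
    case "loading":
      return "wait"; // show a placeholder; do not redirect or render yet
    case "authenticated":
      return "render";
    case "unauthenticated":
      return "redirect";
  }
}

const whileLoading = decideGuard({ status: "loading" });
```

Having a single decision function also makes it easier to standardize on one guard pattern, since every protected route can call the same logic.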

"Explain this architecture to a new engineer"

Prompt: "Write an architecture overview for a new engineer joining the team. Cover: what this app does, how data flows, what the key abstractions are, what the gotchas are."

Surprisingly good output. Better than our actual documentation. I sent it to our last two hires and both said it was more useful than what we had.

"Find security issues"

This is where it was weakest. It flagged some real issues (CSP headers could be stricter, one component was rendering user input without sanitization). But it also produced two false positives that would have sent me on a debugging chase if I'd trusted them blindly.

The sanitization issue was real, though, and we shipped a fix based on it.
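As a minimal sketch of the class of fix involved (not the patch we shipped), this is what escaping user input before it reaches an HTML context looks like. `escapeHtml` is a hypothetical helper. If you need to allow some markup through, use a maintained sanitizer library instead of hand-rolled escaping.

```typescript
// Maps each HTML-significant character to its entity.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Escapes text so it is displayed literally instead of being parsed as HTML.
function escapeHtml(input: string): string {
  return input.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

const unsafe = '<img src=x onerror="alert(1)">';
const safe = escapeHtml(unsafe);
```

React escapes text rendered as JSX children by default, so this kind of issue usually shows up where raw HTML is injected deliberately.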

The limitations

It's not fast. Uploading ~800K tokens takes a minute. Processing and generating a response takes 2-4 minutes. This isn't a "quick question" tool — it's a "start this analysis and come back" tool.

It doesn't know your business context. It can find that a function is called inconsistently, but it can't tell you if that's intentional (because the inconsistency reflects a real product difference) or accidental. You have to apply that judgment.

It hallucinates on specifics sometimes. In the "find security issues" task, one of the false positives cited a function that didn't exist: it had generated a plausible-sounding function name that wasn't in the codebase. This is the scary kind of mistake because it sounds authoritative.
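A cheap defence is to check mechanically that every identifier the model cites actually appears in the code. The sketch below assumes you have already collected the cited names by hand and loaded the files into memory. `SourceFile` and `findMissingIdentifiers` are hypothetical names, and the file contents are made up for the example.

```typescript
interface SourceFile {
  path: string;
  text: string;
}

// Escapes regex metacharacters so an identifier is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the cited identifiers that do not appear as whole words in any file.
function findMissingIdentifiers(files: SourceFile[], cited: string[]): string[] {
  return cited.filter((name) => {
    const pattern = new RegExp("\\b" + escapeRegExp(name) + "\\b");
    return !files.some((file) => pattern.test(file.text));
  });
}

// Made-up example data.
const files: SourceFile[] = [
  { path: "src/auth/token.ts", text: "export function validateToken() {}" },
];
const missing = findMissingIdentifiers(files, ["validateToken", "sanitizeSession"]);
```

Anything in `missing` is a claim to treat with suspicion. A textual match only proves the name exists, not that the model described its behaviour correctly.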

The 2M token limit is the ceiling, not the starting point. It works better with less context. When I uploaded just the auth-related files instead of everything, the security analysis was sharper. Focused context beats maximum context for specific questions.
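Picking the focused subset can be as simple as filtering paths by keyword before upload. This is a minimal sketch with made-up paths. `selectFocusedFiles` is a hypothetical helper, and substring matching is a deliberately crude assumption (a keyword like "auth" would also match "author").

```typescript
// Keeps only the paths that contain at least one keyword (case-insensitive).
function selectFocusedFiles(paths: string[], keywords: string[]): string[] {
  const needles = keywords.map((k) => k.toLowerCase());
  return paths.filter((p) => {
    const lower = p.toLowerCase();
    return needles.some((n) => lower.indexOf(n) !== -1);
  });
}

// Made-up example paths.
const allPaths = [
  "src/auth/session.ts",
  "src/components/Button.tsx",
  "src/middleware/AuthGuard.tsx",
  "src/lib/login.ts",
];
const authRelated = selectFocusedFiles(allPaths, ["auth", "login"]);
```

In practice you would probably also pull in the files those modules import, so the model sees the helpers the auth code depends on.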

Practical recommendations

If you have a large codebase and want to try this:

1. Start with a specific question, not "find all problems"

2. Cross-check everything against the actual code

3. Use it for analysis and discovery, not as a source of truth

4. Upload related files together rather than the whole repo for focused questions

5. The architecture documentation use case is underrated — great for onboarding

The 2 million token context window is a real capability. It's not magic, but it's useful in ways that smaller-context models can't replicate.

Tags: gemini, large context, code review, architecture
