Building a Stock Analysis Agent with LangChain

2025-12-31

I wanted to build something that could pull stock data and actually say something useful about it. Not just "here's the price" but "here's what's interesting about this stock right now." The kind of summary a human analyst would write.

The result: a LangChain agent that fetches data from Polygon.io, feeds it to Claude, and returns structured analysis. The interesting parts were getting LangChain's tooling to work smoothly and building evals to catch when the LLM hallucinates numbers.

Polygon.io ──▶ LangChain Tools ──▶ Claude ──▶ Structured Output
     │                                              │
     └──────────── LangSmith Tracing ───────────────┘

The @tool decorator

LangChain's @tool decorator is the cleanest part of the framework. Slap it on a function, write a good docstring, and you've got an LLM-callable tool:

import os

import requests
from langchain_core.tools import tool

API_KEY = os.environ["POLYGON_API_KEY"]

@tool
def get_stock_snapshot(ticker: str) -> dict:
    """Fetch current stock price, volume, and day's high/low."""
    data = requests.get(
        f"https://api.polygon.io/v2/snapshot/.../tickers/{ticker}",
        params={"apiKey": API_KEY},
    ).json()
    return {
        "current_price": data["ticker"]["lastTrade"]["p"],
        "volume": data["ticker"]["day"]["v"],
    }

I built three of these: one for real-time snapshots, one for 30-day aggregates, one for company info. The docstrings matter. They're what the LLM sees when deciding which tool to call.
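The decorator also makes that contract inspectable: the function becomes a LangChain tool object, and you can print exactly what the model will see.

print(get_stock_snapshot.name)         # "get_stock_snapshot"
print(get_stock_snapshot.description)  # the docstring above
print(get_stock_snapshot.args)         # {'ticker': {'title': 'Ticker', 'type': 'string'}}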

Forcing structured output

The problem with just asking Claude "analyze this stock" is you get prose. Sometimes it mentions risk factors, sometimes it doesn't. Sometimes the format changes. Pain to parse.

Pydantic fixes this:

from pydantic import BaseModel

class StockAnalysis(BaseModel):
    ticker: str
    key_metrics: str
    price_trend: str
    risk_factors: str
    overall_summary: str

Then llm.with_structured_output(StockAnalysis) forces Claude to fill in every field. No more optional sections.
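Wiring it up is one call. A minimal sketch, assuming ChatAnthropic (the model name is a placeholder):

from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-20250514")  # placeholder model name
structured_llm = llm.with_structured_output(StockAnalysis)

result = structured_llm.invoke("Analyze AAPL given this data: ...")
print(result.risk_factors)  # always present, typed str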

LCEL chains

LangChain Expression Language took me a bit to grok. The | operator pipes data through stages:

from langchain_core.runnables import RunnableLambda, RunnablePassthrough

chain = (
    RunnablePassthrough()                 # passes {"ticker": ...} straight through
    | RunnableLambda(gather_stock_data)   # calls all 3 Polygon tools, flattens results
    | analysis_prompt                     # fills in the prompt template
    | structured_llm                      # returns a StockAnalysis instance
)

One gotcha: prompt template variables can't reach into nested dicts, so gather_stock_data has to flatten everything. snapshot["day"]["high"] becomes a top-level day_high. Annoying but manageable (see the sketch below).
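Roughly what that flattening looks like. This is a reconstruction, assuming the real tools return Polygon's raw nested payloads rather than the pre-flattened dict in my simplified snapshot example:

def gather_stock_data(inputs: dict) -> dict:
    """Invoke each Polygon tool and flatten nested fields for the prompt template."""
    ticker = inputs["ticker"]
    snapshot = get_stock_snapshot.invoke({"ticker": ticker})  # @tool objects are Runnables
    return {
        "ticker": ticker,
        "day_high": snapshot["day"]["high"],  # nested -> flat, per the gotcha above
        "day_low": snapshot["day"]["low"],
        "volume": snapshot["day"]["volume"],
        # ...plus the 30-day aggregate and company-info fields, flattened the same way
    }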

The interesting part: evals

Here's what I actually spent most of my time on. It's easy to get a demo working. It's hard to know if it's good.

LangSmith lets you define test cases and custom evaluators. Mine check three things: that expected terms show up (expected_mentions), that the analysis cites concrete numbers at all (cites_numbers), and that those numbers actually appear in the input (hallucinations).

The hallucination evaluator compares numbers in the output against numbers in the input. If Claude says "market cap of $2.8T" but the input said $2.5T, that's a flag.

from langsmith.schemas import Example, Run

def check_hallucinations(run: Run, example: Example) -> dict:
    output_numbers = extract_numbers(str(run.outputs))
    input_numbers = extract_numbers(str(run.inputs))
    if not output_numbers:  # nothing cited, nothing hallucinated
        return {"key": "hallucinations", "score": 1.0}

    # a number counts as matched if something close to it appears in the input
    matched = [n for n in output_numbers if any(close_enough(n, i) for i in input_numbers)]
    hallucination_rate = 1 - (len(matched) / len(output_numbers))

    return {"key": "hallucinations", "score": 1 - hallucination_rate}
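For completeness, a minimal sketch of what the two helpers might look like. These names are mine, and the 5% tolerance is the crude part I come back to at the end:

import re

def extract_numbers(text: str) -> list[float]:
    """Pull numerals out of text, ignoring commas (units like T/M are lost)."""
    return [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*\.?\d*", text)]

def close_enough(a: float, b: float, tol: float = 0.05) -> bool:
    """True if a is within tol (5%) of b."""
    return abs(a - b) <= tol * max(abs(b), 1e-9)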

Running evals:

from langsmith import evaluate

results = evaluate(
    lambda inputs: chain.invoke(inputs),
    data="stock-analysis-test",
    evaluators=[check_cites_numbers, check_hallucinations, ...],
)

Sample output:

Example 1/4: AAPL
  ✅ expected_mentions: 1.0 (Found 3/3: Apple, market cap, volume)
  ✅ cites_numbers: 1.0 (Found 12 numbers)
  ✅ hallucinations: 0.95 (Matched 11/12 numbers)

Example 2/4: TSLA
  ⚠️ hallucinations: 0.87 (2 numbers not found in input)

Aggregate: 0.91 hallucination score

That TSLA result is the kind of thing you only catch with evals. The analysis looked fine. The numbers were wrong.

Running it

python src/main.py --ticker NVDA
Ticker: NVDA
Key Metrics: $142.50 current price, 48.2M volume, $3.5T market cap
Price Trend: Up 12.3% over 30 days, testing resistance at $145
Risk Factors: Elevated P/E ratio, AI spending concentration risk

What I'd do differently

Polygon's free tier only allows 5 API calls/minute. Should've added Redis caching from the start. Also, the three tool calls run sequentially even though they're independent. They could run in parallel (see the sketch below).
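LCEL already has the primitive for that. A sketch with RunnableParallel; get_30day_aggregates and get_company_info are stand-in names for the other two tools:

from langchain_core.runnables import RunnableParallel

gather_parallel = RunnableParallel(
    snapshot=get_stock_snapshot,
    aggregates=get_30day_aggregates,  # stand-in name for the 30-day tool
    company=get_company_info,         # stand-in name for the company-info tool
)

# Branches run concurrently on invoke; the result is one dict keyed by branch name.
data = gather_parallel.invoke({"ticker": "NVDA"})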

The hallucination evaluator is crude. It just looks for number matches within 5% tolerance. A better version would understand that "nearly $3T" matching "$2.8T" is probably fine, but "$180 price target" appearing from nowhere is not.
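One direction I'd try: an LLM-as-judge evaluator that sees both the input data and the analysis, and rules on whether each number is supported, approximation allowed. A rough sketch, reusing the Run and Example imports from earlier (model name again a placeholder):

import json

from langchain_anthropic import ChatAnthropic

judge = ChatAnthropic(model="claude-sonnet-4-20250514")  # placeholder model name

def check_hallucinations_judge(run: Run, example: Example) -> dict:
    """Score 1.0 if the judge finds every output number supported by the input."""
    verdict = judge.invoke(
        f"Source data:\n{json.dumps(run.inputs, default=str)}\n\n"
        f"Analysis:\n{json.dumps(run.outputs, default=str)}\n\n"
        "Does the analysis cite any number not supported by the source data? "
        "Treat rounding and approximations like 'nearly $3T' as supported. "
        "Answer with one word: SUPPORTED or UNSUPPORTED."
    )
    return {"key": "hallucinations_judge", "score": float("UNSUPPORTED" not in verdict.content)}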