Agentic Deep Research: Architecting AI Financial Analysts with LangGraph & RAG

A comprehensive technical guide to implementing an adaptive, investigative financial analysis system using LangGraph, Pydantic, and Retrieval-Augmented Generation

Technical Introduction

Financial analysis is complex, requiring the integration of internal company data with macroeconomic indicators, industry trends, and qualitative assessments. Traditional approaches involve specialists manually gathering and interpreting diverse datasets - a time-consuming process with inconsistent results.

We've developed a solution that leverages Large Language Models (LLMs) within a structured graph workflow to automate and enhance this process. This architecture enables:

Systematic investigation of financial hypotheses
Evidence gathering across multiple data sources
Dynamic visualization of insights
Confidence-scored analysis

Our system employs the LangGraph framework to orchestrate a multi-step investigative workflow, structured with Pydantic for type safety, and enhanced with retrieval-augmented generation (RAG) to incorporate domain-specific knowledge.

System Architecture

The financial analysis agent operates as a directed graph with distinct nodes handling specific responsibilities:

Each node represents a function that takes the agent's state as input, performs its specialized task, and returns an updated state with new information:

1def retrieve_initial_mda_context_node(state: AgentState) -> AgentState:
2    # Retrieve and add context to state
3    return updated_state
4
5def fetch_fred_data_node(state: AgentState) -> AgentState:
6    # Fetch economic data and add to state
7    return updated_state
8

The core state object maintains all information throughout the process:

1class AgentState(TypedDict):
2    original_query: str
3    primary_document_path: Optional[str]
4    primary_vector_store_path: Optional[str]
5    initial_mda_contexts: Optional[List[Dict[str, Any]]]
6    sales_data_path: Optional[str]
7    marketing_data_path: Optional[str]
8    fred_api_key: Optional[str]
9    fred_series_ids: Optional[List[str]]
10    external_economic_data: Optional[Dict[str, List[Dict[str, Any]]]]
11    hypotheses: Optional[List[Dict[str, Any]]]
12    gathered_evidence: Optional[List[Dict[str, Any]]]
13    charts_for_report: Optional[List[StructuredChart]]
14    current_iteration: int
15    max_iterations: int
16    report_content: Optional[List[ReportItem]]
17    error_message: Optional[str]
18    final_report_generated: bool
19

LangGraph Concurrency Multiple evidence-gathering nodes may run in parallel. Guard shared lists (`gathered_evidence`) with thread-safe updates or use a dedicated "Aggregator" node to merge child results deterministically. GitHub discussions indicate this is a common implementation pitfall when scaling LangGraph applications in production environments.

Deep Research: First Principles for Evidence-Based Financial Analysis

At its core, our system applies four fundamental principles of effective financial research:

1. Hypothesis-Driven Investigation

Rather than aimlessly analyzing data, we structure investigation around testable hypotheses:

1class HypothesisOutput(BaseModel):
2    id: str = Field(description="Unique identifier for the hypothesis")
3    text: str = Field(description="The full text of the hypothesis")
4    confidence: float = Field(description="A score between 0.0 and 1.0 indicating confidence")
5    status: str = Field(description="Current status: 'pending', 'investigated', 'confirmed', etc.")
6

This approach ensures focused analysis and allows for precise tracking of confidence levels.

2. Multi-Source Evidence Gathering

True insight emerges from triangulating multiple data sources:

1DataSourceType = Literal[
2    "QDRANT_MDA",
3    "INTERNAL_CSV_SALES",
4    "INTERNAL_CSV_MARKETING",
5    "FRED_API",
6    "WEB_SEARCH"
7]
8

For example, when investigating revenue changes, the system may:

Retrieve qualitative insights from MD&A documents
Analyze quantitative sales data from internal CSVs
Correlate with economic indicators from FRED

Note-Align Time-Series Granularity FRED macro data is often monthly/quarterly while sales CSVs are daily. Resample both series to a common granularity (e.g., monthly averages) before correlation to avoid spurious insight ("correlation-by-time-shift" bias). The FRED API allows frequency aggregation through its observation endpoints, which can help align time-series data properly.

Initial hypotheses rarely capture the full picture. Our system employs an iterative approach:

1def should_iterate_node(state: AgentState) -> Literal["refine_hypotheses", "synthesize_comprehensive_explanation"]:
2    current_iter = state.get("current_iteration", 0)
3    max_iter = state.get("max_iterations", 1)
4
5    pending_hypotheses = any(h.get("status") == "pending" for h in state.get("hypotheses", []))
6    if not pending_hypotheses and current_iter > 0:
7        return "synthesize_comprehensive_explanation"
8
9    return "refine_hypotheses"
10

This creates a feedback loop where new evidence informs hypothesis refinement until convergence.

Tip — Enable Iteration Short-Circuit Add a *time-budget* or *cost-budget* guard in `should_iterate_node`. If runtime or spend exceeds a threshold, jump straight to synthesis with a "partial findings" flag to optimize for performance in production environments.

4. Quantified Uncertainty

Financial analysis inherently involves uncertainty. We make this explicit through confidence scoring:

1# Example hypothesis with confidence score
2{
3    "id": "H1",
4    "text": "The deceleration in revenue growth is primarily due to external macroeconomic factors",
5    "confidence": 0.75,
6    "status": "confirmed"
7}
8

Note — Confidence ≠ Probability Confidence scores are relative signals produced by the LLM, not calibrated probabilities. Run periodic calibration if you must treat them as probabilities in downstream risk models.

System Flow: Building Confidence Through Iterative Analysis

The agent's investigative process incorporates confidence assessment at multiple stages, creating a transparent and traceable analysis path:

Confidence-Driven Hypothesis Generation

Each hypothesis is assigned a numerical confidence score (0.0-1.0) by the LLM based on:

1class HypothesisOutput(BaseModel):
2    id: str = Field(...)
3    text: str = Field(...)
4    confidence: float = Field(description="A score between 0.0 and 1.0 indicating confidence in the hypothesis.")
5    status: str = Field(...)
6

Initial Scoring: When generating hypotheses, the LLM evaluates how well each potential explanation aligns with the MD&A context and economic data
Quantified Uncertainty: Presenting financial insights with calibrated confidence scores (e.g., "75% confidence") helps users understand the reliability of different explanations

As the system iterates through evidence gathering:

Confidence Evolution: During hypothesis refinement, scores are dynamically adjusted- Evidence supporting a hypothesis increases its confidence score- Contradictory evidence decreases confidence
Status Tracking: The system maintains hypothesis status (`pending`, `investigated`, `confirmed`, `refuted`) alongside confidence
Evidence Quality Assessment: The `executed_success` or `executed_error` status of evidence gathering directly impacts confidence in related hypotheses

1# After processing all actions for a hypothesis, determine its overall status
2all_actions_successful = True
3any_action_executed = False
4for ev_item in state["gathered_evidence"]:
5    if ev_item.get("hypothesis_id") == hypo_id:
6        if ev_item.get("status") == "executed_success":
7            any_action_executed = True
8        elif ev_item.get("status") == "executed_error":
9            all_actions_successful = False
10            any_action_executed = True # an attempt was made
11

Transparency Through Visual Confidence Indicators

The final report integrates confidence scores:

Explicit Confidence Display: Hypotheses are presented with their confidence scores (e.g., "Hypothesis H1 (Confidence: 75%)")
Evidence-Hypothesis Mapping: Charts and evidence points are linked to specific hypotheses, creating a clear audit trail

This confidence-centric design allows users to:

Distinguish between high-confidence insights and speculative possibilities
Understand the evidential basis for each conclusion
Make informed decisions with appropriate levels of caution

Technical Components and Implementation

Structured Data Flow with Pydantic

Type safety is crucial for a complex system integrating multiple data sources. We use Pydantic throughout:

1class StructuredChart(TypedDict):
2    chart_id: str
3    chart_type: ChartType
4    title: str
5    data: List[ChartDataPoint]
6    x_axis_key: Optional[str]
7    y_axis_keys: Optional[List[str]]
8    description: Optional[str]
9
10class EvidenceSourceAction(BaseModel):
11    data_source: DataSourceType = Field(...)
12    query_or_parameters: Union[str, Dict[str, Any]] = Field(...)
13    reasoning: str = Field(...)
14    chart_suggestion: Optional[ChartSuggestion] = Field(...)
15

This enforces data validation at runtime and enables structured LLM outputs.

Note — Protect Internal PII Never embed raw customer-level finance records. Mask personal identifiers, aggregate at the cohort level, or vectorize on-prem and push only embeddings to Qdrant to avoid leaking sensitive data through prompts or logs. Security research confirms that vector databases like Qdrant require application-level encryption to fully protect against PII exposure.

Best Practice — Unit-Test Every Node For each LangGraph node, write a small deterministic test that feeds a stubbed state and asserts the returned schema. This catches drift when prompts or schemas change. The LangGraph documentation recommends node-level testing strategies for reliable production deployments.

LLM-Driven Chart Generation

The system dynamically creates visualizations based on LLM suggestions:

1def _create_chart_from_suggestion(
2    hypothesis_id: str,
3    action_data_source: DataSourceType,
4    chart_suggestion: ChartSuggestion,
5    retrieved_data: List[Dict[str, Any]],
6    charts_for_report_list: List[StructuredChart]
7) -> Union[str, Dict[str, str]]:
8    # Chart creation logic
9    # ...
10    new_chart = PydanticStructuredChart(
11        chart_id=chart_id,
12        chart_type=chart_suggestion.chart_type,
13        title=chart_suggestion.title_suggestion,
14        data=chart_data_points,
15        x_axis_key=x_key_to_use,
16        y_axis_keys=y_keys_to_use,
17        description=f"Chart generated for Hypothesis {hypothesis_id} from {action_data_source} data."
18    )
19
20    charts_for_report_list.append(new_chart.model_dump())
21    return chart_id
22

This allows dynamic visualization of evidence while maintaining traceability to specific hypotheses.

Note — Chart Suggestion Prompt "When suggesting a chart, always specify: `chart_type`, `x_axis_key`, `y_axis_keys`, and a `title_suggestion`. Skip chart creation if the retrieved data has < 3 points." Explicit format instructions help prevent "hallucination" of non-existent charts.

Economic Data Integration with FRED API

The system incorporates external economic indicators through the FRED API:

1def _execute_fred_api_action(
2    action: EvidenceSourceAction,
3    api_key: Optional[str],
4    existing_fred_data: Optional[Dict[str, List[Dict[str, Any]]]]
5) -> Dict[str, Any]:
6    # Extract series ID from action
7    # Check for cached data
8    # Fetch from FRED API if needed
9    url = f"https://api.stlouisfed.org/fred/series/observations"
10    params = {
11        "series_id": series_id_to_fetch,
12        "api_key": api_key,
13        "file_type": "json",
14        "observation_start": start_date_str,
15        "observation_end": end_date_str,
16        "sort_order": "asc"
17    }
18    # Process response and return data
19

This enables correlation between company performance and broader economic trends.

Intelligent Document Retrieval with RAG

The system integrates content from MD&A documents using retrieval-augmented generation:

1def retrieve_initial_mda_context_node(state: AgentState) -> AgentState:
2    # ...
3    try:
4        embeddings = OpenAIEmbeddings(openai_api_key=settings.openai_api_key, model=settings.embedding_model)
5
6        qdrant_store = Qdrant.from_existing_collection(
7            embedding=embeddings,
8            path=vector_store_path,
9            collection_name=QDRANT_COLLECTION_NAME,
10        )
11
12        retrieved_docs: List[Document] = qdrant_store.similarity_search(query, k=3)
13
14        initial_contexts = []
15        for doc in retrieved_docs:
16            initial_contexts.append(
17                {"id": doc.metadata.get("id", f"doc_{len(initial_contexts)}"),
18                 "text": doc.page_content,
19                 "metadata": doc.metadata}
20            )
21        # ...
22

This provides crucial context from company documents to inform hypothesis generation.

Note — Prompt Injection via Web Search Strip HTML/JS, run a profanity/regex filter, and optionally summarize with a separate LLM before sending web results to your main reasoning model. The OWASP Top 10 for LLM Security identifies indirect prompt injection via web search as a significant risk for LLM agents.

Performance and Scalability Considerations

Memory Efficiency

The system is designed to handle large datasets efficiently:

Streaming Data Approach: Economic data is processed in chunks rather than loaded entirely in memory
Vector Store Indexing: Document search uses efficient ANN algorithms via Qdrant for sub-linear ANN retrieval performance
State Management: The TypedDict approach allows selective updates to the state without copying large data structures

Performance Tip — Batch Embedding Calls Embed chunks in batches of 96–128 tokens to maximize OpenAI throughput; this alone can cut embedding cost by 30-40%. Batching also improves performance and latency when processing large documents.

Request Batching

To minimize API costs and latency:

1# Processing multiple hypotheses in one iteration
2for hypo in pending_hypotheses:
3    # Generate evidence plan once per hypothesis
4    evidence_plan_output = _llm_call_create_evidence_gathering_plan(...)
5
6    # Execute multiple actions from the plan
7    for action_idx, action in enumerate(evidence_plan_output.actions):
8        # Execute action...
9

Response Time Optimization

Balancing thoroughness with responsiveness:

Caching FRED Data: Economic data is cached to avoid redundant API calls
Parameterized Iteration Control: The max_iterations parameter allows trading off depth for speed
Progressive Reporting: The system can return partial results before all iterations complete

Implementation Pitfalls and Technical Debt

LLM Instruction Optimization

Careful prompt engineering is critical:

1prompt_template_str = f"""You are an expert financial analyst AI. Your task is to synthesize a comprehensive, easy-to-read financial report based on the provided information. The report should interleave textual explanations with relevant charts.
2
3Context:
41. Original User Query: {original_query}
52. MD&A Context Summary (from company filings):
6{mda_summary}
7...
8

Key lessons learned:

Explicit formatting instructions prevent "hallucination" of non-existent charts
Including reasoning fields helps trace LLM decisions
Breaking complex tasks into discrete steps improves reliability
Need to bake some evals to counter hallucinations.

Error Handling for External Services

Multiple fallback mechanisms ensure resilience:

1try:
2    response = requests.get(url, params=params, timeout=10)
3    response.raise_for_status()
4    # Process response...
5except requests.exceptions.RequestException as e:
6    # Handle API error...
7except Exception as e:
8    # General fallback...
9

Type Safety Challenges

Working with LLMs requires robust validation:

1def _ensure_serializable(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
2    """Convert pandas Timestamps and other non-serializable types to strings in the data"""
3    serialized_data = []
4    for item in data:
5        serialized_item = {}
6        for key, value in item.items():
7            # Convert pandas Timestamps to ISO format strings
8            if isinstance(value, pd.Timestamp):
9                serialized_item[key] = value.isoformat()
10            # Convert numpy types to Python native types
11            elif isinstance(value, (np.int64, np.int32, np.int16, np.int8)):
12                serialized_item[key] = int(value)
13            # ...
14

Technical Monitoring and Observability

Structured Logging

Comprehensive logging enables effective monitoring:

1logger = logging.getLogger(__name__)
2
3# Throughout the code:
4logger.info(f"Executing {data_source_type} action: Analyzing '{csv_path}' with LLM-driven query: '{query_desc}'")
5logger.error(f"Error during QDRANT_MDA action for query '{query[:50]}...': {str(e)}")
6

State Transition Tracking

The graph structure enables clear visibility into the workflow:

1def should_iterate_node(state: AgentState) -> Literal["refine_hypotheses", "synthesize_comprehensive_explanation"]:
2    current_iter = state.get("current_iteration", 0)
3    max_iter = state.get("max_iterations", 1)
4
5    logger.info(f"DEBUG should_iterate_node: Current Iteration (completed): {current_iter}, Max Iterations: {max_iter}")
6    # ...
7

Compliance Note — Audit Trail Persist every hypothesis, evidence artifact, confidence update, and LLM response with a UUID + timestamp. This is invaluable for audits and post-mortems. Financial applications require comprehensive traceability.

Conclusion and Key Takeaways

The financial analysis agent demonstrates how structured LLM workflows can solve complex domain-specific problems:

Hypothesis-driven investigation produces more focused, relevant insights than unstructured analysis
Multi-source evidence gathering enables comprehensive understanding through triangulation
Confidence scoring makes uncertainty explicit, enhancing decision quality
Graph-based workflows provide a modular, extensible architecture for complex AI systems
Type safety with Pydantic ensures reliable operation when integrating LLMs with structured data

Note — Evaluate Before Deploying Run dataset-driven evals (precision of hypothesis confirmations, hallucination rate) before exposing the agent to analysts. Automate nightly regression tests to catch drift. Current research shows that continuous evaluation is critical for maintaining LLM system performance over time.

The architecture can be extended to other domains requiring systematic investigation, hypothesis testing, and evidence-based conclusions - from scientific research to complex diagnostics.

Expert Implementation Support

Need help implementing this architecture for your financial analysis system?

I specialize in designing and implementing advanced LLM-powered agent systems with a focus on financial analysis, data integration, and hypothesis-driven workflows. Having built multiple production systems using LangGraph, RAG, and confidence-scored analytics, I can help your team:

Design custom agentic workflows tailored to your financial data ecosystem
Implement robust type safety and validation layers for LLM outputs
Establish confidence scoring methodologies aligned with your risk tolerance
Create seamless integrations with your existing data infrastructure
Train your engineering team on LangGraph and agent development best practices

Book a technical discovery session: Schedule a 30-minute consultation where we'll discuss your specific implementation challenges and identify the fastest path to a production-ready system. Email saumil@saumilsrivastava.ai or visit saumilsrivastava.ai/book-consultation to get started.