
[Bug]: Race Condition in MCP Tools Caching #124

@web3devz

Description

Bug Summary

The MCP tools caching mechanism in ToolCallAgent has a race condition where multiple concurrent requests can cause cache corruption and memory leaks.

Severity: MEDIUM-HIGH

Bug Details

Location

  • File: /spoon_ai/agents/toolcall.py
  • Lines: 42-86 (MCP tools caching implementation)
  • Method: _get_cached_mcp_tools()

Issue Description

The current caching implementation has several critical issues:

```python
async def _get_cached_mcp_tools(self) -> List[MCPTool]:
    """Get MCP tools with caching to avoid repeated server calls."""
    current_time = time.time()

    # Thread-safe cache check
    async with asyncio.Lock() if not hasattr(self, '_cache_lock') else asyncio.Lock():  # BUG: Creates new lock each time
        if not hasattr(self, '_cache_lock'):
            self._cache_lock = asyncio.Lock()  # BUG: Race condition here
```

Specific Issues Identified

  1. Lock Creation Race Condition:

    • Multiple coroutines can pass the hasattr check simultaneously
    • Each creates a new lock, defeating the purpose of synchronization
    • Can lead to cache corruption when multiple coroutines access the cache simultaneously
  2. Memory Leak in Cache Management:

    • Cache lock objects are created but never properly cleaned up
    • Agent instances accumulate locks over time
    • No mechanism to clean up expired cache locks
  3. Inefficient Lock Usage:

    • Creates new asyncio.Lock() in the conditional statement
    • Should use the existing _cache_lock if available
    • Current code always creates a temporary lock
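The third issue can be demonstrated in isolation: a lock constructed inline in the `async with` statement serializes nothing, because every caller acquires its own private lock. A minimal, self-contained sketch (hypothetical names, not the actual agent code):

```python
import asyncio

counter = {"value": 0}

async def unsafe_increment():
    # A lock created inline guards nothing: every caller gets its own
    # fresh Lock, so the "critical section" is never actually serialized.
    async with asyncio.Lock():
        current = counter["value"]
        await asyncio.sleep(0)          # simulated I/O inside the critical section
        counter["value"] = current + 1  # lost update: overwrites other writers

async def main() -> int:
    await asyncio.gather(*(unsafe_increment() for _ in range(10)))
    return counter["value"]

print(asyncio.run(main()))  # less than 10: increments were lost
```

With a single shared lock, the final count would be 10; with per-call locks, concurrent readers all observe the same stale value and clobber each other's writes.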

Impact

  • Concurrency Issues: Multiple agents fetching tools simultaneously can corrupt cache
  • Memory Leaks: Accumulation of lock objects over time
  • Performance: Inefficient lock creation and cache misses
  • Reliability: Potential deadlocks in high-concurrency scenarios

Steps to Reproduce

  1. Create multiple ToolCallAgent instances
  2. Call _get_cached_mcp_tools() concurrently from multiple coroutines
  3. Observe race conditions in cache lock creation
  4. Monitor memory usage over time to see lock accumulation
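Steps 2–3 can be sketched with a toy model of the lazy lock-creation pattern. One caveat: in pure asyncio, the `hasattr` check and the assignment only race if a suspension point falls between them; the `await asyncio.sleep(0)` below stands in for any such `await` (the class and method names are hypothetical):

```python
import asyncio

class LazyLockAgent:
    """Toy model of lazy lock creation; not the real ToolCallAgent."""

    async def get_lock(self) -> asyncio.Lock:
        if not hasattr(self, "_cache_lock"):
            # Any await between the check and the assignment lets other
            # coroutines pass the hasattr check before the lock exists.
            await asyncio.sleep(0)
            self._cache_lock = asyncio.Lock()
        return self._cache_lock

async def reproduce() -> int:
    agent = LazyLockAgent()
    locks = await asyncio.gather(*(agent.get_lock() for _ in range(5)))
    return len({id(lock) for lock in locks})

distinct = asyncio.run(reproduce())
print(f"distinct locks handed out: {distinct}")  # more than 1 under the race
```

Each coroutine passes the `hasattr` check before any of them assigns, so every caller ends up creating and returning its own lock, which is exactly the accumulation described in issue 2.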

Expected Behavior

  • Thread-safe cache access with proper lock management
  • No memory leaks from accumulated lock objects
  • Consistent cache behavior under high concurrency
  • Proper cache invalidation and cleanup

Current Behavior

  • Race conditions during lock initialization
  • Memory leaks from improper lock cleanup
  • Potential cache corruption under concurrent access
  • Inconsistent cache behavior

Fix Required

```python
def __init__(self, **kwargs):
    super().__init__(**kwargs)
    self._cache_lock = asyncio.Lock()  # Initialize the lock once, in the constructor

async def _get_cached_mcp_tools(self) -> List[MCPTool]:
    """Get MCP tools with caching to avoid repeated server calls."""
    current_time = time.time()

    async with self._cache_lock:  # Use the pre-initialized lock
        # Rest of caching logic...
```
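A fuller sketch of the fixed method, with a TTL-based cache behind the pre-initialized lock. The TTL value, the `_fetch_mcp_tools` stub, and the stripped-down class are assumptions for illustration; the real caching logic lives in `/spoon_ai/agents/toolcall.py`:

```python
import asyncio
import time
from typing import List

class ToolCallAgent:
    """Simplified sketch; the real class has more state and a real MCP client."""

    CACHE_TTL_SECONDS = 60.0  # assumed TTL; use whatever the project configures

    def __init__(self, **kwargs):
        self._cache_lock = asyncio.Lock()   # created exactly once, per instance
        self._cached_tools: List[str] = []
        self._cache_expires_at = 0.0
        self._fetch_count = 0               # instrumentation for the demo only

    async def _fetch_mcp_tools(self) -> List[str]:
        """Stand-in for the real MCP server round trip."""
        self._fetch_count += 1
        await asyncio.sleep(0)
        return ["tool_a", "tool_b"]

    async def _get_cached_mcp_tools(self) -> List[str]:
        async with self._cache_lock:        # pre-initialized lock: no init race
            if time.time() >= self._cache_expires_at:
                self._cached_tools = await self._fetch_mcp_tools()
                self._cache_expires_at = time.time() + self.CACHE_TTL_SECONDS
            return self._cached_tools

async def demo() -> int:
    agent = ToolCallAgent()
    await asyncio.gather(*(agent._get_cached_mcp_tools() for _ in range(5)))
    return agent._fetch_count

fetches = asyncio.run(demo())
print(f"server fetches for 5 concurrent calls: {fetches}")
```

Holding the lock across the fetch also collapses the thundering herd: concurrent callers wait for the first fetch to finish and then read the populated cache, instead of each hitting the MCP server.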

Business Impact

  • Reliability: Cache corruption can cause tool execution failures
  • Performance: Memory leaks affect long-running agent processes
  • Scalability: Race conditions limit concurrent agent capacity
  • User Experience: Inconsistent tool availability and execution
