
Example Use Case: Software Development Consultancy Finland

This document demonstrates the example use case from the bootcamp requirements.

Use Case Description

Search Term: "Software development consultancy finland"
Goal: Analyze how companies describe their values
Expected Output: A table showing each company's values classified as soft or hard

Running the Example

Option 1: Using the Web Interface

  1. Start the application:

    streamlit run app.py
  2. Go to the Search-Based tab

  3. Enter:

    • Search term: Software development consultancy finland
    • Number of results: 5
  4. Click "Start Search-Based Crawl"

  5. Wait for results (typically 1-3 minutes per company, so 5-15 minutes for 5 results)

Option 2: Using the Command Line

Run the built-in example:

python orchestrator.py

This automatically runs the example use case and generates reports.

Option 3: Programmatic Usage

import asyncio
from orchestrator import CrawlOrchestrator

async def run_example():
    orchestrator = CrawlOrchestrator()
    
    results = await orchestrator.run_search_based_crawl(
        search_term="Software development consultancy finland",
        num_results=5
    )
    
    if results["success"]:
        print(f"✅ Analyzed {results['num_companies']} companies")
        print(f"📊 Report: {results['reports']['aggregate_excel']}")
        
        # Print summary
        for analysis in results["analyses"]:
            print(f"\n{analysis.company_name}:")
            print(f"  Soft: {', '.join(analysis.soft_values)}")
            print(f"  Hard: {', '.join(analysis.hard_values)}")

asyncio.run(run_example())

Expected Results

The system will:

  1. Search for "Software development consultancy finland" on Google
  2. Find approximately 5 relevant company websites
  3. Crawl each website using AI to navigate to important pages:
    • Homepage
    • About/Company pages
    • Values/Mission pages
    • Culture/Team pages
  4. Extract values statements from each company
  5. Classify values as:
    • Soft values: People-oriented (e.g., caring, openness, collaboration, trust)
    • Hard values: Business-oriented (e.g., innovation, efficiency, quality, excellence)
  6. Generate reports:
    • Excel table with all companies
    • Individual markdown reports per company
    • Summary statistics
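The soft/hard split in step 5 can be sketched as a simple keyword lookup. This is only an illustration: the actual system uses an LLM for classification, and the keyword sets below are assumptions, not the project's real lists.

```python
# Minimal keyword-based sketch of the soft/hard classification in step 5.
# The real system uses an LLM; these keyword sets are illustrative assumptions.
SOFT_KEYWORDS = {"caring", "openness", "collaboration", "trust", "transparency", "empathy"}
HARD_KEYWORDS = {"innovation", "efficiency", "quality", "excellence", "results"}

def classify_values(values):
    """Split a list of value strings into soft and hard buckets."""
    soft, hard = [], []
    for value in values:
        if value.lower() in SOFT_KEYWORDS:
            soft.append(value)
        elif value.lower() in HARD_KEYWORDS:
            hard.append(value)
    return soft, hard

soft, hard = classify_values(["Trust", "Innovation", "Quality", "Openness"])
print(soft)  # ['Trust', 'Openness']
print(hard)  # ['Innovation', 'Quality']
```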

Sample Output Table

| Company | Website | Soft Values | Hard Values | Orientation | Summary |
|---|---|---|---|---|---|
| Company A | companya.fi | Trust, Collaboration, Openness | Innovation, Quality | Balanced | Emphasizes both team culture and technical excellence |
| Company B | companyb.fi | Caring, Transparency | Efficiency, Results, Excellence | Business-Focused | Performance-driven with strong delivery focus |
| Company C | companyc.fi | Diversity, Well-being, Empathy | Innovation, Scalability | People-Focused | Strong emphasis on employee experience and culture |
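A table like the one above can be assembled from per-company results with the standard library's csv module. This is a sketch using the sample data, not real crawl output; the project itself writes Excel and CSV via its report generator.

```python
import csv

# Rows mirror the sample table above (illustrative data, not real crawl output).
rows = [
    {"Company": "Company A", "Website": "companya.fi",
     "Soft Values": "Trust, Collaboration, Openness",
     "Hard Values": "Innovation, Quality",
     "Orientation": "Balanced",
     "Summary": "Emphasizes both team culture and technical excellence"},
]

with open("aggregate_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```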

Analysis Insights

After running the example, you might find:

  • Common soft values: Trust, collaboration, openness, transparency
  • Common hard values: Innovation, quality, excellence, customer focus
  • Orientation: Finnish software consultancies often balance culture and performance
  • Unique patterns: Focus on Nordic values like transparency and work-life balance

Files Generated

After running the example, check these directories:

  1. ./reports/:

    • aggregate_results_TIMESTAMP.xlsx - Main results table
    • aggregate_results_TIMESTAMP.csv - CSV version
    • aggregate_summary_TIMESTAMP.md - Summary report
    • CompanyName_TIMESTAMP.md - Individual reports for each company
  2. ./outputs/:

    • Temporary files and logs
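The TIMESTAMP suffix in the filenames above can be produced as shown below. The exact format string the project uses is an assumption; only the filename pattern comes from this document.

```python
from datetime import datetime
from pathlib import Path

# Assumed timestamp format; the project's actual format string may differ.
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
reports_dir = Path("./reports")
excel_path = reports_dir / f"aggregate_results_{timestamp}.xlsx"
print(excel_path)
```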

Troubleshooting the Example

"No results found"

  • Check your internet connection
  • Try with fewer results (e.g., 3 instead of 5)
  • Google might be rate-limiting; wait a few minutes

"API rate limit exceeded"

  • You're making too many LLM calls
  • Reduce number of results
  • Check your API quota
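If rate limits persist, a simple exponential-backoff wrapper around the LLM call can help. This is a generic sketch; `call_llm` is a hypothetical stand-in for whatever client the project actually uses.

```python
import time

def with_backoff(fn, max_retries=4, base_delay=1.0):
    """Call fn, retrying on any exception and doubling the delay each attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; propagate the error
            time.sleep(base_delay * (2 ** attempt))

# Usage (call_llm is hypothetical):
# result = with_backoff(lambda: call_llm(prompt))
```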

"Low confidence scores"

  • Some companies don't clearly state their values
  • The AI infers what it can from available content
  • This is expected behavior

"Takes too long"

  • Each company takes 1-3 minutes to crawl and analyze
  • 5 companies = 5-15 minutes total
  • Be patient; AI navigation takes time

Understanding the AI Navigation

Unlike simple HTML parsers, this system:

  1. Renders pages like a real browser (JavaScript, dynamic content)
  2. Reads the page content to understand it
  3. Uses AI to decide which links are most relevant
  4. Follows the most promising links (about, values, mission pages)
  5. Extracts meaningful content while filtering navigation/ads
  6. Analyzes the aggregated content to find values
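The "decide which links are most relevant" step (step 3) can be approximated with a keyword score over anchor text and URLs. This heuristic is only a sketch of the idea; the real system asks an LLM to rank links, and the term list here is an assumption.

```python
# Heuristic stand-in for the LLM's link-relevance decision (step 3).
RELEVANT_TERMS = ("about", "values", "mission", "culture", "team", "company")

def score_link(anchor_text, href):
    """Count how many relevant terms appear in the anchor text or URL."""
    haystack = f"{anchor_text} {href}".lower()
    return sum(term in haystack for term in RELEVANT_TERMS)

links = [("Careers", "/careers"), ("Our Values", "/about/values"), ("Blog", "/blog")]
best = max(links, key=lambda link: score_link(*link))
print(best)  # ('Our Values', '/about/values')
```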

This is what makes it "AI-assisted web crawling" rather than simple scraping.

Tips for Better Results

  1. Be specific with search terms:

    • Good: "Software development consultancy Helsinki"
    • Bad: "Companies in Finland"
  2. Start small: Test with 2-3 companies first

  3. Check the logs: Watch the console for progress updates

  4. Review individual reports: They contain more detail than the table

  5. Adjust settings in .env if needed:

    • Increase MAX_CRAWL_DEPTH for deeper crawling
    • Increase CRAWL_TIMEOUT for slow sites
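Those .env settings can be read with fallback defaults as sketched below. Only the variable names come from this document; the default values and the assumption that both are integers are illustrative.

```python
import os

# Variable names match the .env settings above; the defaults are assumptions.
MAX_CRAWL_DEPTH = int(os.getenv("MAX_CRAWL_DEPTH", "2"))
CRAWL_TIMEOUT = int(os.getenv("CRAWL_TIMEOUT", "30"))  # seconds
print(MAX_CRAWL_DEPTH, CRAWL_TIMEOUT)
```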

Success Criteria

Your example run is successful if:

  • ✅ Finds at least 3 company websites
  • ✅ Successfully crawls each website
  • ✅ Extracts some values (even if not all are clear)
  • ✅ Classifies values into soft/hard categories
  • ✅ Generates the results table
  • ✅ Creates individual reports

Even if some companies have low confidence scores or unclear values, the system is working correctly—some companies simply don't have clear values statements on their websites.

Next Steps

After running the example:

  1. Review the results: Open the Excel file in ./reports/
  2. Check individual reports: Read the markdown files
  3. Try your own search: Use different search terms
  4. Upload a CSV: Test the CSV-based input method
  5. Experiment: Try different configurations in .env

Remember: This is a bootcamp prototype for learning. The goal is to understand AI-powered web crawling and agentic systems, not to build a production-ready tool.