Global AI and Data Science

Building an AI-Driven Intelligent Web Scraper with IBM Watsonx & Firecrawl

By Anusha Garlapati posted 20 days ago

  

Contributors: M N V Satya Sai Muni Parmesh, Anusha Garlapati

Introduction

Discover how to supercharge your web scraping workflow by combining the power of IBM Watsonx's generative AI with Firecrawl, a robust web scraping and mapping tool. In today's AI-driven era, web scraping has moved far beyond basic HTML parsing into the realm of intelligent, context-aware data extraction. By combining the IBM Watsonx LLM (for semantic reasoning, smart keyword generation, relevance ranking, and precise fact extraction) with Firecrawl, a powerful web mapping and multi-format scraping API, you can build an adaptive pipeline that:

  • Identifies only the most relevant pages aligned with your objective
  • Scrapes clean, structured data in formats like markdown or JSON
  • Extracts goal-specific insights without manual rules or filtering

This synergy of generative AI and advanced crawling transforms raw data into actionable knowledge, making your scraping workflows faster, smarter, and far more accurate. This guide walks you through a complete, production-ready implementation you can adapt for automated data extraction, knowledge curation, and advanced search. 

Why IBM WatsonX + Firecrawl? 

  • The IBM Watsonx LLM provides semantic reasoning for keyword generation, relevance ranking, and custom extraction.
  • Firecrawl maps, crawls, and extracts structured content from websites quickly, making it ideal for large-scale or focused scraping.
  • Smooth integration: together they streamline the entire process, from link discovery to deep information extraction. 

Architecture Overview 

Our pipeline has five phases: 

  1. Environment Setup: Configure Firecrawl and Watsonx credentials.
  2. Smart Mapping: Use Watsonx to generate targeted search phrases for Firecrawl link mapping.
  3. Link Ranking: Watsonx ranks mapped URLs by their relevance to your objective.
  4. Scraping: Retrieve structured data from the top-ranked pages.
  5. AI-based Extraction: Extract only the facts relevant to the objective. 

Full Code With Detailed Explanations  

Below is the production-ready implementation with inline guidance.

import os 
import json 
from firecrawl import FirecrawlApp 
from dotenv import load_dotenv 
from langchain_ibm import WatsonxLLM 
# Coloured Log Output for Debugging 
class Colors: 
    CYAN = '\033[96m' 
    YELLOW = '\033[93m' 
    GREEN = '\033[92m' 
    RED = '\033[91m' 
    MAGENTA = '\033[95m' 
    BLUE = '\033[94m' 
    RESET = '\033[0m' 

# 1. Environment Setup 
load_dotenv() 
# Credentials (Use environment variables in production) 
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY", "<your-key>") 
watsonx_api_key = os.getenv("WATSONX_API_KEY", "<your-key>") 
watsonx_project_id = os.getenv("WATSONX_PROJECT_ID", "<your-project-id>") 
watsonx_url = os.getenv("WATSONX_URL", "https://us-south.ml.cloud.ibm.com") 

# Initialize Firecrawl 
app = FirecrawlApp(api_key=firecrawl_api_key) 

# Initialize Watsonx LLM 
llm = WatsonxLLM( 
    model_id="mistralai/mistral-medium-2505", 
    url=watsonx_url, 
    project_id=watsonx_project_id, 
    apikey=watsonx_api_key, 
    params={ 
        "decoding_method": "greedy", 
        "max_new_tokens": 1024, 
        "temperature": 0.35, 
        "top_k": 50 
    } 
) 
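For local runs, the credentials read above can live in a `.env` file next to the script, which `load_dotenv()` picks up automatically. The values below are placeholders, and `WATSONX_PROJECT_ID` is an assumed variable name you would only need if you also load the project ID from the environment:

```shell
# .env - loaded by load_dotenv(); never commit real keys to version control
FIRECRAWL_API_KEY=fc-xxxxxxxxxxxxxxxx
WATSONX_API_KEY=xxxxxxxxxxxxxxxx
WATSONX_PROJECT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
WATSONX_URL=https://us-south.ml.cloud.ibm.com
```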

Step 1 - Mapping Relevant Pages

We generate a search keyword for the target objective using Watsonx, then map the website with Firecrawl. 

def debug_firecrawl_map(url, search_param): 
    """Debug function to test Firecrawl mapping""" 
    # Test without search parameter first 
    map_result_no_search = app.map_url(url) 
    # Test with search parameter 
    map_result_with_search = app.map_url(url, search=search_param) 
    return map_result_no_search, map_result_with_search 

find_relevant_page_via_map_fixed(objective, url, app, llm):

  • WatsonX generates a search keyword from your objective.
  • Firecrawl maps the URL, optionally with the search keyword.
  • Extracts and ranks up to 10 links, selecting the top 3 most relevant. 

def find_relevant_page_via_map_fixed(objective, url, app, llm): 
    """Fixed version with structured debugging and no try/except""" 
    print(f"{Colors.CYAN}Objective: {objective}{Colors.RESET}") 
    # Step A: Generate search keyword 
    map_prompt = f""" 
    Based on this objective: {objective} 
    Generate a simple 1-2 word search term that would help find relevant pages. 
    Examples: If looking for founding date, use "founded" or "history" 
    If looking for contact info, use "contact" or "about" 
    Only respond with 1-2 words, nothing else. 
    """ 
    map_search_parameter = llm.invoke(map_prompt).strip() 
    print(f"{Colors.GREEN}Search keyword: {map_search_parameter}{Colors.RESET}") 
    # Step B: Debug the mapping process 
    no_search_result, with_search_result = debug_firecrawl_map(url, map_search_parameter) 
    # Step C: Extract links from API response 
    links = [] 
    # 1. Try with search parameter 
    if with_search_result and isinstance(with_search_result, dict): 
        if with_search_result.get('status') == 'success': 
            links = with_search_result.get('links', []) 
            print(f"{Colors.GREEN}Found {len(links)} links with search parameter{Colors.RESET}") 
    # 2. If no results, try without search parameter 
    if not links and no_search_result and isinstance(no_search_result, dict): 
        print(f"{Colors.YELLOW}No results with search, trying without search parameter{Colors.RESET}") 
        if no_search_result.get('status') == 'success': 
            links = no_search_result.get('links', []) 
            print(f"{Colors.GREEN}Found {len(links)} links without search parameter{Colors.RESET}") 
    # 3. If still nothing, handle MapResponse object format 
    if not links: 
        print(f"{Colors.YELLOW}Trying to extract from response objects{Colors.RESET}") 
        for result in [with_search_result, no_search_result]: 
            if result and hasattr(result, 'links'): 
                links = result.links 
                break 
            elif result and hasattr(result, '__dict__') and 'links' in result.__dict__: 
                links = result.__dict__['links'] 
                break 
    # If no links at all, fallback to original URL 
    if not links: 
        return [url] 
    print(f"{Colors.GREEN}Found {len(links)} links{Colors.RESET}") 
    # Limit to reasonable number for ranking 
    if len(links) > 10: 
        links = links[:10] 
    # Step D: Rank URLs 
    rank_prompt = f""" 
    Given this list of URLs and the objective: {objective} 
    Rank the top 3 most relevant ones based on URL structure and content hints. 
    Return ONLY a JSON array of 3 objects (or fewer if less than 3 URLs available) with: 
    - "url": the URL string 
    - "relevance_score": number from 1-10 
    - "reason": brief explanation 
    URLs to rank: 
    {json.dumps(links, indent=2)} 
    Respond with valid JSON only. 
    """ 
    ranked_raw = llm.invoke(rank_prompt).strip() 
    # Strip Markdown code fences if the model added them 
    if ranked_raw.startswith("```json"): 
        ranked_raw = ranked_raw.split("```json", 1)[1] 
    if ranked_raw.endswith("```"): 
        ranked_raw = ranked_raw.rsplit("```", 1)[0] 
    ranked_raw = ranked_raw.strip() 
    # Load the JSON (now will raise naturally if invalid) 
    ranked_results = json.loads(ranked_raw) 
    print(f"{Colors.MAGENTA}Ranked Results:{Colors.RESET}") 
    for i, r in enumerate(ranked_results): 
        print(f"{Colors.GREEN}{i+1}. URL: {r['url']}{Colors.RESET}") 
        print(f"   {Colors.YELLOW}Score: {r['relevance_score']}{Colors.RESET}") 
        print(f"   {Colors.BLUE}Reason: {r['reason']}{Colors.RESET}") 
        print("---") 
    return [r["url"] for r in ranked_results] 
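The fence-stripping above handles the common case where the model wraps its JSON in ```json fences. If you want something more forgiving, a small helper (my own sketch, not part of the original pipeline) can pull the first JSON array or object out of arbitrary LLM output, even when the model adds surrounding prose:

```python
import json
import re

def parse_llm_json(raw: str):
    """Pull the first JSON object or array out of an LLM response,
    tolerating Markdown code fences and surrounding prose."""
    # Prefer the contents of a ```json ... ``` fence if one is present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    # Otherwise fall back to the first bracketed span
    match = re.search(r"(\[.*\]|\{.*\})", raw, re.DOTALL)
    if match:
        raw = match.group(1)
    return json.loads(raw)
```

With this helper, `ranked_results = parse_llm_json(ranked_raw)` would replace the manual stripping, and it still raises naturally on invalid JSON, matching the "no try/except" debugging style used here.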

Step 2 - Scraping & Extraction 

We loop through top-ranked URLs, scrape contents, and pass them to Watsonx for relevance validation & data extraction.

find_objective_in_top_pages_fixed(links, objective, app, llm): 

  • Scrapes each top link.
  • WatsonX analyses content, extracts or rejects based on the objective.
  • Returns a structured JSON with findings or a raw fallback. 

def find_objective_in_top_pages_fixed(links, objective, app, llm): 
    """Fixed version with better content handling (no try/except)""" 
    for i, link in enumerate(links): 
        print(f"{Colors.YELLOW}[{i+1}/{len(links)}] Scraping: {link}{Colors.RESET}") 
        # Use updated Firecrawl API - remove params parameter 
        scrape_result = app.scrape_url(link, formats=['markdown']) 
        # Handle different response formats 
        if hasattr(scrape_result, 'markdown'): 
            content = scrape_result.markdown 
        elif hasattr(scrape_result, 'content'): 
            content = scrape_result.content 
        elif isinstance(scrape_result, dict): 
            content = ( 
                scrape_result.get('markdown') or  
                scrape_result.get('content') or  
                scrape_result.get('data', {}).get('markdown') or 
                str(scrape_result) 
            ) 
        else: 
            content = str(scrape_result) 
        # Limit content length to avoid token limits 
        if len(content) > 4000: 
            content = content[:4000] + "..." 
        print(f"{Colors.BLUE}Content length: {len(content)} characters{Colors.RESET}") 
        # Build LLM analysis prompt 
        check_prompt = f""" 
        Analyze the following content to determine if it contains information relevant to the objective. 
        Objective: {objective} 
        Content: 
        {content} 
        Instructions: 
        - If the content does NOT contain information relevant to the objective, respond with exactly: "Objective not met" 
        - If the content DOES contain relevant information, extract it and return a JSON object with the findings 
        Response:""" 
        # Get response from LLM 
        result_raw = llm.invoke(check_prompt).strip() 
        # Clean up Markdown fenced code block if present 
        if result_raw.startswith("```json"): 
            result_raw = result_raw.split("```json", 1)[1] 
        if result_raw.endswith("```"): 
            result_raw = result_raw.rsplit("```", 1)[0] 
        result_raw = result_raw.strip() 
        print(f"{Colors.CYAN}LLM Response: {result_raw[:200]}...{Colors.RESET}") 
        # Check if LLM says objective not met 
        if "objective not met" in result_raw.lower(): 
            print(f"{Colors.YELLOW}Objective not met in this page{Colors.RESET}") 
            continue 
        # Try direct JSON parsing (will raise naturally if invalid) 
        json_result = json.loads(result_raw) 
        print(f"{Colors.GREEN} Objective met! Found relevant information.{Colors.RESET}") 
        return json_result 
    return None 
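The functions here deliberately avoid try/except so that failures surface during debugging. In production you may want a thin retry wrapper around network calls such as the `app.scrape_url` call above; the sketch below is a generic helper of my own, not part of the original post:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(); on failure, wait delay * 2**attempt seconds and retry.
    Re-raises the last exception if every attempt fails."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (2 ** attempt))

# Usage against the pipeline above (app and link come from the surrounding code):
# content = with_retries(lambda: app.scrape_url(link, formats=['markdown']))
```

Exponential backoff is a reasonable default for scraping APIs, since transient rate limits usually clear within a few seconds.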

Step 3 - Orchestration (Main Execution)

run_intelligent_scraper(url, objective) glues the workflow together, fully automated. 

def run_intelligent_scraper(url, objective): 
    """Main orchestrator.""" 
    print(f"{Colors.MAGENTA} Running AI Scraper{Colors.RESET}") 
    top_links = find_relevant_page_via_map_fixed(objective, url, app, llm) 
    result = find_objective_in_top_pages_fixed(top_links, objective, app, llm) 
    if result: 
        print(f"{Colors.GREEN} Found Data:{Colors.RESET}") 
        print(json.dumps(result, indent=2)) 
        return result 
    else: 
        print(f"{Colors.RED} No relevant data found.{Colors.RESET}") 
        return None 

# Example Run 
if __name__ == "__main__": 
    url = "https://en.wikipedia.org/wiki/OpenAI" 
    objective = "find founding date" 
    run_intelligent_scraper(url, objective) 

 

Real Example Flow

When running:  

python scraper.py 

You might see: 

  1. Watsonx generates the keyword: "founded"
  2. Firecrawl maps related Wikipedia sections
  3. Watsonx ranks ['History_of_OpenAI', 'About', 'Founding']
  4. Firecrawl scrapes these pages, and Watsonx extracts: 

OUTPUT

Step 1: Finding relevant pages... 

Objective: find founding date 

Search keyword: founded 

Testing Firecrawl map with URL: https://en.wikipedia.org/wiki/OpenAI 

Search parameter: founded 

Testing map without search parameter... 

Map result (no search): <class 'firecrawl.firecrawl.MapResponse'> 

Result structure: success=True links=['https://en.wikipedia.org/wiki/OpenAI'] error=None 

Testing map with search parameter... 

Map result (with search): <class 'firecrawl.firecrawl.MapResponse'> 

Result structure: success=True links=['https://en.wikipedia.org/wiki/OpenAI'] error=None 

Trying to extract from response objects 

Found 1 links

Ranked Results: 

1. URL: https://en.wikipedia.org/wiki/OpenAI 

   Score: 9 

   Reason: Wikipedia pages often contain detailed historical information, including founding dates, in the infobox or early sections of the article. 

--- 

Step 2: Analyzing pages for objective... 

[1/1] Scraping: https://en.wikipedia.org/wiki/OpenAI 

Content length: 4003 characters 

LLM Response: { 

                    "founding_date": "December 8, 2015" 

                }... 

Objective met! Found relevant information. 

Success! Objective completed. 

Final Result: 

{
  "founding_date": "December 8, 2015"
}

Key Features of the Fixed Code

  • AI-powered link selection: Watsonx generates keywords and ranks pages.
  • Flexible scraping: Firecrawl adapts to HTML, Markdown, or JSON formats.
  • Context-aware extraction: Only returns on-topic information.
  • Graceful fallbacks: Falls back to the original URL when mapping returns no links, and handles multiple Firecrawl response formats. 

Potential Use Cases 

  • Corporate Fact Extraction (e.g., founding dates, board members)
  • Contact Info Discovery across diverse web properties
  • Automated Knowledge Base Updates
  • Intelligent Monitoring & Alerting for published news, docs, or reports 

Dive deeper into error handling, alternate objectives, and multi-page scraping by exploring the Firecrawl API docs and Watsonx LLM parameter tuning. 
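As a starting point for that parameter tuning, the `params` dict passed to `WatsonxLLM` in the setup above can be varied per task. The values below are illustrative, not recommendations:

```python
# Deterministic settings for single-fact extraction (illustrative values)
extraction_params = {
    "decoding_method": "greedy",  # no sampling: reproducible answers
    "max_new_tokens": 512,        # single facts need little room
    "temperature": 0.0,
    "top_k": 50,
}

# Looser settings for keyword brainstorming (illustrative values)
keyword_params = {
    "decoding_method": "sample",  # allow some variety in search terms
    "max_new_tokens": 16,         # we only want 1-2 words
    "temperature": 0.7,
    "top_k": 50,
}
```

One pattern is to instantiate two `WatsonxLLM` objects, one per parameter set, so extraction stays deterministic while keyword generation can explore.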

 

Final Thoughts 

Integrating Watsonx's reasoning abilities with Firecrawl's powerful mapping/scraping transforms traditional web automation. You get a workflow that's: 

  • Adaptive: Finds the best links and info, even from complex sites.
  • Scalable: Slices through thousands of pages with minimal manual oversight.
  • Extensible: Ready for plug-in objectives or new content types. 

Try out the sample code, adapt it for your own knowledge-extraction needs, and unleash the next level of web automation with IBM Watsonx and Firecrawl. 

 


#GenerativeAI   #IBMWatsonx    #WebScraping   #IntelligentAutomation

#AIEngineering   #Python  #EnterpriseAI    #KnowledgeExtraction


#community-stories2