10 Web Scraping Problems And How to Fix Them

March 5, 2025


If you’ve been using web scraping for e-commerce for a while, you’ve probably come across web scraping problems such as IP blocking, failed data retrieval, slow-running scrapers, and frequent instances of incorrect or missing data. However, if you understand the root cause of each problem, you can resolve them all. The guide below is designed to help you do just that.

Most web scraping problems come from four core areas: anti-bot systems, poor scraper architecture, low data quality, and legal constraints. The most common issues include IP bans, JavaScript rendering, DOM changes, duplicate data, and rate limiting.

The root cause is rarely the code itself, but how the scraping system is designed

To fix these problems:

Simulate real browser behavior (don’t act like a bot)

Design a clear pipeline (crawl → parse → store)

Clean and validate data from the start

At scale, building your own scraper becomes hard to maintain and costly compared to using a professional solution

In short: reliable web scraping depends on system design, not just code.

Overview of Common Web Scraping Problems

Based on experience scraping e-commerce websites for multiple real-world projects in Southeast Asia, Easy Data has observed that web scraping problems usually revolve around 4 main groups, each affecting the system in its own way:

Website-side issues: anti-bot systems, JavaScript rendering, geo-restrictions
Scraper architecture issues: scaling, performance, concurrency
Data issues: duplicates, missing fields, incorrect formats
Legal factors: Terms of Service, private data

Problem Group	Difficulty	When It Happens	Impact	Recommended Approach
Website-level issues	High	When crawling dynamic or protected websites (Shopee, Lazada, TikTok Shop)	Data cannot be retrieved or scraper gets blocked → pipeline stops	Simulate real browser behavior, render JavaScript, control crawl rate
Scraper architecture issues	High	When moving from small test to large-scale crawling	System becomes slow, unstable, or fails at scale → cannot maintain data flow	Use async scraping, queues, and separate pipeline stages
Data-related issues	Medium	After scraping raw data from multiple pages or sources	Data becomes unreliable → leads to incorrect analysis and decisions	Normalize, deduplicate, validate schema before storing
Legal & compliance issues	Medium	When scraping without clear rules or governance	Risk of being blocked or violating policies → long-term operational risk	Only scrape public data, follow ToS, define clear data usage scope

Understanding these groups helps you predict issues before they happen, design the system correctly from the beginning, and most importantly, avoid having to “fix the same issue again and again.”

Top 10 Web Scraping Problems And How to Fix Them

During the process of web scraping, issues can appear at any stage. The key is to identify the problem correctly and handle it in the right way.

The guide below summarizes the 10 most common web scraping problems, following the actual workflow of a real scraping system. You can use it as a checklist to debug your current system or as a foundation when designing a pipeline from scratch.

10 most common web scraping problems in 2026

1. Anti-Bot Systems (IP Ban, CAPTCHA, Fingerprinting)

When you first run a scraper, everything usually goes smoothly: data comes in steadily, requests are stable, and there are no significant errors.

But after a short time:

Requests start to fail
The page asks for CAPTCHA
Or worse, your IP gets completely blocked

The easiest sign to recognize: the code still runs normally, but you can no longer get data because the website has “recognized that you are not a real user”.
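Because a blocked scraper often keeps running without raising an error, it can help to classify each response before parsing it. The sketch below is one possible heuristic rather than a definitive detector: the function name `looks_blocked`, the status codes, and the marker strings are all illustrative assumptions and would need tuning for each target site (a legitimate page can also mention a CAPTCHA).

```python
# Hypothetical heuristic: the status codes and marker strings are assumptions.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")

def looks_blocked(status_code: int, body: str) -> bool:
    """Return True when a response looks like a block page rather than real content."""
    if status_code in BLOCK_STATUS_CODES:
        return True

    # Only inspect the start of the page so large product pages stay cheap to check.
    lowered_body = body[:5000].lower()
    return any(marker in lowered_body for marker in BLOCK_MARKERS)
```

A scraper could call this right after each request and slow down or stop when it returns True, instead of silently storing empty results.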

Causes:

Sending requests too fast → looks like a bot
Accessing too many URLs continuously
Not following robots.txt
Request headers don’t look like a real browser
Accessing data that requires login
Using headless browser, but fingerprint is exposed

How to handle:

Reduce crawl speed, don’t spam requests
Add retry + reasonable delay
Make requests more “human-like” (headers, behavior)
Prioritize scraping public data only
Respect robots.txt and protection mechanisms

Example code: check robots.txt before scraping


from urllib.robotparser import RobotFileParser

def can_scrape(base_url: str, target_url: str, user_agent: str) -> bool:
    robots_parser = RobotFileParser()
    robots_parser.set_url(f"{base_url}/robots.txt")
    robots_parser.read()

    return robots_parser.can_fetch(user_agent, target_url)

base_url = "https://example.com"
target_url = "https://example.com/products"
user_agent = "MyResearchBot"

if can_scrape(base_url, target_url, user_agent):
    print("The URL is allowed for crawling.")
else:
    print("The URL is blocked by robots.txt.")

Example code: polite request with delay and retry


import time
import random
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

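retry_strategy = Retry(
    total=3,
    backoff_factor=2,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
    # Return the last response instead of raising once retries are exhausted,
    # so the explicit 429 check in fetch_html() below can actually run.
    raise_on_status=False
)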

adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)

session.headers.update({
    "User-Agent": "MyResearchBot/1.0 (+contact@example.com)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9"
})

def fetch_html(url: str):
    time.sleep(random.uniform(2, 5))

    response = session.get(url, timeout=15)

    if response.status_code == 429:
        print("Rate limited. Reduce crawling speed.")
        return None

    response.raise_for_status()
    return response.text

2. JavaScript Rendering & Dynamic Content

A situation that easily makes you “doubt yourself”: you inspect and see full data, but when running the scraper, nothing shows up.

At this point, many people think the code is wrong, but the real issue lies in how the website works.

Returned HTML is almost empty
No product name, no price
But in the browser everything still displays normally

Causes:

Website uses React / Vue
Data loads after page render
Content only appears when scrolling
DOM changes after load

How to handle:

Use Playwright / Selenium
Wait for elements to appear (don’t use hard sleep)
Get HTML after rendering is complete
Parse the rendered HTML

Example code: render the website with Playwright and extract HTML


from playwright.sync_api import sync_playwright

def get_rendered_html(url: str) -> str:
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)

        page = browser.new_page(
            user_agent="MyResearchBot/1.0 (+contact@example.com)"
        )

        page.goto(url, wait_until="networkidle", timeout=60000)

        rendered_html = page.content()

        browser.close()
        return rendered_html

Example code: parse rendered HTML


from bs4 import BeautifulSoup

html = get_rendered_html("https://example.com/products")
soup = BeautifulSoup(html, "html.parser")

products = []

for product_card in soup.select(".product-card"):
    name_element = product_card.select_one(".product-name")
    price_element = product_card.select_one(".product-price")

    products.append({
        "name": name_element.get_text(strip=True) if name_element else None,
        "price": price_element.get_text(strip=True) if price_element else None
    })

print(products)
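Before reaching for a headless browser, it can be worth confirming that the raw HTML really is the "almost empty" shell described above. The following is a rough sketch using only the standard library; the function name `probably_needs_rendering` and the 200-character threshold are assumptions chosen for illustration, not fixed rules.

```python
from html.parser import HTMLParser

class _VisibleTextCounter(HTMLParser):
    """Counts characters of text that are not inside <script> or <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_characters = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_characters += len(data.strip())

def probably_needs_rendering(raw_html: str, min_visible_characters: int = 200) -> bool:
    """Guess whether a page is a JavaScript shell (threshold is an assumption)."""
    counter = _VisibleTextCounter()
    counter.feed(raw_html)
    counter.close()
    return counter.visible_characters < min_visible_characters
```

If the check returns False, a plain HTTP request is probably enough and the heavier browser-based path can be skipped for that page.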

3. DOM Structure Changes

This is a very “annoying” web scraping problem because it doesn’t show clear errors.

There is no crash and no exception, but the data gradually becomes incorrect (e.g., prices shift to the wrong field, product names go missing, or every field comes back as None). A single small frontend change can make all your selectors useless immediately.

Causes:

Website updates UI
A/B testing
CSS class changes
Elements get renamed

How to handle:

Avoid depending on “random” classes
Use more stable selectors
Add fallback selectors
Monitor output to detect early

Example code: fallback selector


from bs4 import BeautifulSoup

def extract_product_name(html: str):
    soup = BeautifulSoup(html, "html.parser")

    selectors = [
        '[data-testid="product-title"]',
        "h1.product-title",
        "h1",
        'meta[property="og:title"]'
    ]

    for selector in selectors:
        element = soup.select_one(selector)

        if element:
            if element.name == "meta":
                return element.get("content", "").strip()

            return element.get_text(strip=True)

    return None

Example code: detect broken selectors


def validate_scraping_result(items):
    if not items:
        raise ValueError("No items found. The DOM structure may have changed.")

    items_without_names = [
        item for item in items
        if not item.get("name")
    ]

    missing_ratio = len(items_without_names) / len(items)

    if missing_ratio > 0.3:
        raise ValueError("Too many missing product names. The selector may be broken.")

4. Scaling Systems

The most obvious sign: the more URLs you need to crawl, the slower the system becomes, timeouts happen continuously, and requests become unstable.

Causes:

Sequential requests
No queue
No retry
No status tracking

How to handle:

Use async to run in parallel
Limit concurrency (avoid being blocked)
Separate crawl/parse/store
Use queue to manage jobs

Example code: async HTML scraping with concurrency limits


import asyncio
import aiohttp

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3"
]

async def fetch_html(session, url):
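    # aiohttp expects a ClientTimeout object; passing a bare number is deprecated.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=20)) as response: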
        response.raise_for_status()
        return await response.text()

async def main():
    headers = {
        "User-Agent": "MyResearchBot/1.0 (+contact@example.com)"
    }

    connector = aiohttp.TCPConnector(limit=5)

    async with aiohttp.ClientSession(
        headers=headers,
        connector=connector
    ) as session:
        tasks = [fetch_html(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        for url, result in zip(urls, results):
            if isinstance(result, Exception):
                print(f"Failed to fetch {url}: {result}")
            else:
                print(f"Fetched {url}: {len(result)} characters")

asyncio.run(main())
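The other items in the list above (a job queue, retries, status tracking, and separate crawl/parse/store stages) can be sketched with `asyncio.Queue`. This is a minimal illustration under stated assumptions, not a production design: `run_pipeline`, its parameters, and the status labels are hypothetical names, and the `fetch`, `parse`, and `store` callables are supplied by the caller.

```python
import asyncio

async def run_pipeline(urls, fetch, parse, store, worker_count=3, max_attempts=2):
    """Crawl -> parse -> store with a job queue, bounded workers, and per-URL status."""
    queue = asyncio.Queue()
    status = {}

    for url in urls:
        status[url] = "queued"
        queue.put_nowait((url, 1))

    async def worker():
        while True:
            url, attempt = await queue.get()
            try:
                html = await fetch(url)   # crawl stage
                item = parse(html)        # parse stage
                store(item)               # store stage
                status[url] = "done"
            except Exception:
                if attempt < max_attempts:
                    # Re-queue the job before task_done() so join() keeps waiting.
                    status[url] = "retrying"
                    queue.put_nowait((url, attempt + 1))
                else:
                    status[url] = "failed"
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(worker_count)]
    await queue.join()

    # All jobs are finished; stop the idle workers.
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

    return status
```

The returned dictionary shows which URLs finished and which ultimately failed, so a later run can re-crawl only the failures. The number of workers caps concurrency in the same spirit as the connector limit in the example above.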

5. Legal & Website Policy Constraints

Many teams only focus on “getting the data” and forget about compliance, leading to being blocked, limited data access, or worse, not being able to continue scraping.

Causes:

Violating Terms of Service
Scraping private data
Not checking robots.txt

How to handle:

Check robots.txt
Read the website’s Terms of Service
Avoid scraping personal data without a valid legal basis
Avoid login-protected content without permission
Do not bypass CAPTCHA, paywalls, or access controls
Crawl only public and permitted content

Example code: filter URLs before adding them to the crawl queue


from urllib.robotparser import RobotFileParser

class RobotsChecker:
    def __init__(self, base_url: str, user_agent: str):
        self.base_url = base_url.rstrip("/")
        self.user_agent = user_agent

        self.parser = RobotFileParser()
        self.parser.set_url(f"{self.base_url}/robots.txt")
        self.parser.read()

    def is_allowed(self, url: str) -> bool:
        return self.parser.can_fetch(self.user_agent, url)

robots_checker = RobotsChecker(
    base_url="https://example.com",
    user_agent="MyResearchBot"
)

url = "https://example.com/products"

if robots_checker.is_allowed(url):
    print("Add URL to the crawl queue.")
else:
    print("Skip URL because it is not allowed by robots.txt.")
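Some robots.txt files also declare a Crawl-delay, which the standard-library parser exposes through `crawl_delay()`. The sketch below parses robots.txt text that is already in memory so it runs without a network call; the sample rules and the helper name `parse_robots_rules` are hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

def parse_robots_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Hypothetical robots.txt content used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = parse_robots_rules(SAMPLE_ROBOTS_TXT)

# crawl_delay() returns None when the file declares no Crawl-delay for this agent.
crawl_delay = parser.crawl_delay("MyResearchBot")
can_fetch_products = parser.can_fetch("MyResearchBot", "https://example.com/products")
can_fetch_private = parser.can_fetch("MyResearchBot", "https://example.com/private/orders")
```

When a delay is declared, using it as the minimum pause between requests is a simple way to stay within what the site asks for.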

6. Duplicate Data

This web scraping problem is very common but often overlooked. The dataset may look complete, but when analyzing carefully:

One product appears multiple times
Average price becomes incorrect
Item count is inflated

This directly affects decisions based on data, especially in pricing or competitor tracking.
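A small worked example makes the effect concrete. The rows below are hypothetical: one product was collected twice, which pulls the average price from 250,000 down to 200,000.

```python
from statistics import mean

# Hypothetical scraped rows: SKU123 was collected twice via two different URLs.
scraped_rows = [
    {"product_id": "SKU123", "price": 100_000},
    {"product_id": "SKU123", "price": 100_000},
    {"product_id": "SKU456", "price": 400_000},
]

# (100000 + 100000 + 400000) / 3
average_with_duplicates = mean(row["price"] for row in scraped_rows)

# Keep one row per product_id before aggregating: (100000 + 400000) / 2
unique_rows = {row["product_id"]: row for row in scraped_rows}.values()
average_without_duplicates = mean(row["price"] for row in unique_rows)
```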

Causes:

Pagination
URLs with tracking params
Multiple URLs for the same product

How to handle:

Normalize URLs
Deduplicate by ID
Remove tracking params

Example code: normalize URLs


from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

TRACKING_PARAMETERS = {
    "utm_source",
    "utm_medium",
    "utm_campaign",
    "utm_term",
    "utm_content",
    "fbclid",
    "spm"
}

def normalize_url(url: str) -> str:
    parsed_url = urlparse(url)

    filtered_query = [
        (key, value)
        for key, value in parse_qsl(parsed_url.query)
        if key not in TRACKING_PARAMETERS
    ]

    normalized_url = parsed_url._replace(
        scheme=parsed_url.scheme.lower(),
        netloc=parsed_url.netloc.lower(),
        query=urlencode(filtered_query),
        fragment=""
    )

    return urlunparse(normalized_url)

Example code: deduplicate items


seen_items = set()

def is_new_item(item):
    product_id = item.get("product_id")
    product_url = normalize_url(item.get("url", ""))

    dedupe_key = product_id or product_url

    if dedupe_key in seen_items:
        return False

    seen_items.add(dedupe_key)
    return True

7. Geo-Restricted Content

The sign is very clear: the same URL returns different data depending on where the request comes from. For example:

Different prices by country
Products only appear in certain regions
Language, currency change

Causes:

Website detects location
Region-based pricing
Language-based content

How to handle:

Clearly define the region when crawling
Store the currency alongside every price
Compare data within the same region

Example code: store region metadata


from datetime import datetime, timezone

def build_product_record(product, region, language):
    return {
        "product_id": product["product_id"],
        "name": product["name"],
        "price": product["price"],
        "currency": product["currency"],
        "region": region,
        "language": language,
        "scraped_at": datetime.now(timezone.utc).isoformat()
    }

record = build_product_record(
    product={
        "product_id": "SKU123",
        "name": "Wireless Mouse",
        "price": 199000,
        "currency": "VND"
    },
    region="VN",
    language="en-US"
)

print(record)
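Once each record carries its region and currency, comparing data within the same region becomes a grouping step. A minimal sketch follows, assuming records shaped like the output of `build_product_record`; the function name `group_prices_by_market` and the sample values are hypothetical.

```python
from collections import defaultdict

def group_prices_by_market(records):
    """Group prices by (product_id, region, currency) so unlike markets are never mixed."""
    grouped = defaultdict(list)

    for record in records:
        key = (record["product_id"], record["region"], record["currency"])
        grouped[key].append(record["price"])

    return dict(grouped)

# Hypothetical records for the same product scraped from two regions.
sample_records = [
    {"product_id": "SKU123", "region": "VN", "currency": "VND", "price": 199000},
    {"product_id": "SKU123", "region": "VN", "currency": "VND", "price": 189000},
    {"product_id": "SKU123", "region": "TH", "currency": "THB", "price": 270},
]

prices_by_market = group_prices_by_market(sample_records)
```

Price changes are then computed inside each group, so a VND price is never compared against a THB price for the same SKU.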

8. Performance & Rate Limiting

If after some time scraping becomes slower, you frequently get HTTP 429, more timeouts, or inconsistent data, your system is being rate-limited or overloaded.

Causes:

Too many requests
No caching
Retry too fast

How to handle:

Limit crawling speed
Use exponential backoff
Cache HTML responses
Re-crawl only necessary URLs
Prioritize important URLs

Example code: retry with exponential backoff


import time
import random
import requests

def fetch_with_backoff(url, max_attempts=4):
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(
                url,
                timeout=15,
                headers={
                    "User-Agent": "MyResearchBot/1.0 (+contact@example.com)"
                }
            )

            if response.status_code == 429:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Waiting {wait_time:.2f} seconds.")
                time.sleep(wait_time)
                continue

            response.raise_for_status()
            return response.text

        except requests.RequestException as error:
            if attempt == max_attempts:
                raise error

            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Request failed. Retrying in {wait_time:.2f} seconds.")
            time.sleep(wait_time)

    return None

Example code: cache HTML responses


from requests_cache import CachedSession

session = CachedSession(
    cache_name="html_cache",
    expire_after=3600
)

response = session.get(
    "https://example.com/products",
    headers={
        "User-Agent": "MyResearchBot/1.0 (+contact@example.com)"
    }
)

print("Loaded from cache:", response.from_cache)

html = response.text
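Limiting crawl speed can be as simple as enforcing a minimum interval between requests. The class below is a sketch under assumptions: `RequestThrottle` is a hypothetical name, and the injectable `clock` and `sleep` parameters exist only to make the behavior easy to test without real waiting.

```python
import time

class RequestThrottle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval_seconds: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_seconds = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request_at = None

    def wait(self) -> float:
        """Block until the next request is allowed and return the seconds slept."""
        slept = 0.0

        if self._last_request_at is not None:
            elapsed = self._clock() - self._last_request_at
            remaining = self.min_interval_seconds - elapsed

            if remaining > 0:
                self._sleep(remaining)
                slept = remaining

        # Record the moment the caller is released to send its request.
        self._last_request_at = self._clock()
        return slept
```

Calling `throttle.wait()` immediately before each request keeps the overall rate below one request per interval, regardless of how quickly the rest of the loop runs.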

9. Complex HTML Structures

Not every website “lays out” data in an easy-to-scrape way. When facing complex HTML:

Selectors seem correct
Code runs normally
But data is wrong, missing, or taken from the wrong place

Causes:

Too many nested layers → selector becomes inaccurate
Repeated components → easy to select the wrong element
Real data is in script/JSON, but you scrape HTML
DOM changes based on state (hover, click, load more…)

How to handle:

Inspect carefully multiple times, not just one element
Prefer extracting from JSON in <script> or API (if available)
Write more “reliable” selectors
Validate data after scraping (don’t trust 100%)

Example code: parse JSON-LD from HTML


import json
from bs4 import BeautifulSoup

def extract_json_ld(html: str):
    soup = BeautifulSoup(html, "html.parser")
    script_elements = soup.select('script[type="application/ld+json"]')

    structured_data = []

    for script_element in script_elements:
        try:
            data = json.loads(script_element.string)
            structured_data.append(data)
        except (TypeError, json.JSONDecodeError):
            continue

    return structured_data
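The list returned by `extract_json_ld` can contain many kinds of nodes, sometimes nested inside an `@graph` array. The helper below is a hedged sketch for picking out schema.org `Product` nodes; `find_products` is a hypothetical name, and real pages may structure their JSON-LD differently.

```python
def find_products(structured_data):
    """Collect every JSON-LD node whose @type is (or includes) "Product"."""
    products = []

    def visit(node):
        if isinstance(node, list):
            for child in node:
                visit(child)
        elif isinstance(node, dict):
            node_type = node.get("@type")
            node_types = node_type if isinstance(node_type, list) else [node_type]

            if "Product" in node_types:
                products.append(node)

            # Some sites wrap their nodes in an "@graph" array.
            visit(node.get("@graph", []))

    visit(structured_data)
    return products
```

Fields such as the name or the offer price can then be read from each returned node, which tends to be more stable than deeply nested CSS selectors.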

Example code: extract product cards from nested HTML


from bs4 import BeautifulSoup

def extract_products(html: str):
    soup = BeautifulSoup(html, "html.parser")

    products = []

    for product_card in soup.select(".product-card"):
        name_element = product_card.select_one(".product-card__name")
        price_element = product_card.select_one(".product-card__price")
        link_element = product_card.select_one("a[href]")

        products.append({
            "name": name_element.get_text(strip=True) if name_element else None,
            "price": price_element.get_text(strip=True) if price_element else None,
            "url": link_element["href"] if link_element else None
        })

    return products

10. Data Cleaning & Storage Issues

Scraping data does not mean you already have “usable” data. Raw data is usually very “dirty”. Examples:

Price: “1.200.000đ” → cannot calculate immediately
Product name: extra spaces, strange characters
Fields sometimes exist, sometimes not, inconsistent format

If not handled carefully, this web scraping problem will strongly affect the final result: incorrect analysis, wrong calculations, wrong dashboard display.

Causes:

Different formats across pages
Website changes format without notice
No normalization step
Storing raw data without validation

How to handle:

Clean data right after scraping
Normalize format
Handle missing data
Validate before storing
Design clear storage

Example code: clean price and validate product data


import re
from pydantic import BaseModel, HttpUrl, ValidationError

class Product(BaseModel):
    product_id: str
    name: str
    price: int
    url: HttpUrl
    source: str

def clean_price(price_text: str) -> int:
    digits_only = re.sub(r"[^\d]", "", price_text)

    if not digits_only:
        raise ValueError("Invalid price value.")

    return int(digits_only)

raw_item = {
    "product_id": "SKU123",
    "name": " Wireless Mouse ",
    "price": "199,000 VND",
    "url": "https://example.com/product/sku123",
    "source": "example.com"
}

try:
    product = Product(
        product_id=raw_item["product_id"],
        name=raw_item["name"].strip(),
        price=clean_price(raw_item["price"]),
        url=raw_item["url"],
        source=raw_item["source"]
    )

    print(product.model_dump())

# clean_price() raises a plain ValueError, so catch it alongside ValidationError.
except (ValidationError, ValueError) as error:
    print("Invalid product data:", error)
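Product names with extra spaces or strange characters usually need more than `strip()`. One possible approach, sketched below with the standard library only, is Unicode normalization followed by whitespace cleanup; the function name `clean_name` and the choice of NFKC are assumptions made for this illustration.

```python
import re
import unicodedata

def clean_name(raw_name: str) -> str:
    """Normalize a product name: fold odd characters, drop invisibles, tidy spaces."""
    # NFKC folds compatibility characters, e.g. full-width letters and non-breaking spaces.
    normalized = unicodedata.normalize("NFKC", raw_name)

    # Turn every whitespace character (tabs, newlines) into a plain space first,
    # so that removing control characters below does not glue words together.
    spaced = re.sub(r"\s", " ", normalized)

    # Drop control (Cc) and format (Cf) characters such as zero-width spaces.
    printable = "".join(
        character for character in spaced
        if unicodedata.category(character) not in ("Cc", "Cf")
    )

    return re.sub(r" +", " ", printable).strip()
```

Applying the same cleaning function to every source keeps names comparable when the same product is scraped from several pages.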

Example code: store data in SQLite


import sqlite3

connection = sqlite3.connect("products.db")

connection.execute("""
CREATE TABLE IF NOT EXISTS products (
    product_id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    price INTEGER NOT NULL,
    url TEXT NOT NULL,
    source TEXT NOT NULL,
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

def save_product(product):
    connection.execute("""
    INSERT INTO products (product_id, name, price, url, source)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT(product_id) DO UPDATE SET
        name = excluded.name,
        price = excluded.price,
        url = excluded.url,
        source = excluded.source,
        scraped_at = CURRENT_TIMESTAMP
    """, (
        product["product_id"],
        product["name"],
        product["price"],
        str(product["url"]),
        product["source"]
    ))

    connection.commit()

How Web Scraping Problems Impact Business Decisions

Web scraping problems are not just technical issues. In practice, even a small error in your scraping pipeline can quietly affect how your business makes decisions. For example:

A team tracks competitor pricing to optimize their own pricing strategy. But the dataset contains duplicate entries → the average price is skewed → leading to incorrect pricing decisions.

Or:

Data is delayed by just 1–2 hours → the team misses a flash sale window → and loses a key competitive advantage.

The pattern is simple: If your data is not accurate or not timely, your decisions are unlikely to be correct.

When Should You Use Data Scraping Services?

If you are only working on a small project (few products, few keywords, low crawl frequency), building your own scraping system is entirely reasonable.

But when you start to scale, things change. Web scraping problems no longer happen “occasionally”, but start repeating continuously. At that point, the team spends more time fixing issues and keeping the system stable instead of actually using the data.

This is when you should consider a more professional solution.

Instead of building and fixing everything yourself, you can use e-commerce data scraping services from Easy Data, where the system is already designed to meet your needs:

Collect data from both web and app
Customize based on specific goals (product, price, keyword, search…)
Support multiple Southeast Asia markets (Vietnam, Thailand, Indonesia…)
Update data based on your schedule (hourly, daily…)

You no longer need to fix web scraping problems constantly. Instead, you get clean, standardized datasets ready for analysis and decision-making.

Final Thought

Web scraping in 2026 is no longer easy. Each web scraping problem is a potential “breaking point” in your system. But if you clearly understand where the problem happens, why it happens, and how to handle it correctly, you can absolutely build a stable and scalable scraping system.

And if you want to shorten implementation time and avoid constantly “putting out fires”, a more practical solution is to use a full-service data scraping provider.

Why does a scraper work at first but fail later?

Because websites can:

Detect bot behavior over time
Change their HTML structure
Apply rate limiting

=> A scraper must be designed to handle change, not just work once.

How can you avoid getting blocked when web scraping?

Common approaches include:

Reducing request speed
Using proxies
Simulating real browser behavior
Avoiding login-protected data

=> The key is to avoid acting like a bot.

Why is scraped data often incorrect or incomplete?

Common causes:

Incorrect selectors
Dynamic content not rendered
DOM structure changes
No data validation step

=> That’s why data cleaning and validation are critical.

What is the hardest part of web scraping?

Not the code itself, but:

Anti-bot systems
Scaling infrastructure
Maintaining data consistency

=> These are what make real-world scraping challenging.
