Version: 6.0.0

Web Connector

Overview

The Web Connector lets you extract webpage data by providing a URL and choosing one of three scraping methods. Whether you need data from a single page, a network of interconnected pages, or a well-structured sitemap, the connector can handle it. The extracted data can then be used within our products ACE Search and Chat.


Getting Started

1. Providing a URL

The first step in using the web scraper is to provide the URL of the webpage you want to scrape. This URL will serve as the starting point for the scraping process, regardless of which method you choose.

2. Choosing a Scraping Method

Once you've provided the URL, you can choose one of the following scraping methods:

A. Single Page Scraping

  • Description: This method extracts all the data from the specific webpage you provided. It's the simplest and quickest option, ideal for when you only need content from a single page.
  • How It Works: The scraper fetches and processes all the text from the webpage without following any further links.
  • Best For: Blog posts, articles, product pages, or any standalone content.
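Conceptually, single-page extraction amounts to fetching the HTML and keeping only its visible text. The following standard-library sketch illustrates the idea; it is not the connector's actual implementation, and the sample HTML is made up:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = "<html><head><style>p{}</style></head><body><h1>Title</h1><p>Body text.</p></body></html>"
print(extract_text(page))  # -> Title Body text.
```

In a real run the HTML would come from an HTTP response rather than an inline string.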

B. Recursive Scraping

  • Description: This method is designed to go deeper into the web of interconnected pages. It starts from the provided URL and follows links to other pages, capturing content up to two levels deep (our current default setting). This is useful for gathering content from related pages or sections.
  • How It Works:
    • The scraper begins with the provided URL
    • It extracts content from that page
    • It identifies links on the page
    • It follows these links and extracts content from the linked pages
  • Best For: Documentation websites, knowledge bases, multi-page articles, or related content sections.
  • Limitations:
    • Only internal links (same domain) are followed
    • Default depth is limited to two levels to prevent excessive crawling
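The steps above can be sketched as a breadth-first crawl with a depth limit and a same-domain filter. This is an illustrative model, not the connector's code; the in-memory `SITE` dictionary stands in for real HTTP fetches:

```python
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
SITE = {
    "https://example.com/": ("Home", ["/docs", "https://other.com/x"]),
    "https://example.com/docs": ("Docs index", ["/docs/intro"]),
    "https://example.com/docs/intro": ("Intro", ["/docs/deep"]),
    "https://example.com/docs/deep": ("Deep page", []),
}

def crawl(start: str, max_depth: int = 2) -> dict:
    """Breadth-first crawl: follow internal links up to max_depth levels deep."""
    domain = urlparse(start).netloc
    seen = {start}
    frontier = [start]
    results = {}
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            text, links = SITE.get(url, ("", []))
            results[url] = text  # extract content from this page
            for link in links:
                absolute = urljoin(url, link)
                # Only follow internal links (same domain), each page once
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    next_frontier.append(absolute)
        frontier = next_frontier
    return results

pages = crawl("https://example.com/")
```

With the default depth of two, the crawl captures the start page plus pages one and two links away; `/docs/deep` (three levels down) and the external `other.com` link are excluded.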

C. Sitemap Scraping

  • Description: This method utilizes the website's sitemap, a structured XML file that lists all the pages on a website, to methodically scrape content from multiple pages.
  • How It Works:
    • The user provides the URL of the sitemap (usually something like https://example.com/sitemap.xml)
    • The connector parses the sitemap to identify all listed URLs
    • It then sequentially processes each URL listed in the sitemap, extracting content from each webpage
  • Best For: Complete website scraping, e-commerce sites, documentation pages, or any website with a structured sitemap that outlines all the available pages.
  • Advantage: Most comprehensive and systematic approach to capturing all content from a website.
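Parsing a sitemap is straightforward: it is an XML document whose `<loc>` elements list every page URL. A minimal standard-library sketch, with the sitemap inlined here for illustration (in practice it would be downloaded from an address like `https://example.com/sitemap.xml`):

```python
import xml.etree.ElementTree as ET

# Minimal example sitemap, inlined for illustration.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>"""

def parse_sitemap(xml_text: str) -> list:
    """Return every <loc> URL listed in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]

urls = parse_sitemap(SITEMAP_XML)
# Each URL would then be fetched and its text extracted, one page at a time.
```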

Setting Up a Schedule

The Web Connector allows you to set up recurring scraping schedules to keep your data up-to-date:

  • Enable Recording: Toggle this option on to activate scheduled scraping
  • Frequency: Choose how often the scraper should run:
    • Daily
    • Weekly
    • Monthly
  • Value: Specify the interval (e.g., every 1 day, every 7 days, etc.)
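One plausible reading of the Frequency and Value settings is "run every (Value × frequency-unit) days". The helper below is hypothetical, purely to make that arithmetic concrete (it approximates a month as 30 days and is not how the scheduler is actually implemented):

```python
from datetime import datetime, timedelta

# Hypothetical mapping of the UI's frequency options to day counts.
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}  # month approximated as 30 days

def next_run(last_run: datetime, frequency: str, value: int = 1) -> datetime:
    """Compute the next scheduled run: last run + (value * frequency unit)."""
    days = FREQUENCY_DAYS[frequency.lower()] * value
    return last_run + timedelta(days=days)

last = datetime(2024, 1, 1, 8, 0)
print(next_run(last, "weekly", 1))  # 2024-01-08 08:00:00
```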

Permission Management

You can control who has access to the scraped data:

  • Select user groups from the permissions dropdown
  • Grant access to multiple groups as needed
  • Only users in the selected groups can see and interact with this knowledge source

Status Indicators

After adding a web source, the system displays status information:

| Status  | Meaning                                                     |
| ------- | ----------------------------------------------------------- |
| Enable  | Configuration complete but data processing not yet finished |
| Success | Data has been scraped and indexed successfully              |
| Error   | Issue encountered during scraping or indexing               |

Best Practices

For optimal results with the Web Connector:

  • Start Small: Begin with single page scraping to test the quality of extracted content before moving to recursive or sitemap methods
  • Check Robots.txt: Ensure the website allows scraping by checking its robots.txt file
  • Respect Rate Limits: Don't schedule scraping too frequently to avoid being blocked
  • Use Descriptive Names: Name your knowledge sources clearly for easy identification
  • Review After Scraping: Always check the extracted content quality after the initial scraping
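Checking robots.txt can be done programmatically with Python's standard library. Here the rules are inlined for illustration; `RobotFileParser` can also fetch a live file via `set_url(...)` followed by `read()`:

```python
from urllib import robotparser

# Example robots.txt rules, inlined for illustration.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check whether a generic crawler may fetch a given URL
print(rp.can_fetch("*", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("*", "https://example.com/private/key"))  # False
```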

Troubleshooting

| Issue                  | Solution                                                                       |
| ---------------------- | ------------------------------------------------------------------------------ |
| No content extracted   | Verify the URL is accessible and contains text content                         |
| Missing expected pages | For sitemap scraping, ensure the sitemap is complete and up-to-date            |
| Rate limiting errors   | Reduce scraping frequency or contact the site administrator                    |
| Incomplete content     | Try a different scraping method or check whether content is loaded dynamically |
| Extraction errors      | Verify the website doesn't use anti-scraping techniques                        |

By following this guide, you can effectively use the Web Connector to extract valuable data from websites and make it available for search and retrieval within the platform.