Version: 6.0.0

Web Connector

Overview

The Web Connector lets you extract webpage data by providing a URL and choosing one of three scraping methods. Whether you need data from a single page, a network of interconnected pages, or a well-structured sitemap, the connector can handle it. The extracted data can then be used within our products ACE Search and Chat.


Getting Started

1. Providing a URL

The first step in using the web scraper is to provide the URL of the webpage you want to scrape. This URL will serve as the starting point for the scraping process, regardless of which method you choose.

2. Choosing a Scraping Method

Once you've provided the URL, you can choose one of the following scraping methods:

A. Single Page Scraping

  • Description: This method extracts all the data from the specific webpage you provided. It's the simplest and quickest option, ideal for when you only need content from a single page.
  • How It Works: The scraper fetches and processes all the text from the webpage without following any further links.
  • Best For: Blog posts, articles, product pages, or any standalone content.
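Conceptually, single-page extraction amounts to fetching the HTML and keeping only its visible text. The following standard-library sketch illustrates the idea; it is not the connector's actual implementation, and the sample HTML is made up:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = "<html><head><style>p{}</style></head><body><h1>Title</h1><p>Body text.</p></body></html>"
print(extract_text(page))  # -> Title Body text.
```

In a real run the HTML would come from an HTTP response rather than an inline string.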

B. Recursive Scraping

  • Description: This method is designed to go deeper into the web of interconnected pages. It starts from the provided URL and follows links to other pages, capturing content up to two levels deep (our current default setting). This is useful for gathering content from related pages or sections.
  • How It Works:
    • The scraper begins with the provided URL
    • It extracts content from that page
    • It identifies links on the page
    • It follows these links and extracts content from the linked pages
  • Best For: Documentation websites, knowledge bases, multi-page articles, or related content sections.
  • Limitations:
    • Only internal links (same domain) are followed
    • Default depth is limited to two levels to prevent excessive crawling
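The steps above can be sketched as a breadth-first crawl with a depth limit and a same-domain filter. This is an illustrative model, not the connector's code; the in-memory `SITE` dictionary stands in for real HTTP fetches:

```python
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
SITE = {
    "https://example.com/": ("Home", ["/docs", "https://other.com/x"]),
    "https://example.com/docs": ("Docs index", ["/docs/intro"]),
    "https://example.com/docs/intro": ("Intro", ["/docs/deep"]),
    "https://example.com/docs/deep": ("Deep page", []),
}

def crawl(start: str, max_depth: int = 2) -> dict:
    """Breadth-first crawl: follow internal links up to max_depth levels deep."""
    domain = urlparse(start).netloc
    seen = {start}
    frontier = [start]
    results = {}
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            text, links = SITE.get(url, ("", []))
            results[url] = text  # extract content from this page
            for link in links:
                absolute = urljoin(url, link)
                # Only follow internal links (same domain), each page once
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    next_frontier.append(absolute)
        frontier = next_frontier
    return results

pages = crawl("https://example.com/")
```

With the default depth of two, the crawl captures the start page plus pages one and two links away; `/docs/deep` (three levels down) and the external `other.com` link are excluded.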

C. Sitemap Scraping

  • Description: This method utilizes the website's sitemap, a structured XML file that lists all the pages on a website, to methodically scrape content from multiple pages.
  • How It Works:
    • The user provides the URL of the sitemap (usually something like https://example.com/sitemap.xml)
    • The connector parses the sitemap to identify all listed URLs
    • It then sequentially processes each URL listed in the sitemap, extracting content from each webpage
  • Best For: Complete website scraping, e-commerce sites, documentation pages, or any website with a structured sitemap that outlines all the available pages.
  • Advantage: Most comprehensive and systematic approach to capturing all content from a website.
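Parsing a sitemap is straightforward: it is an XML document whose `<loc>` elements list every page URL. A minimal standard-library sketch, with the sitemap inlined here for illustration (in practice it would be downloaded from an address like `https://example.com/sitemap.xml`):

```python
import xml.etree.ElementTree as ET

# Minimal example sitemap, inlined for illustration.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>"""

def parse_sitemap(xml_text: str) -> list:
    """Return every <loc> URL listed in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]

urls = parse_sitemap(SITEMAP_XML)
# Each URL would then be fetched and its text extracted, one page at a time.
```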

Setting Up a Schedule

The Web Connector allows you to set up recurring scraping schedules to keep your data up-to-date:

  • Enable Recording: Toggle this option on to activate scheduled scraping
  • Frequency: Choose how often the scraper should run:
    • Daily
    • Weekly
    • Monthly
  • Value: Specify the interval (e.g., every 1 day, every 7 days, etc.)
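One plausible reading of the Frequency and Value settings is "run every (Value × frequency-unit) days". The helper below is hypothetical, purely to make that arithmetic concrete (it approximates a month as 30 days and is not how the scheduler is actually implemented):

```python
from datetime import datetime, timedelta

# Hypothetical mapping of the UI's frequency options to day counts.
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}  # month approximated as 30 days

def next_run(last_run: datetime, frequency: str, value: int = 1) -> datetime:
    """Compute the next scheduled run: last run + (value * frequency unit)."""
    days = FREQUENCY_DAYS[frequency.lower()] * value
    return last_run + timedelta(days=days)

last = datetime(2024, 1, 1, 8, 0)
print(next_run(last, "weekly", 1))  # 2024-01-08 08:00:00
```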

Permission Management

You can control who has access to the scraped data:

  • Select user groups from the permissions dropdown
  • Grant access to multiple groups as needed
  • Only users in the selected groups can see and interact with this knowledge source

Status Indicators

After adding a web source, the system displays status information:

| Status  | Meaning                                                     |
| ------- | ----------------------------------------------------------- |
| Enable  | Configuration complete but data processing not yet finished |
| Success | Data has been scraped and indexed successfully              |
| Error   | Issue encountered during scraping or indexing               |

Best Practices

For optimal results with the Web Connector:

  • Start Small: Begin with single page scraping to test the quality of extracted content before moving to recursive or sitemap methods
  • Check Robots.txt: Ensure the website allows scraping by checking its robots.txt file
  • Respect Rate Limits: Don't schedule scraping too frequently to avoid being blocked
  • Use Descriptive Names: Name your knowledge sources clearly for easy identification
  • Review After Scraping: Always check the extracted content quality after the initial scraping
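Checking robots.txt can be done programmatically with Python's standard library. Here the rules are inlined for illustration; `RobotFileParser` can also fetch a live file via `set_url(...)` followed by `read()`:

```python
from urllib import robotparser

# Example robots.txt rules, inlined for illustration.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check whether a generic crawler may fetch a given URL
print(rp.can_fetch("*", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("*", "https://example.com/private/key"))  # False
```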

Troubleshooting

| Issue                  | Solution                                                                       |
| ---------------------- | ------------------------------------------------------------------------------ |
| No content extracted   | Verify the URL is accessible and contains text content                         |
| Missing expected pages | For sitemap scraping, ensure the sitemap is complete and up-to-date            |
| Rate limiting errors   | Reduce scraping frequency or contact the site administrator                    |
| Incomplete content     | Try a different scraping method or check whether content is loaded dynamically |
| Extraction errors      | Verify the website doesn't use anti-scraping techniques                        |

By following this guide, you can effectively use the Web Connector to extract valuable data from websites and make it available for search and retrieval within the platform.