# Web Data Extraction

## Overview

<figure><img src="https://3372553292-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FwSmFzVcnCrIF1wbU8Chb%2Fuploads%2F48Bs4H4epX6PR8QJlt9v%2FGetURL.png?alt=media&#x26;token=25bc709e-76f8-47b8-ad03-6c002fdb6823" alt=""><figcaption><p>Get URL Block</p></figcaption></figure>

Web Data Extraction is a block that extracts metadata and full-page content from web pages. It is useful for applications such as web scraping, data mining, and content analysis.

It is highly customizable, allowing you to specify the types of data you want to extract and the format in which you want to receive it.

Whether you are a researcher, marketer, or data analyst, Web Data Extraction can help you extract valuable insights from the web.

It works best in combination with web search and knowledge extraction blocks.

## How to Set Up

1. Provide a URL data point as the input. It can come from another action block or from a manual input block.
2. Select the output data points you want to extract.

{% hint style="warning" %}
Web content that has **more than 4,000 characters** (around 1,000 English words) can't be processed live during the workflow.
{% endhint %}

{% hint style="info" %}
To work with longer web page content, first collect it in the "Documents" section, and then run "Internal Search" across it from the workflow.
{% endhint %}
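
If you prepare content outside the platform, a quick length check can tell you in advance which route a page will need. The sketch below is only an illustration of the limit stated above: the names `LIVE_CHAR_LIMIT`, `fits_live_processing`, and `choose_route` are hypothetical, and it assumes the limit is counted on the extracted text as plain characters, which this page does not specify.

```python
# Hypothetical pre-check; these names are not part of the platform.
LIVE_CHAR_LIMIT = 4_000  # limit stated in the warning above


def fits_live_processing(text: str, limit: int = LIVE_CHAR_LIMIT) -> bool:
    """Return True if the text is within the stated live-processing limit."""
    return len(text) <= limit


def choose_route(text: str) -> str:
    """Suggest "live" for short content, "documents" for longer content."""
    return "live" if fits_live_processing(text) else "documents"


if __name__ == "__main__":
    short_page = "A short article. " * 10
    long_page = "A much longer article. " * 500
    print(choose_route(short_page))
    print(choose_route(long_page))
```

Content routed to "documents" would then be collected in the "Documents" section and queried with "Internal Search", as described in the hint above.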

## Inputs and Outputs

<table><thead><tr><th width="142">Input</th><th width="229.33333333333331">Output</th><th>Output Description</th></tr></thead><tbody><tr><td>Target link (URL)</td><td>Meta Title (Text)</td><td>Title of the web page</td></tr><tr><td></td><td>Meta Description (Text)</td><td>Description of the page</td></tr><tr><td></td><td>Meta Image (Image)</td><td>Social media image of the page</td></tr><tr><td></td><td>Full-page text (Text)</td><td>Text extracted from the full-page HTML, structured by headlines or paragraphs</td></tr><tr><td></td><td>Full-page HTML (Text)</td><td>HTML of the target web page</td></tr><tr><td></td><td>Links to media on the page (URL) - Soon</td><td>Links to all the images or videos from the page</td></tr><tr><td></td><td>Links to other pages/websites on the page (URL) - Soon</td><td>Links to all other web pages from the target page</td></tr></tbody></table>
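
To make the metadata outputs more concrete, here is a minimal sketch of how such fields are commonly read from a page's HTML. It assumes the title comes from the `<title>` tag, the description from the `description` meta tag, and the social media image from the Open Graph `og:image` tag. These are common web conventions, not a documented description of how the block works internally, and the names `MetaExtractor` and `extract_metadata` are hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the page title, meta description, and og:image URL."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.description = None
        self.image = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            prop = (attrs.get("property") or "").lower()
            content = attrs.get("content")
            # Keep the first occurrence of each field.
            if name == "description" and self.description is None:
                self.description = content
            elif prop == "og:image" and self.image is None:
                self.image = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_metadata(html: str) -> dict:
    """Return a dict loosely mirroring the block's metadata outputs."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "meta_title": "".join(parser.title_parts).strip(),
        "meta_description": parser.description,
        "meta_image": parser.image,
    }


sample_html = """<html><head>
<title>Example Page</title>
<meta name="description" content="A short summary.">
<meta property="og:image" content="https://example.com/cover.png">
</head><body><h1>Hello</h1></body></html>"""

metadata = extract_metadata(sample_html)
print(metadata)
```

Pages that omit one of these tags yield `None` for that field in this sketch (or an empty string for the title); how the block itself handles missing metadata is not covered here.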
