Extract Content from HTML

Design Studio provides six actions for extracting content from a tag in an HTML page:

  • The Extract action is used to extract text content from the tag, optionally including the HTML tags.
  • The Extract URL action is used to extract a URL from a tag attribute containing a URL, and make that URL absolute.
  • The Extract Tag Attribute action is used to extract the value of a tag attribute.
  • The Extract Target action is used to extract binary data of any kind, typically images and PDF files.
  • The Extract Form Parameter action is used to extract a form parameter from a form URL in the found tag and then store its value in a variable.
  • The Extract Selected Option action is used to extract the selected option from a <select> tag and then store it in a variable.
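
For readers who think in code, the following is a rough Python analogy of what four of the actions above produce: text content, a tag attribute, an absolute URL, a form parameter, and a selected option. It is only a sketch using the Python standard library, not Design Studio itself; the page URL, the HTML snippet, and the TagCollector class are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, parse_qs

# Hypothetical page location and markup, used only for this illustration.
PAGE_URL = "https://example.com/shop/index.html"
HTML = """
<a id="next" href="../search?q=lamp&page=2">Next <b>page</b></a>
<select name="color">
  <option value="r">Red</option>
  <option value="g" selected>Green</option>
</select>
"""


class TagCollector(HTMLParser):
    """Collects the pieces that the extraction actions would return."""

    def __init__(self):
        super().__init__()
        self.link_attrs = {}          # attributes of the found <a> tag
        self.link_text = []           # text fragments inside the found tag
        self.selected_option = None   # value of the selected <option>
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id") == "next":
            self.link_attrs = attrs
            self._in_link = True
        elif tag == "option" and "selected" in attrs:
            self.selected_option = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.link_text.append(data)


collector = TagCollector()
collector.feed(HTML)

# Comparable to Extract: text content of the tag, without the HTML tags.
text = "".join(collector.link_text)

# Comparable to Extract Tag Attribute: the raw attribute value.
href = collector.link_attrs["href"]

# Comparable to Extract URL: the attribute value resolved to an absolute URL.
absolute_url = urljoin(PAGE_URL, href)

# Comparable to Extract Form Parameter: one parameter taken from a URL.
page_param = parse_qs(urlparse(absolute_url).query)["page"][0]

# Comparable to Extract Selected Option: the option marked as selected.
selected = collector.selected_option
```

In Design Studio each of these values would be stored in a variable by the corresponding action; here they are ordinary Python variables.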

To reformat (or normalize) the extracted content, configure data converters in the converter list of the Extract or Extract Tag Attribute action.
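
Conceptually, a list of data converters behaves like a chain in which each converter receives the output of the previous one. The sketch below illustrates that idea in Python under that assumption. The converter functions and the sample text are hypothetical and do not correspond to the names of Design Studio's built-in converters.

```python
import re


def strip_whitespace(value):
    """Remove leading and trailing whitespace."""
    return value.strip()


def collapse_spaces(value):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", value)


def keep_amount(value):
    """Keep only the first number-like token, if there is one."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", value)
    return match.group(0) if match else value


def drop_thousands_separator(value):
    """Remove commas used as thousands separators."""
    return value.replace(",", "")


def apply_converters(value, converters):
    """Run the converters in list order, feeding each result to the next."""
    for convert in converters:
        value = convert(value)
    return value


converters = [strip_whitespace, collapse_spaces, keep_amount, drop_thousands_separator]

raw = "  Price:\n   $1,299.00  "   # hypothetical extracted text
normalized = apply_converters(raw, converters)
```

The order of the list matters: reordering the converters can change the result, which is why the converters are configured as an ordered list rather than as independent options.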

The Extract from PDF action is used to extract text from a PDF document contained as binary data in a selected attribute. It extracts the data and produces an HTML page that contains the data in a structured form that your robot can access. Use this action as an initial step before the actual data extraction. In the steps that follow, you can loop over the produced HTML and extract text.