...
The URLs are human-annotated for threat and contextual classifications using both individual annotators and data annotation platforms. The Verity team runs classification processes, checks the results, and determines remediation or enhancement steps.
Minimum Page Reporting Requirements
Web pages must meet certain minimum requirements in order for Verity to successfully process the page content:
The URL specified in the page request must be valid and meet these requirements:
Start with http:// or https://.
Have a properly URL-encoded address.
Any request parameter values must be properly URL-encoded.
Verity must be able to download HTML from the page URL.
Verity will attempt to extract content for analysis from pages that meet the above requirements.
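The URL requirements above can be checked before a page is submitted. The following is a minimal sketch, not Verity's own validation logic: the function name `meets_minimum_url_requirements` is hypothetical, and "properly URL-encoded" is assumed to mean that the URL contains only characters permitted by RFC 3986 and that every `%` begins a two-digit hex escape.

```python
import re
from urllib.parse import urlsplit

# Assumption: "properly URL-encoded" means only RFC 3986 unreserved and
# reserved characters appear literally, plus "%" for percent-escapes.
_ALLOWED_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")
# A "%" that is not followed by exactly two hex digits is a broken escape.
_BROKEN_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def meets_minimum_url_requirements(url: str) -> bool:
    """Hypothetical pre-check mirroring the documented URL requirements."""
    # Requirement: the URL starts with http:// or https://.
    if not url.startswith(("http://", "https://")):
        return False
    # Requirement: the address and parameter values are URL-encoded,
    # so raw spaces, non-ASCII characters, and stray "%" are rejected.
    if not _ALLOWED_CHARS.fullmatch(url):
        return False
    if _BROKEN_ESCAPE.search(url):
        return False
    # A host must be present for the page to be downloadable at all.
    try:
        return bool(urlsplit(url).netloc)
    except ValueError:
        return False
```

A check of this kind only covers the URL itself. Whether Verity can download HTML from the page still depends on the site, as described in the sections that follow.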
Verity’s content extraction function can successfully process a wide range of web page designs and HTML markup. However, some known issues may impede the extraction of usable web page content.
Review the limitations detailed in the following sections:
Content Extraction Limitations
The following table summarizes some of the known issues Verity may encounter when downloading and extracting pages for analysis.
Limitation | Description |
---|---|
Maximum characters per page | Verity processes only the first 20,000 characters on any page in any supported language. According to the Verity team’s research, the majority of web pages are under 7,500 characters, and few pages exceed the 20,000-character limit. |
Infinite scrolling pages | Infinite scrolling enables users to keep scrolling through information on a web page, without clicking a “Load More” or “Next Page” option. Many platforms, such as espn.com, have implemented Infinite Scrolling, as information loads quickly and maintains user engagement. In many Infinite Scrolling environments, each component page of the Infinite Scroll page has its own URL and the URL changes as the content is loaded. As Verity has a 20,000 character maximum limit and only processes page URLs that are specifically requested by the partner, Verity typically does not process the complete content of an Infinite Scrolling page. |
Dynamically rendered pages | Dynamic web pages contain content that is generated automatically from a web server via JavaScript, instead of being hard-coded on the page. The content of the page may change based on multiple variables, for example, new data on the web server or user selection. The content of these pages can only be reliably discovered by rendering the page. Verity therefore does not attempt to classify dynamically rendered pages. |
Home pages | Home pages for a site may have more complicated layouts than the main corpus of the site content and often contain text passages quoted from other pages on the site. Verity’s contextual categorization of home page content may therefore be less useful than the classification of other pages on the site. |
Intricate page layouts | Some sites may implement complex HTML and CSS schemes that may require rendering to reveal the main body text of the pages. These design practices are not typically employed by established publishers and therefore rarely impede Verity content extraction. |
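To illustrate the 20,000-character limit from the table above, the following minimal sketch estimates how much of a page's text would fall outside it. The function name `analyzed_portion` is hypothetical, and the sketch assumes the limit is counted on already-extracted text. The source does not state exactly how Verity counts characters.

```python
MAX_CHARS = 20_000  # limit stated in the Content Extraction Limitations table


def analyzed_portion(page_text: str, limit: int = MAX_CHARS) -> tuple[str, int]:
    """Return the text within the limit and the number of characters beyond it.

    Assumption: characters are counted on extracted text, not raw HTML.
    """
    kept = page_text[:limit]
    return kept, len(page_text) - len(kept)
```

For a typical page under 7,500 characters the second value is 0. For a long infinite-scroll page it indicates roughly how much text would not be considered.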
Site Access Limitations
Partner restrictions on website access may limit Verity’s ability to download content. Typically, to bypass partner site restrictions, Verity partners configure their Allow lists to enable Verity user agents to access their content.
Limitation | Description |
---|---|
Websites with login required | Some websites may require user login before any content is displayed. In these cases, Verity will return an error and will not attempt to classify the content. However, in most cases partners add Verity user agents to their Allow list so this issue does not arise. |
Geographic content | Content is often tailored to a specific geographic market, for example for News, Sports, or Streaming sites. The site may be designed to effectively serve a local market, or to conform to region-specific regulations such as GDPR. Websites may automatically detect a user’s geographic location based on their IP address and dynamically serve the content targeted to their region. Verity user agents run in the U.S.A. and may be served content targeted to that market from these websites. However, most multi-national publishers run websites with country-specific domains for each nation they serve. Verity will classify the content of the country-specific page URL requested. |
Paywall | Many Publisher websites are protected by a paywall and limit access to their content in various ways. Verity can often extract enough content from these pages to successfully perform a classification. In most cases, however, the Publisher has added Verity user agents to their Allow list, so the paywall does not impact Verity. |
Rate limits | Web properties may want to reduce their exposure to DoS (Denial of Service) or bot attacks. Multiple requests within a short time span may trigger the website to block subsequent requests from Verity. In this case, Verity is unable to extract page content until the block is lifted. |
Robots.txt | A Robots.txt file may limit access to a site or parts of a site. The site may also limit the number of pages that can be downloaded (for example, only 10 pages per month). This may limit Verity’s ability to download content from the site. |
Fake Page Content | In theory, a Publisher could set up a page to return different content for a page URL in order to manipulate Verity’s classification results. In practice, Verity has not encountered an issue of this kind. |
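A site owner who wants to confirm that a robots.txt file does not block a given crawler can test the rules offline with Python's standard `urllib.robotparser` module. This is a minimal sketch: the user agent name `ExampleVerityBot` and the rules in `EXAMPLE_ROBOTS` are placeholders invented for illustration, not actual Verity user agent strings. It also covers only standard Allow/Disallow rules, not page-count quotas.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: one named crawler may read everything except
# /private/, while all other crawlers are blocked from the whole site.
EXAMPLE_ROBOTS = """\
User-agent: ExampleVerityBot
Disallow: /private/

User-agent: *
Disallow: /
"""
```

With these example rules, `is_allowed(EXAMPLE_ROBOTS, "ExampleVerityBot", "https://example.com/news/story.html")` returns `True`, while the same URL checked for any other user agent returns `False`.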
Text Ingestion Limitations
...