Limitation

Description

Maximum characters per page

Verity processes only the first 20,000 characters of any page, in any supported language. According to the Verity team’s research, most web pages contain fewer than 7,500 characters, and few exceed the 20,000-character limit.
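
The effect of the character limit can be illustrated with a minimal sketch. This is not Verity's implementation: the function name `truncate_for_analysis` is hypothetical, and it assumes that "characters" are counted as Unicode code points, which the description does not confirm.

```python
# Illustrative only: the limit value comes from the description above;
# the function name and the way characters are counted are assumptions.
MAX_CHARS = 20_000


def truncate_for_analysis(page_text: str, limit: int = MAX_CHARS) -> str:
    """Return the leading portion of the text that would be analyzed."""
    return page_text[:limit]


typical_page = "a" * 7_500   # under the limit, analyzed in full
long_page = "b" * 25_000     # over the limit, trailing 5,000 characters ignored

print(len(truncate_for_analysis(typical_page)))  # 7500
print(len(truncate_for_analysis(long_page)))     # 20000
```

Under these assumptions, anything past the limit never reaches classification, so the content that matters most for classification is best placed early in the page text.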

Insufficient Content

Where Verity’s content extraction processes cannot extract sufficient relevant content from a page (typically 50 text characters or fewer), Verity is unable to adequately perform classification tasks across text and imagery, and the error message INSUFFICIENT_CONTENT is returned. Excluding insufficient content from Verity analysis means that classifications are made only on meaningful amounts of data, which supports increased accuracy across all classes.
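
The check described above can be sketched roughly as follows. The function name `check_content` is hypothetical. The 50-character threshold is treated as exact, although the description says only "typically". Stripping surrounding whitespace before counting is also an assumption.

```python
from typing import Optional

# Assumption: a fixed threshold of 50; the description says "typically".
MIN_CHARS = 50


def check_content(extracted_text: str, min_chars: int = MIN_CHARS) -> Optional[str]:
    """Return the error code if too little text was extracted, else None."""
    # Assumption: leading and trailing whitespace does not count as content.
    text = extracted_text.strip()
    if len(text) <= min_chars:
        return "INSUFFICIENT_CONTENT"
    return None


print(check_content("Login required."))            # too short to classify
print(check_content("Long article text. " * 20))   # enough text, no error
```

A partner integration might treat this error code as "no classification available" rather than as a transient failure, since resubmitting the same page is unlikely to yield more text.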

Infinite scrolling pages

Infinite scrolling enables users to keep scrolling through information on a web page without clicking a “Load More” or “Next Page” option. Many platforms, such as espn.com, have implemented infinite scrolling because information loads quickly and user engagement is maintained. In many infinite scrolling environments, each component page of the infinite scroll page has its own URL, and the URL changes as the content is loaded. Because Verity has a 20,000-character limit and processes only the page URLs that are specifically requested by the partner, Verity typically does not process the complete content of an infinite scrolling page.

Dynamically rendered pages

Dynamic web pages contain content that is generated automatically from a web server via JavaScript, instead of being hard-coded on the page. The content of the page may change based on multiple variables, for example, new data on the web server or user selection. The content of these pages can only be reliably discovered by rendering the page. Verity therefore does not attempt to classify dynamically rendered pages.

Home pages

Home pages for a site may have more complicated layouts than the main corpus of the site content and often contain text passages quoted from other pages on the site. Verity’s contextual categorization of home page content may therefore be less useful than the classification of other pages on the site.

Intricate page layouts

Some sites may implement complex HTML and CSS schemes that may require rendering to reveal the main body text of the pages. These design practices are not typically employed by established publishers and therefore rarely impede Verity content extraction.

User Generated Content (UGC)

Verity does not process or analyze UGC, such as comments or social media posts. Because UGC changes constantly, Verity does not attempt to provide a UGC content classification that could immediately become outdated.
