What is PDF to HTML?

Many people ask how to convert PDF to HTML.

But what does that really mean? We all know what a PDF is, but converting PDF to HTML means different things to different people. What do you think PDF to HTML means?

Do you just want to be able to view a PDF in a browser? Are you looking to convert your PDF into a flipbook? Do you want to have the look and feel of your PDF – text in place with fonts, shapes and images formatted on a web page?

Are you really asking to have the content of the PDF extracted so that it can be presented in a reflowable or responsive way depending on the device? Reading your digital magazine on a smartphone would be far easier if the articles have been extracted using some kind of PDF to HTML process.

Different interpretations of converting PDF to HTML

Here are the most common ideas of what converting PDF to HTML actually means:

“I want to get my publication across all devices”

One way to look at PDF to HTML is converting your magazine, newspaper or catalog into a digital replica that can be viewed on multiple devices, without using Flash. A lot of companies provide this service, though many still use Flash for the desktop version, which gives readers a very different experience on tablets and phones.

More progressive digital publishing solutions (Realview included) don’t use Flash at all and provide a pure HTML solution when displaying digital replica editions as flipbooks. At a minimum, from simply uploading your PDF, you can get a flipbook digital replica that will display on any device with a browser – desktop, laptop, tablet and phone. All hyperlinks should automatically be active, and there should be easy options to embed your new digital publication on your website, send it in an email or share it through social media.

Even when converting your publication from PDF into a flipbook, there are plenty of different approaches.

One approach (the most common) is to use a PDF converter to convert PDF to JPG: each page is converted into an image, so your publication is presented as a series of images that can be viewed on any device that has a browser. HTML is used to arrange those pages into what looks like a printed publication, although the result is still just PDF to JPG. The text, images and other elements of the page are not broken down into their individual parts; each PDF page simply becomes one JPG image. There may be a flip transition, with some flipbook software even providing a skeuomorphic representation of the page turning, while others show a slide.
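To make the idea concrete, here is a minimal Python sketch of the HTML side of this approach. It assumes the PDF pages have already been exported as JPG files by some converter; the file names and the function name build_image_flipbook are made up for illustration and are not part of any particular product.

```python
from html import escape


def build_image_flipbook(page_images, title="Publication"):
    """Return a bare-bones HTML document that shows one <img> per page."""
    pages = "\n".join(
        f'  <img class="page" src="{escape(src, quote=True)}" '
        f'alt="Page {number}" loading="lazy">'
        for number, src in enumerate(page_images, start=1)
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(title)}</title>\n"
        "<style>.page { max-width: 100%; display: block; margin: 0 auto; }</style>\n"
        "</head><body>\n"
        f"{pages}\n"
        "</body></html>"
    )


# Hypothetical file names: one JPG per page of the original PDF.
html_doc = build_image_flipbook(["page-001.jpg", "page-002.jpg"])
```

A real flipbook would add page-turn transitions and navigation on top, but the underlying content is still one picture per page.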

One drawback, however, is that image-based digital replica publications are hard to read on smaller devices such as mobile phones, because the reader needs to zoom in and pan around in order to read. Sometimes the image resolution is just not high enough to read the smallest text on the page, particularly as screen resolution and pixel density increase, as on retina devices.

This method of PDF to HTML is fine for most applications as long as the page images are high enough resolution and of sufficient quality so that the text can be read easily and clearly when a reader has zoomed in. In some instances the download size of images may get quite big depending on the original size of the page, so this method may not be suitable for publishing on smartphones for some types of publications.
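The resolution trade-off can be estimated with simple arithmetic. The sketch below is a rough rule of thumb rather than a formula from any specific tool: it assumes a page image stays sharp when it has at least one image pixel per physical screen pixel at the deepest zoom level you want to support. The numbers in the example are illustrative.

```python
import math


def required_image_width(css_width_px, device_pixel_ratio, max_zoom):
    """Image width in pixels needed for a page to stay sharp at max zoom."""
    return math.ceil(css_width_px * device_pixel_ratio * max_zoom)


# A phone showing the page 375 CSS pixels wide, on a 3x display, zoomed in 4x.
width = required_image_width(375, 3, 4)
print(width)  # 4500
```

An image 4500 pixels wide for every page is a large download on a mobile connection, which is why image-only replicas can struggle on smartphones.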

“I expect vector text and raster images”

Increasingly there is a move towards digital publishing solutions that segment the pages into text and images. This enables the text to be rendered using the original font from the PDF, ensuring the text is crisp and readable regardless of the zoom level.

The page is still displayed as it was intended in a replica form, and HTML is used to position the text and graphical elements.
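As a rough illustration of positioning with HTML, the Python sketch below places each piece of text at its original location using absolutely positioned elements. The TextRun structure and its field names are assumptions made for this example; real converters extract far more detail (kerning, rotation, embedded font files and so on).

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TextRun:
    text: str
    x: float      # left edge, in points from the left of the page
    y_top: float  # top edge, in points from the BOTTOM of the page (PDF style)
    font: str
    size: float   # font size in points


def render_page(runs, page_width, page_height):
    """Return one HTML <div> per page with absolutely positioned text."""
    spans = []
    for run in runs:
        top = page_height - run.y_top  # flip: CSS measures down from the top
        style = (
            f"position:absolute; left:{run.x}pt; top:{top}pt; "
            f"font-family:'{escape(run.font, quote=True)}'; "
            f"font-size:{run.size}pt; white-space:pre"
        )
        spans.append(f'<span style="{style}">{escape(run.text)}</span>')
    page_style = f"position:relative; width:{page_width}pt; height:{page_height}pt"
    return f'<div class="page" style="{page_style}">' + "".join(spans) + "</div>"


# One line of text, one inch in from the top-left of a US Letter page.
page = render_page([TextRun("Hello & welcome", 72, 720, "Georgia", 12)], 612, 792)
```

Because the text remains real text rather than pixels, the browser redraws it sharply at any zoom level.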

This approach can be hit and miss, depending on the implementation used both to convert the PDF to HTML and to display the page. HTML is rendered differently by different browsers, so sometimes the pages will not look the same to everyone.

An example of this approach is the file format Scalable Vector Graphics, or SVG. SVG uses vector-based text and line art (non-photographic images) and raster-based images for photographs and backgrounds. SVG has been supported by the popular browsers for many years, but due to the rendering differences it has never really gained popular support.
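For readers who have not seen SVG, the Python sketch below builds a tiny page that mixes a raster photograph with vector text. The file name, headline and layout values are invented for illustration. It uses the plain href attribute on the image element; older viewers may expect xlink:href instead.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_svg_page(width, height, photo_href, headline):
    """Return SVG markup with a raster photo on top and vector text below."""
    svg = ET.Element("svg", {"xmlns": SVG_NS, "viewBox": f"0 0 {width} {height}"})
    # Raster part: the photograph fills the top half of the page.
    ET.SubElement(svg, "image", {
        "href": photo_href, "x": "0", "y": "0",
        "width": str(width), "height": str(height // 2),
    })
    # Vector part: the headline stays sharp at any zoom level.
    text = ET.SubElement(svg, "text", {
        "x": "36", "y": str(height // 2 + 48),
        "font-family": "Georgia", "font-size": "24",
    })
    text.text = headline
    return ET.tostring(svg, encoding="unicode")


svg_markup = build_svg_page(612, 792, "photo.jpg", "Course of the month")
```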

There are other methods of displaying vector text and images. Realview has perfected one approach; take a look at this example:

When you zoom in to this publication you can see that the text is crisp and clear at all times, and the images are of good quality and resolution. The best part is that the total file size of each page is far less than that of the original print-ready PDF.

“I want my PDF converted into a re-flowable solution”

Sometimes when people think about converting PDF to HTML, what they really mean is breaking the PDF down so that it can be re-purposed in ways other than a digital replica page. In other words, identifying the different elements on the page (text, fonts, line art, images, shapes etc), the relationship they have with each other, and then storing them for later retrieval and re-assembly in different ways.

The challenge of this approach is how to keep the style associated with the page when the content is extracted, and subsequently, how to re-apply the styles on different devices in order to reflect the original look and feel of the page.
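One way to picture this is to store each extracted piece of content with a role instead of a position, and let a stylesheet decide how each role looks on each device. The Python sketch below is only an illustration of that idea; the role names and the render_article function are hypothetical, and it skips the hard part, which is working out which text belongs to which article.

```python
from html import escape

# Assumed mapping from extracted roles to semantic HTML tags.
TAG_FOR_ROLE = {"headline": "h1", "body": "p", "caption": "figcaption"}


def render_article(blocks):
    """Turn (role, text) pairs into reflowable, position-free HTML."""
    parts = []
    for role, text in blocks:
        tag = TAG_FOR_ROLE.get(role, "p")  # unknown roles fall back to <p>
        parts.append(f'<{tag} class="{role}">{escape(text)}</{tag}>')
    return "<article>\n" + "\n".join(parts) + "\n</article>"


article_html = render_article([
    ("headline", "Reading on a small screen"),
    ("body", "The first paragraph of the article."),
    ("body", "The second paragraph of the article."),
])
```

Since nothing is pinned to page coordinates, the same markup can be restyled for a phone, a tablet or a desktop browser.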

Whilst the previous approach goes some way towards achieving this by breaking down the text and images on a page, identifying and grouping articles across individual pages, along with their associated style elements, requires human intervention. Realview has created the technology and the team to make PDF to HTML articles a reality.

Have a look at the above Golf example as a mobile site here (best viewed on your mobile):

Once the information has been extracted from a PDF, there is no limit to how it can be presented digitally, as well as being ready to distribute to services such as Apple News, Snapchat Discover and Facebook Instant Articles.

From standard print ready PDF files, Realview already delivers:

  • Your publication into a fast, branded mobile web viewer.
  • A WordPress plugin to populate your WordPress site with articles from your print publication.

With many more products in development.

Sign up today to express your interest and be the first to get your publication on mobile straight from your PDF!