Best approach for a server-side HTML to image/pdf converter ?

Posted: Thu Dec 10, 2020 10:54 am
by Eric
(followup from https://github.com/salvadordf/CEF4Delphi/issues/329)

Hi,

I am experimenting with using CEF4Delphi server-side to convert HTML to an image or PDF, as a replacement for outdated webkit-based tools like phantomjs or wkhtmltopdf.

What would be the simplest and safest way: an OSR-mode browser or a normal-mode one?
The difference from the demos is that there would never be a visible TForm, and it would run in command-line mode.

Just looking for pointers in the right direction, in case one of the options is a known dead end...

To illustrate the issues, one use case would be the website screenshots on https://beginend.net. phantomjs is used there right now, but it does not support ECMAScript 6 (ES6), so it breaks on any "modern" website.

In that use case, the screenshot is generated through phantomjs scripting: the url is loaded, the script waits a bit to give the page time to load, then performs some DOM tricks (like hiding cookie banners...) before finally taking a screenshot.

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Thu Dec 10, 2020 11:40 am
by salvadordf
Hi,

Sorry for asking you to move this conversation to the forum but I prefer to leave the GitHub issues for bugs only.

The right approach would be to use a browser in "off-screen" mode (OSR) because it can be used without a real user interface.

The tricky part would be putting this browser in a console application, because many methods in Chromium are asynchronous. You have something similar in the ConsoleBrowser demo, but that demo has a windowed browser in a DLL and you don't need that. You would also need to use a different EXE for the subprocesses.

I'll try to create a new demo with something similar to what you describe during the weekend.

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Thu Dec 10, 2020 2:22 pm
by Eric
Thanks! No problem about moving the conversation here!

I will look at automating the rest (cookie banner stuff, image conversion options...) once the basic demo exists, and report the progress here in case that is useful to other people.

There are other ways to do it (Selenium, Puppeteer, etc.) but they all involve a rather heavy infrastructure. Having everything controlled directly from an exe like phantomjs did is more straightforward.

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Sun Dec 13, 2020 5:58 pm
by salvadordf
Hi Eric,

Please download CEF4Delphi from GitHub and take a look at the new ConsoleBrowser2 demo.

It uses a browser in OSR mode and a different EXE for the subprocesses. All that is encapsulated in a thread and it's used in a console application.

Read the code comments for more information.

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Mon Dec 14, 2020 8:28 am
by Eric
Thanks!

I had a little bit of trouble at first, because setting FrameworkDirPath, ResourcesDirPath & LocalesDirPath in the main process is not enough. Either the subprocess executable needs to be in the CEF directory, or SetCurrentDir must be used in the main process (as the current directory carries over to the subprocess).
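
The SetCurrentDir workaround can be sketched as follows. This is only a minimal sketch: the 'cef' folder name and its location next to the executable are assumptions made for illustration, not something the demo prescribes.

```pascal
program SetCefWorkingDir;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Hypothetical folder name: use wherever the CEF binaries actually live.
  CEF_SUBDIR = 'cef';

var
  CefDir: string;

begin
  // Assumption: the CEF binaries sit in a subfolder next to this executable.
  // ExtractFilePath keeps the trailing path delimiter.
  CefDir := ExtractFilePath(ParamStr(0)) + CEF_SUBDIR;

  // Change the current directory before CEF is initialized, so that the
  // subprocess launched later starts from the same directory.
  if not SetCurrentDir(CefDir) then
  begin
    Writeln(ErrOutput, 'Cannot switch to CEF directory: ', CefDir);
    Halt(1);
  end;

  // FrameworkDirPath, ResourcesDirPath and LocalesDirPath would still be
  // set on the CEF application object after this point, as in the demo.
  Writeln('Working directory: ', GetCurrentDir);
end.
```

Failing early when the directory is missing makes a wrong deployment layout obvious, instead of surfacing later as a subprocess that cannot find its libraries.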

There is also an intermittent crash in TCEFBrowserThread.SaveSnapshotToFile when debugging the main executable. It occurs when the method is called with a Self value of nil, and the line below does not guard against that:

if (FBrowserInfoCS = nil) then exit;

It would be trivial to guard against a nil Self here, but I am not sure whether that would just be sweeping another issue under the proverbial rug.

The call stack when that happens is:

uCEFBrowserThread.TCEFBrowserThread.SaveSnapshotToFile('snapshot.bmp')
uEncapsulatedBrowser.TEncapsulatedBrowser.Thread_OnSnapshotAvailable(???)
uCEFBrowserThread.TCEFBrowserThread.WebpagePostProcessing
uCEFBrowserThread.TCEFBrowserThread.Execute
:0066a2ba TEncapsulatedBrowser.Thread_OnSnapshotAvailable + $E
:0040a60a ThreadWrapper + $2A
:77c1fa29 KERNEL32.BaseThreadInitThunk + 0x19
:77dc75f4 ntdll.RtlGetAppContainerNamedObjectPath + 0xe4
:77dc75c4 ntdll.RtlGetAppContainerNamedObjectPath + 0xb4

It occurs on the second call to SaveSnapshotToFile (there is a first, "correct" call before it).
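
For reference, the kind of guard meant above looks like this. It is a sketch on a hypothetical stand-in class (TSnapshotSource and its FLock field are made up for illustration), not the CEF4Delphi code. A non-virtual method can be entered with Self = nil, and the crash only happens once a field is read.

```pascal
program NilSelfGuard;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical stand-in for the browser thread class.
  TSnapshotSource = class
  private
    FLock: TObject;
  public
    constructor Create;
    destructor Destroy; override;
    function SaveSnapshot(const aFileName: string): Boolean;
  end;

constructor TSnapshotSource.Create;
begin
  inherited Create;
  FLock := TObject.Create;
end;

destructor TSnapshotSource.Destroy;
begin
  FLock.Free;
  inherited Destroy;
end;

function TSnapshotSource.SaveSnapshot(const aFileName: string): Boolean;
begin
  Result := False;
  // Test Self first: reading FLock through a nil Self is what crashes.
  // Short-circuit evaluation stops at the first condition that is true.
  if (Self = nil) or (FLock = nil) then Exit;
  // The real work (writing aFileName) would go here.
  Result := True;
end;

var
  Source: TSnapshotSource;

begin
  Source := nil;
  // Call on a nil reference: the guard makes this return instead of crashing.
  Writeln(Source.SaveSnapshot('snapshot.bmp'));

  Source := TSnapshotSource.Create;
  try
    Writeln(Source.SaveSnapshot('snapshot.bmp'));
  finally
    Source.Free;
  end;
end.
```

The guard only hides the symptom. Whatever lets the second call arrive with a nil instance is still there, which is the concern raised above.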

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Mon Dec 14, 2020 11:13 am
by Eric
I have run some tests using the same executable for the subprocess (switching between main and subprocess behavior based on command-line parameters), and it seems to work fine.

Is there a hidden reason why it should not be done?

One reason for asking is that I have seen the subprocess executable flagged by antivirus several times, I guess because the subprocess exe does not do "enough" and so trips the heuristics. This happens on and off when building the demos (I have reported them as false positives, with varying success).
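
A minimal sketch of that kind of switch, for illustration only. It relies on Chromium launching its helper processes with a "--type=..." command-line switch; the IsCefSubprocess helper is a hypothetical name, and the two branches are placeholders for the real main-process and subprocess startup code. As far as I know, the library's own StartMainProcess call makes a similar distinction internally, so this is not a replacement for it.

```pascal
program SingleExeSwitch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: returns True when the command line carries the
// "--type=" switch that Chromium passes to renderer, GPU and utility
// processes. The browser (main) process is started without it.
function IsCefSubprocess: Boolean;
const
  TYPE_SWITCH = '--type=';
var
  i: Integer;
begin
  Result := False;
  for i := 1 to ParamCount do
    if Copy(ParamStr(i), 1, Length(TYPE_SWITCH)) = TYPE_SWITCH then
    begin
      Result := True;
      Exit;
    end;
end;

begin
  if IsCefSubprocess then
    // Placeholder: hand control to the CEF subprocess entry point here.
    Writeln('running as CEF subprocess')
  else
    // Placeholder: create the off-screen browser and take the snapshot here.
    Writeln('running as main process');
end.
```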

FWIW, my current hacked version of your demo is at https://github.com/EricGrange/cefHtmlSnapshot

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Mon Dec 14, 2020 11:41 am
by salvadordf
Thanks for reporting this issue! :)

I saw some possible causes of that error and I'll upload the new version as soon as I fix it for Lazarus.

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Mon Dec 14, 2020 11:44 am
by salvadordf
The antivirus warning is a known false positive.
Sadly, some dishonest people use CEF too and some antiviruses don't have the best detection algorithm.

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Mon Dec 14, 2020 12:31 pm
by salvadordf
I just uploaded a new version with some fixes, more checks and more code comments.

Please download CEF4Delphi again from GitHub.

Re: Best approach for a server-side HTML to image/pdf converter ?

Posted: Mon Dec 14, 2020 2:21 pm
by Eric
Thanks, works like a charm now.

I have adapted TakeSnapshot for PrintToPDF purposes, and it seems to work as well.

Now on to busting those cookie banners... :D