splitting pdf slides: 4-up to 1-up
Many 4-up slides are quite inconvenient for onscreen reading. And quite a few times, no other options are available. Here is a small and rough solution to the problem. First, you need to have Python on your system. Download and extract the pyPdf package. Download code from here, to the source directory of the pyPdf package and run it from there. To use it give input and output file names and number of slides per page (<row> <col>) on the input pdf, and the program should work. If the slides don’t show up in proper order, give the order of slides you want, at the end of command line.
If you would like to install the pyPdf package, you would have to copy page4eachXobj function to PageObject in pdf.py and add ‘import copy’ to the top of the file. While copying, be careful with indentation. After this, my file could be placed anywhere.
The file has two methods to convert 4-up/6-up/… slides to a more convenient one slide per page. First method adv_slide_split gives better output(ex: similar to original slides), but works in very few cases. This method doesn’t use the row column information, given at input. Second method slide_split works by dividing the page into equal sized pages, usually not as good as the original 1-up, but works with most of the cases. If possible, the program will use the first method, else the second one.
Rest of the post is about my experience with the problem. Not a required read for using the program.
First off, you need to know that pdf is, in simple words, a collection of data and instructions on how to present the data. Its different/complex from markup languages, as it has to deal with various devices, like printer, and not just the web. And has to be able to represent complex typography and graphics etc, in a concise way.
A pdf file contains lots of <num> 0 obj and endobj pairs in it. Called Indirect Objects, these are used as a reference to objects in the file. Example, from collection of pages to postscript instructions and fonts and colors etc. There are few kinds and types of objects. Name objects starting with ‘/’, dictionary objects enclosed in ‘<<‘ ‘>>’ pair, array objects in ‘[' ']‘ pair, stream object mostly used for data. For our purpose, these should be enough. Indirect objects making it easy to share objects among objects! <num> 0 R is used to refer to an object. I have skimmed, skipped and missed much of the information. If you are interested, read the manual [4.5Mb]. Chapter 3 is the most relevant part.
I was lucky enough to stumble upon a nicely generated slide to dig into, similar to this one . A casual look into the raw file shows /Page tags.
7 0 obj << /Type /Page /Contents 8 0 R /Resources 6 0 R /MediaBox [0 0 841.89 595.276] /Parent 9 0 R >> endobj
With the /Page tags are /Resources with an object reference. The referred object is /XObject dictionary with references to 4 objects.
6 0 obj << /XObject <</Im2 2 0 R /Im3 3 0 R /Im4 4 0 R /Im5 5 0 R>> /ProcSet [ /PDF ] >> endobj
I removed 3 of them from the dictionary. The pdf file indeed showed only the remaining slide (with empty space for others), while xpdf complained about missing references. Manual mentions that /XObject names external objects. That means the names must be used somewhere else. But searching for those names show only the /XObject entries. Page objects entry in manual mentions that /Contents is where the contents of page are described. But /Contents refers to a /FlateDecode stream (essentially a zipped binary of the description of page). Fortunately, I didn’t have to look inside it. Need was just to display those external references, one on each page. The external references in /XObject themselves were of type /XObject.
2 0 obj << /Type /XObject ... >> stream ... endstream endobj
Manual mentions details about the type and how to display it. To have a proof of concept, I changed a Page objects /Contents key to point to a custom stream, and /Page to contain one slide. Here’s the output. More work was needed to convert the whole document. New pages would be needed, as each page would have to be replaced with 4 new once, references would have to be updated, page counts at all levels need to be updated, xref would have to be updated etc.
This needed considerably more effort than I intended. Again, fortunately, I came across pyPdf. Its a versatile library. In the README itself, it shows the trick of cropping a page (Although I would call it zooming). So all I had to do was to copy page 4 times and only show relevant part in each of them.
The above method doesn’t depend on slides being available as separate resources, but used the concept of MediaBox. So it would work with most pdf files. But, if the slides are available as separate resources, the earlier method results in a pdf closer to the original sources. For comparison of the output, here are slides generated with method using XObject and the MediaBox one. Fortunately the library also makes it easier for me to implement the earlier /XObject method. It depends on the internals of pyPdf implementation. So either you would have to copy it to the pyPdf source file, or import all the internal classes of pyPdf.
In all, this is less than a days (cumulative) work.