Synchronize a PDF with a spreadsheet

Project Logs

Collapse

Landing page + cell moves
lion mclionhead • 04/09/2025 at 00:14 • 0 comments

Made a landing page which works quite well. 1 problem is every pdfmerge.py has to be restarted manually when the tokens expire. That happens every 24 hours. It's the new pain with the newfound ease of running multiple pdfmerges.

The big problem continued to be repositioning cells. For now, sheets have just grown down & right as missing fields were added. There's also been just filling out the entire spreadsheet before filling out the form.

It would start by loading the entire sheet in 1 GET. The smart way might be hard coding a maximum sheet size or selecting the bottom right cell + 10. Specifying A1:Z256 as the range makes it return just the cells which are taken. If a form needs more than that range, you have bigger problems. There has to be a new rule where a form can only use 1 spreadsheet.

Then it would search in a radial pattern for matching data around blank cells. The user would have to do the search operation right after moving the data & before changing the data. For now, it only searches 2 cells in 4 directions. This tries to avoid the chance of picking the wrong cell while still allowing the user to accidentally move a cell 2 places over. The previous behavior only allowed shifting a 1 row or column at a time.

The key errors are no matching data in another cell, the matching cell being too far away, the moved cells being different offsets.

It tries to give some minimal diagnostics in a modeless dialog.

The next step was making the resync operation download the entire sheet instead of loading 1 cell at a time. This was much faster than loading 1 cell at a time, but was the final stage of limiting it to 1 sheet per form.

To handle multiple documents sharing the same sheet, what's really needed is a way to combine multiple forms in a single project & have a single pdfmerge.py serve them all. They would have to poll the server for changes. It can't rely on any command line tools because the authentication requires a browser.

It's become a growing problem with growing complexity.
Wishlist
lion mclionhead • 02/25/2025 at 19:34 • 0 comments

Lions still have a lot of errors in their tax returns. It not a bug in pdfmerge but the complexity of the tax code. It might be smarter to just use paid tax software. Most animals have all their investments in tax advantaged accounts & all the rest in a house, so they're not dealing with the problem areas. Lions never bought a house so their tax situation is upside down.

----------------------------------------------------------------------------------------------------

The killer limitation is there's no real way to move cells around after they're used in a PDF. Since spreadsheets are shared between multiple forms, every cell move would require remembering every single form which used it & reapplying the same cell shift commands.

The easiest solution would be Goog supporting aliases for cells. There is a way to make an alias for a cell with right click -> more cell actions -> define named range The named range follows the cell when it's moved.

All the named ranges are accessed in the top left.

The named range can be used inside PDFmerge to access the data, but it's really cumbersome to copy. It can't have any spaces. There's no way to get a link to the named range like there is for a cell. The only way is to highlight the name in the menu.

Then copy it & paste it into the URL inside PDFmerge. Maybe if the data is important enough, this can be justified. There might be a way to automatically create a named range for every cell in a form & replace all the cells with the named ranges in every form. PDFmerge has a replace cell function which can replace a single cell with a named range.

Getting back to moving cells around, there could be a command for pdfmerge to search for all of its values which don't match the values in the spreadsheet in the neighboring cells, using an expanding search. The leading mismatch would be a value in the form linking to a blank cell in the spreadsheet. That could at least automate the cell movements in a single form. The most common case seems to be for cells to get shifted around & new data to be added before the next sync, but the change doesn't affect the previous values. There could be a button which performs the search for moved cells outside of the sync operation.

------------------------------------------------------------------------------------------------------------------------------------------

Another pain point is the challenge of opening the right localhost:80xx address in the browser for the right filename. The best option might be multiple instances of pdfmerge in a single 8080 server with the filenames in the URL. The easiest option might be a command line tool which shows all the pdfmerge addresses & what file each one is editing. It would run ps xa|grep pdfmerge.py to get all the PIDs with filenames, then netstat -tulnp to show what ports they're bound to.

The web app should have a redirect page with links to all of its own instances with the filenames.

Having to use a web app with goog's spreadsheet API is the source of a great many limitations. A ground up spreadsheet program & native form editor would be ideal. It would entail importing all the data from goog sheets & open office.

-----------------------------------------------------------------------------------------------------------------------------------------

The syncing process with goog sheets is glitchy & slow at best. Sometimes it needs a few tries to get the latest values. It really needs to download the entire spreadsheet with a single GET instead of shooting a new GET for every cell. This means a form can only reference 1 spreadsheet, but this has been the only use case.

An idea gaining favor is going back to open office & using an XLS to CSV converter to sync with the forms. There's no way to copy the...
Read more »
Goog sheets pain
lion mclionhead • 04/12/2024 at 08:44 • 0 comments

As usual with cloud/web app/server based software, goog sheets has shown more & more problems over time. The biggest problem for PDF merge is the authentication glitches.

Noted if it prompts for a login during a resync, it logs in successfully but fails to update any values. You have to resync again. It hasn't been easy to test since there's no easy way to make the login expire.

Other than that, goog sheets doesn't allow entering multiple lines in a cell. Openoffice allows entering multiple lines with ctrl-enter. Multiple lines in goog sheets can be fenegled by pasting all the lines at once. Then the problem becomes selecting or editing multiple lines, which entails copying the entire cell into another text editor.

Browser based text editing has always sucked a bag of dicks. PDFMerge & goog sheets both suffer in their text boxes. Writing a spreadsheet program from scratch would be a big deal. It's hard to imagine something as platform independent as the current system being any less bug ridden. If it was just for Linux using scratch X11 calls, it would be solid. These glitches are why bare X11 programs continue to persist. The API is just abstracted enough to be more effective than a framework for a lot of tasks but just primitive enough to make a full GUI extremely laborious.
Search & replace
lion mclionhead • 11/11/2023 at 20:05 • 0 comments

Managed to do a tax return with it. Running a new server.py for every form is a bit tedious but it's better than before.

1st big bug of the year was a failure to resync the values. It stopped when it hit the 1st blank cell because if a get_cell returned no "values" tag, it assumed it needed to log in again. The test needs to use something else to log in, but the goog doesn't provide any "values" tag for an empty cell. For now, it assumes if the "range" & "majorDimension" tags exist, the cell is empty.

If it returns nothing, Mozilla retries 5 times before returning an error code. It has to return __LOGIN or __BLANK so there's no way a sheet can contain those values. Then, it draws a really small bounding box if there's no text or a space. In the end, it writes BLANK when a cell is empty.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The mane unknown is how many linked spreadsheet values it supports before it gets too slow. It's really slow by 4. The mane problem is goog sheets requires a unique GET for every link.

The world's most hated form showed it's already way beyond the old days. It's still unknown of the context menu should always show PASTE OVER TEXT if an overlay is highlighted or only if the click is directly over a highlighted overlay. Sometimes you want to copy multiple overlays by clicking outside the selection & sometimes you want to paste new text while something is selected. It might just need both paste options in 1 menu. The problem then becomes what ctrl-v does if something is highlighted. It's less confusing if ctrl-v overwrites the selection. Text editors deal with the problem with a single paste option.

The biggest problem is when moving data in the spreadsheet, you have to manually search the form for every occurrence of that data & repaste it. The number of occurrences can be ponderous in a tax form. It almost defeats the purpose of synchronizing the spreadsheet.
For now, it got a search & replace wedged in. It just replaces all the cells at once.
Keyboard shortcuts
lion mclionhead • 11/11/2023 at 05:07 • 0 comments

The trick with this is some keyboard shortcuts should stay the browser defaults, some shouldn't, & some can't override the defaults at all. Only experience can tell. There's no way to override ctrl-n. There are a lot of corner cases.

With zoom, an intuitive behavior is to press fit width which makes the button the active widget, then press ctrl-+/ctrl-= to zoom. That just zooms the button instead of the canvas.

Another intuitive behavior is copying a text item & immediately pressing ctrl-v which overwrites the already selected text instead of creating new text. If multiple text objects are selected, it overwrites all of them with 1 of the text items. The problem is it can't match the clipboard items with the selected objects so you have to remember to deselect before ctrl-v.

Kind of a mess with all the keyboard shortcuts. The idea is to make it as easy as possible to get started.

It would be nice to know the DPI of the current zoom in print mode. It could print the DPI next to all the zoom menu items. The internet says PDF has no notion of DPI. In reality, pdfjslib always seems to assume 72fpi for 100% zoom. The menu just has to hard code 72 * the zoom factor.

Some heroic javascript hacking got it to right justify the DPI. With that, this program was just about done, pending ongoing bug fixes. The real test wouldn't be until the W2 forms came in. Kind of an insane amount of effort to do so little.
Row/Column shifting
lion mclionhead • 11/09/2023 at 22:23 • 0 comments

The least valuable & final planned feature was the row/column shifting. These 4 operations of such lethal cunning were the only way to synchronize with shifting operations in the spreadsheet. Goog sheet labels the columns letters & the rows numbers. After Z, it goes to AA, AB, AC. For even the most insane financial calculation, the lion kingdom would only go to Z so that was the plan.

Thus, there was a simple test form. Helas, the shifting operations get really slow. There definitely needs to be an hourglass or status symbol.

After so many years of writing web apps, the lion kingdom finally remembered the tool box should go on top. Somehow, the browser tripped up that convention. It more easily allows the canvas to fill the window.

Maybe lions got nostalgic for early 80's graphics programs.

There's bugger all footage of those early graphics programs, but up to at least 1983, they tended to put menus & command lines on the bottom. The way the Images Paint System drew a palette right over the bottom looks hacky now but it was pretty slick in the day. PDFMerge is about as primitive as those early graphics programs.

Another idea was keyboard shortcuts for undo, redo, save, delete, cut, copy, paste, next page, prev page, zoom, select all. Key trapping in javascript is a difficult thing. You have to seek out every place a key could be trapped & manually disable it, replacing it with your own event handler. Text boxes need different event handlers than canvases. It's not like there's a concept of a top level event handler & returning true traps the event.

It looks like goog sheet itself only traps pgup, pgdn in some widgets but not all.
Print
lion mclionhead • 11/09/2023 at 07:47 • 0 comments

For the print mode, it just draws the current page as a normal image. The user can right click & save it as a PNG image at the current zoom level, adjust the zoom, navigate 1 page at a time.

The mane problem was handling mouse wheel zooming when the scroll bars were enabled. It has to set new scrollbar positions in addition to computing a new X & Y from the previous scrollbar positions. Another problem was indicating the current zoom level. It was decided to just not support mouse wheel zooming in viewer mode, for now. It's supposed to just save a PNG image.
cut, copy, paste operations
lion mclionhead • 11/08/2023 at 03:27 • 0 comments

The mane problem with clipboard operations is what goog sheets copies can either be a link or plain text. What pdfmerge needs to copy is the relative X, Y, font size, the text, the link, multiple text objects. Thus all the paste operations need to handle 2 cases: when the clipboard contains a JSON of text objects & when it contains plain text. A paste operation for a new text object needs to create all the copied objects in the same relative positions. A paste operation overwriting a single text area needs to write the 1st text in the clipboard.

Thus we have ways of copying multiple text objects with the same relative positions to other positions & pages.

Replacing just the text in an existing object with a copy of another object.

Replacing the text in an existing text object with copied plain text.

Pasting plain text as a new text object.
Of course links to goog sheet cells can still be pasted as new text objects or pasted over existing text objects.
It can't paste a range of goog sheet cells. That would entail a whole suite of formatting commands to get the right line spacing.
Undo & redo
lion mclionhead • 11/06/2023 at 00:37 • 0 comments

It might be good enough without it, but the undo stack was the last big requirement. The print function may or may not be a buster. Another really nice but unessential feature is a grid.

With the login sequence forcing reloads, the undo stack becomes eternal & server resident. This would be a memory resident undo stack on the server. Kill the server & lose the undo stack like a regular program.

Lions have always saved an undo & redo stack. In this case, the last redo buffer would also go on disk as a saved project. Helas, when the user undid a level, the saved project would no longer be the current redo buffer & the browser would have to show a modified flag. The big problem is the program has to save the project to handle a login attempt, which erases the undo history. The mane things which cause a login attempt without the user editing anything are a page reload & the resync button. It's almost useless beyond 1 level of undo.

Thus began the battle of the apply buttons.

Text entry & font size need apply buttons to propagate text from the user entry to the PDF & they need an edit button to propagate text from the PDF to the user entry. It's a compromise a lot of PCB editors make to keep every keypress from expanding the undo stack. Automatically updating the PDF while typing in text doesn't allow entering replacement text & later selecting a destination for it. Automatically copying the PDF to the user entry upon selection created the problem of unintentionally overwriting previously entered replacement text.

The mane reason font size & text contents need apply buttons is the chance of keypresses getting dropped because the program is busy saving undo buffers & redrawing. It's not a problem to drop scroll wheel events but all the keypress events needs to be processed.

A final reason for having to apply the text contents is the chance of a change in the text triggering an open auth login. That would be very disruptive.

The only automatic update is a mouse wheel change to the font size. Undoing mouse wheel events for the font size is the hardest undo operation. The undo operation is too slow to undo every incremental font size change but some clever programming can compress all the incremental size changes into a single undo & redo buffer.

All this yielded a somewhat complicated system of edit & apply buttons. Maybe there are more automated ways of handling text entry with asynchronous polling loops.

That just leaves print, row, column, cut, copy operations.

Finally, the killer bug is when a fetch fails, it kills the async chain & the busy flag is never unset. The undo buffer is definitely hosed if there's a network error but there's a good chance the document is still preserved.
Every fetch call needs a
.catch(e => {
console.error(e);
});
to avoid crashing.
Service account blues
lion mclionhead • 10/26/2023 at 18:41 • 0 comments

Managing the oauth consent prompts was turning into an impossible dream. The mane problem was the number of places the text drawing could be interrupted & the browser could have to be redirected to a consent prompt, then the text drawing could have to resume.

The internet recommended using a service account for applications that used a server. The general procedure for creating the service account is given on

https://robocorp.com/docs/development-guide/google-sheets/interacting-with-google-sheets

The trick is it generates a JSON file for the service account. You have to copy the client_email from the JSON file to the access list for the spreadsheet. That user has to be set to viewer mode.

Helas, a similar oauth dance is required to get an access token for the service account, except the job of the consent prompt can be performed by the server without interrupting the browser.

Both the goog & the GPT drew blanks for this procedure. There just aren't a lot of animals writing programs in their dorm rooms for accessing the goog.

There was a buried php blob.

https://stackoverflow.com/questions/67338219/a-google-sheet-api-http-request-with-service-account-key-auth-and-without-goog

After translating to python, this failed with 1 of 2 errors:

"error": "invalid_grant",
"error_description": "java.security.SignatureException: Invalid signature for token

or

curl output={"error":"invalid_grant","error_description":"Invalid JWT Signature."}

The only clue was + being replaced by spaces in the response. == was replaced by \u003d\u003d

The --trace-ascii [filename] option causes curl to save all the TCP data to a file, showing all the + & = were properly sent.

-----------------------------------------------------------------------------------------------------------

After many dead ends, the service account idea got abandoned. It actually is more laborious for the user to deal with the JSON file & access list, hence why they're probably not used. The consent prompt only requires the user's existing goog account & the developer console.

The problem with the consent prompt is the need to go into a login sequence in the middle of a redraw operation. The easiest workaround, since it's a rarely used program by 1 lion, is to just save the project & deal with the login sequence when reading a spreadsheet fails. 99% of the effort is preventing an infinite loop if the login fails. It requires multiple reloads. Reload #1 loads the consent prompt. Reload #2 loads the page with a code= argument in order to load the access token. Reload #3 deletes code= from the URL so a user reload works. A state variable in the project tells whether it's in a login sequence or a user reload so it doesn't get stuck reloading.

Maybe there's a way to do it with popups or multiple html files.

Every javascript program involving network calls becomes a nest of async calls with callbacks. They tried to solve this with the promise notation, which lions believe just replaces the callback with a then() block. If you want a busy flag to be set before the 1st async call & unset after the last async/promise in the chain, you're out of luck. There's no mention of the 1st then() waiting for the last nested async. The lion kingdom settled on just calling everything with await instead of using callbacks or promises. The language doesn't stop you from forgetting to use await if the function is async though.

Most language features these days are marketing tools for selling the language, yet programmers still have yet to believe languages are subject to the same business economics as beer & pizza. For a 1 user program where no-one cares about the browser freezing, the mandate for network calls to be async has no purpose besides a marketing tool to sell the language.

That got it to finally pasting & syncing...
Read more »