OCR'ing line printer listings
02-12-2023, 08:02 PM
Post: #1
OCR'ing line printer listings
I'm trying to OCR the HP 9815 ROM listing from the patent (4089059) in order to make something searchable, and ideally re-assemblable as a sanity check.

I've used http://www.onlineocr.net and the results are pretty good (at least compared with the others I tried) but there are still many corrections to be made as well as formatting differences. I think it's probably many times easier than retyping manually, though.

As I work through the corrections, I see many similar character recognition errors, and there are many hints that could perhaps be used to correct them automatically:

- Line printer font, whilst often broken, does degrade in specific ways
- The line numbers should be contiguous
- The addresses increment in a predictable manner
- There is a correlation between data and assembler symbols
- The opcodes, register names etc. are from a limited set

All these would help improve accuracy, but the OCR system is trying to recognise natural language in one of several human languages (selectable). Ideally, it should have options for at least font and assembler syntax.
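
Short of an OCR engine with such options, some of the hints above could be applied as a post-processing pass over the OCR text. The following is only a minimal sketch in Python: the opcode list is a hypothetical placeholder (not the real 9815 mnemonics), the look-alike character mappings are assumptions about how the font degrades, and the function names are made up for illustration.

```python
import difflib

# Hypothetical opcode set for illustration only; substitute the real mnemonics.
OPCODES = ["LDA", "STA", "JMP", "JSR", "ADD", "RTS"]

# Assumed letter-for-digit confusions in fields known to be numeric.
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8", "Z": "2"}
)

# Assumed digit-for-letter confusions in fields known to be alphabetic.
LETTER_FIXES = str.maketrans("015", "OIS")


def fix_number(field):
    """Map letter look-alikes to digits in a field known to be numeric."""
    return field.translate(DIGIT_FIXES)


def fix_opcode(field):
    """Snap a misread opcode to the closest known mnemonic, if one is close."""
    candidate = field.upper().translate(LETTER_FIXES)
    match = difflib.get_close_matches(candidate, OPCODES, n=1, cutoff=0.6)
    # Leave the field untouched when nothing in the set is similar enough.
    return match[0] if match else field


def flag_line_numbers(fields):
    """Return indexes whose line number does not follow its predecessor by one."""
    flagged = []
    previous = None
    for index, field in enumerate(fields):
        text = fix_number(field)
        value = int(text) if text.isdigit() else None
        if previous is not None and value != previous + 1:
            flagged.append(index)
        # Trust the running count over a suspicious reading.
        previous = previous + 1 if previous is not None else value
    return flagged


print(fix_number("01O2"))
print(fix_opcode("5TA"))
print(flag_line_numbers(["0101", "01O2", "0l03", "0184", "0105"]))
```

The same running-count check could be applied to the address column once the number base and the size of each instruction are known. Flagging rather than silently rewriting keeps a human in the loop for the lines where the OCR text and the expected sequence disagree.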

Is there an OCR system that's been specifically trained on line printer output, or even on assembly code?
02-13-2023, 10:36 AM
Post: #2
RE: OCR'ing line printer listings
Have you tried Adobe Acrobat (the full version, not Reader)? Its OCR is very good with line printer output, depending on the quality of the scan, of course.

Tom L
Cui bono?