M-Files OCR Module

We have a client that uses Smart Classifier in their vault. However, we are noticing that many documents are skipped for classification because they lack a text layer. While looking at OCR options, we came across the OCR Module for M-Files. According to its description, the module can be enabled for documents that are scanned into the vault. Does this also cover documents added to the vault via drag and drop, and can the conversion to a fully text-searchable PDF happen automatically?

Or are there other options for this kind of requirement?

Thank you

  • If the client has Smart Classifier, they also have the option to use Discovery. With a bit of trickery you can configure Discovery to identify PDF files without a text layer and add a specific property to those documents. This property can then trigger a workflow that runs OCR on the document and saves the result as a text layer in a new version of the PDF. So it is possible, but it has limits. The OCR process is not suitable for handling large quantities of files, partly because it puts a relatively high load on the server and particularly because it attempts to handle up to 100 files in each batch. If any one of them fails, the remaining files in that batch may end up in limbo. So it would probably be fine for a few new files per hour, but be careful if you need to handle thousands of files.
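
To illustrate the "few files at a time" advice, here is a minimal Python sketch of how a backlog could be fed to OCR in small, throttled batches so that one bad file does not strand the rest. This is not M-Files code: `run_ocr` is a hypothetical callable standing in for whatever starts OCR on a single document (for example, setting the property that triggers the workflow), and the batch size and pause are illustrative values, not recommendations from M-Files.

```python
import time
from typing import Callable, Iterator, List, Tuple


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_backlog(
    doc_ids: List[str],
    run_ocr: Callable[[str], None],  # hypothetical: starts OCR for one document
    batch_size: int = 10,            # assumed value, well below the 100-file batch
    pause_seconds: float = 60.0,     # assumed pause to keep server load low
) -> Tuple[List[str], List[str]]:
    """Submit documents in small batches and record failures instead of aborting."""
    done: List[str] = []
    failed: List[str] = []
    for batch in chunked(doc_ids, batch_size):
        for doc_id in batch:
            try:
                run_ocr(doc_id)
                done.append(doc_id)
            except Exception:
                # Keep going: a single failure must not block the other documents.
                failed.append(doc_id)
        if pause_seconds > 0:
            time.sleep(pause_seconds)
    return done, failed
```

The list of failed IDs can be retried or reviewed by hand later, which avoids the situation where the rest of a batch is left in limbo.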

