I might be missing something, or simply getting old, but I am pretty sure there was an option to force an OCR operation on PDF documents at check-in time (assuming, obviously, that the OCR module is available).
How does one enforce that?
Ooh, this is a good one: a head-scratcher that we did solve.
We had a fun time figuring it out.
We were migrating scanned PDFs from a legacy system and wanted M-Files to perform OCR on them. This put us in the right direction:
We found this thread on the online M-Files Community: community.m-files.com/.../9173
There, Joonas provided the script that prompts M-Files to run OCR on the PDF. This can't be done on initial import because the file has to already be at version 1.
So our workflow brings in the documents, and there is a time delay between version 1 being checked in and the document automatically moving to the next workflow state, which runs the OCR script.
Our notes:
We delayed the transition for automatically imported documents to speed up the import process. Without the delay, the importer waits for OCR to finish before moving on to the next document. With the delay, the documents queue up and the server runs OCR on them afterwards.
Code found at: community.m-files.com/.../10703
'Make workflow pause x minutes before moving on
'Will only work using LastModified (21). Change to (20) to use created date.
'Delay will be minimum the specified x minutes. Can be up to 60 + x minutes depending on when M-Files server checks the conditions.
'Script to be placed in Transition Trigger
'2019.07.04 Karl Lausten
'Modified by Jason vonI Nov 22 2021 to use Modified date/time since it is deeper in the workflow.
Option Explicit

Dim dModified : dModified = PropertyValues.SearchForProperty(21).TypedValue.GetValueAsTimeStamp().UtcToLocalTime().GetValue()

'Desired delay in minutes:
Dim iDelay : iDelay = 2
Dim dGoAhead : dGoAhead = DateAdd("n",iDelay,dModified)

'test time settings to verify the setup.
'err.raise mfscriptcancel, "dModified (UTC converted to local time):" & dModified & ", dGoAhead: " & dGoAhead & ", now (local time):" & now

if now > dGoAhead then
    AllowStateTransition = True
end if
Good luck!
There are some downsides to this approach if you need to handle many documents in each run. The OCR process by default attempts to handle batches of 100 documents. It checks them all out before it starts processing. If something goes wrong halfway through the process the remaining documents are left checked out. This may not be a big issue if the files have been imported to M-Files. However, if they remain on a network folder, they will be left with a Read Only attribute, and you will need direct access to that network folder in order to remove that attribute. So be careful and consider the implications before you set out to run OCR en masse.
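If files on a network folder do end up stuck with the Read Only attribute, someone with direct access to that folder could clear it in bulk. The following is only a rough sketch in Python, not anything supplied by M-Files: the function name clear_read_only and the folder you pass to it are my own assumptions, and it only touches the file attribute. It does nothing about the checked-out state inside the vault, which would still have to be dealt with in M-Files itself.

```python
import os
import stat
from pathlib import Path
from typing import List


def clear_read_only(folder: Path) -> List[Path]:
    """Make every read-only file under `folder` writable again.

    Returns the list of files whose attribute was changed.
    `folder` is a hypothetical path to the affected network folder.
    """
    changed = []
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        mode = path.stat().st_mode
        # On Windows, a missing write bit corresponds to the Read Only attribute.
        if not mode & stat.S_IWRITE:
            os.chmod(path, mode | stat.S_IWRITE)
            changed.append(path)
    return changed
```

Running it against a copy of the folder first, and reviewing the returned list, seems prudent before pointing it at live data.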
Absolutely agree. We have been migrating from a legacy system, and for one of our document categories we had inherited thousands of non-OCR'd PDF scans. We did plenty of testing before we committed to this. We ran into lots of issues before we implemented the time delay; after that, M-Files happily churned out OCR'd PDFs.
We had better luck with a commercially available OCR engine, OmniPage, and our own program, which uses the OmniPage API, the M-Files API, and a workflow to automate the ingestion of documents that need OCR. The user identifies a source folder and a destination folder (the destination is in the Network Folder configuration). We automatically copy all files that are not PDFs or TIFFs to the destination. We then inspect each PDF file, and if it has already been OCR'd we move it to the destination directory. If it has not been OCR'd, we perform the OCR task using the API, and we convert TIFFs to PDF. We also track which documents fail OCR. We convert all the PDF documents to managed content and identify the source of the OCR: OmniPage, External, or Failed. This lets us use a robust OCR engine and avoid the problems found in the M-Files supplied tools.
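For anyone wanting to build something similar, the routing decision described above could be sketched roughly like this in Python. This is not the program described in the post, just an illustration of the decision logic. The names route_file, has_text_layer, and run_ocr are hypothetical: the two callables stand in for whatever check and OCR call your engine's API actually provides, and no real OmniPage or M-Files API call is shown.

```python
from enum import Enum
from pathlib import Path
from typing import Callable, Optional, Tuple


class OcrSource(Enum):
    OMNIPAGE = "OmniPage"   # OCR performed by the custom program
    EXTERNAL = "External"   # file arrived already OCR'd
    FAILED = "Failed"       # OCR was attempted and did not succeed


def route_file(
    path: Path,
    has_text_layer: Callable[[Path], bool],  # hypothetical: True if the PDF is already OCR'd
    run_ocr: Callable[[Path], bool],         # hypothetical: wraps the OCR engine, True on success
) -> Tuple[str, Optional[OcrSource]]:
    """Decide what to do with one file from the source folder."""
    suffix = path.suffix.lower()
    if suffix not in {".pdf", ".tif", ".tiff"}:
        # Anything that is not a PDF or TIFF is copied straight to the destination.
        return ("copy", None)
    if suffix == ".pdf" and has_text_layer(path):
        # Already OCR'd elsewhere: move it as-is and record the source.
        return ("move", OcrSource.EXTERNAL)
    try:
        ok = run_ocr(path)
    except Exception:
        # Treat any engine error as a tracked failure rather than stopping the run.
        ok = False
    return ("ocr", OcrSource.OMNIPAGE if ok else OcrSource.FAILED)
```

Passing the check and the OCR call in as parameters keeps the routing testable without the OCR engine installed, and the returned OcrSource value maps onto the OmniPage / External / Failed property mentioned above.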