Auto perform OCR at check-in

I might be missing something, or simply getting old, but I am pretty sure there was an option to force an OCR operation on PDF documents at check-in time (assuming, obviously, that the OCR module is available).

How does one enforce that?

  • Ooh, this is a good one: a head-scratcher that we did solve.

    For us, we had a fun time figuring it out. 

    We were migrating scanned PDFs from a legacy system and wanted M-Files to perform OCR on them. This thread on the M-Files Community pointed us in the right direction: community.m-files.com/.../9173

    There, Joonas provided the script that prompts M-Files to run OCR on the PDF. This can't be done on initial import because the file must already be at version 1.

    So our workflow brings in the documents, waits for a time delay after version 1 is checked in, and then automatically moves each document to the next workflow state, which runs the OCR script.

    Our notes:

    We delayed the transition for automatically imported documents to speed up the import process. Without the delay, the importer waits for OCR to finish before moving on to the next document. With the delay, the documents queue up and the server runs OCR on them afterwards.

    Code found at: community.m-files.com/.../10703

    'Make workflow pause x minutes before moving on
    'Will only work using LastModified (21). Change to (20) to use created date.
    'Delay will be at least the specified x minutes. Can be up to 60 + x minutes depending on when M-Files server checks the conditions.
    'Script to be placed in Transition Trigger
    '2019.07.04 Karl Lausten
    'Modified by Jason vonI Nov 22 2021 to use Modified date/time since it is deeper in the workflow.
    
    Option Explicit
    Dim dModified : dModified = PropertyValues.SearchForProperty(21).TypedValue.GetValueAsTimeStamp().UtcToLocalTime().GetValue()
    
    'Desired delay in minutes: 
    Dim iDelay : iDelay = 2 
    Dim dGoAhead : dGoAhead = DateAdd("n",iDelay,dModified)
    
    'Test time settings to verify the setup.
    'err.raise mfscriptcancel, "dModified (UTC converted to local time):" & dModified & ", dGoAhead: " & dGoAhead & ", now (local time):" & now 
    
    'Allow the transition only once the delay has elapsed.
    If Now > dGoAhead Then
        AllowStateTransition = True
    End If

    Good luck!
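
The timing rule in the script comments above (a delay of at least x minutes, possibly up to 60 + x) can be illustrated outside M-Files. This is a minimal sketch in Python rather than VBScript so that it runs standalone. The function names are hypothetical, and the hourly check interval is an assumption inferred from the "60 + x" remark, not a documented M-Files setting.

```python
from datetime import datetime, timedelta


def transition_allowed(last_modified, now, delay_minutes):
    """Mirror the trigger's test: allow only once now is strictly past last_modified + delay."""
    return now > last_modified + timedelta(minutes=delay_minutes)


def first_allowed_check(last_modified, delay_minutes, first_check,
                        check_interval_minutes=60, max_checks=48):
    """Return the first server check time at which the transition would pass.

    check_interval_minutes=60 is an assumption (see lead-in), not an M-Files constant.
    Returns None if no check within max_checks passes.
    """
    check = first_check
    for _ in range(max_checks):
        if transition_allowed(last_modified, check, delay_minutes):
            return check
        check += timedelta(minutes=check_interval_minutes)
    return None


modified = datetime(2021, 11, 22, 10, 0)

# The server happens to check one minute after check-in: the 2-minute delay
# has not elapsed yet, so the document waits for the next check an hour later.
late = first_allowed_check(modified, 2, datetime(2021, 11, 22, 10, 1))
print(late - modified)

# The server happens to check three minutes after check-in: the transition passes at once.
early = first_allowed_check(modified, 2, datetime(2021, 11, 22, 10, 3))
print(early - modified)
```

Under these assumptions the same 2-minute setting produces an effective wait of anywhere between just over 2 minutes and just over an hour, which is why the delay should be treated as a lower bound only.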

  • There are some downsides to this approach if you need to handle many documents in each run. The OCR process by default attempts to handle batches of 100 documents. It checks them all out before it starts processing. If something goes wrong halfway through the process the remaining documents are left checked out. This may not be a big issue if the files have been imported to M-Files. However, if they remain on a network folder, they will be left with a Read Only attribute, and you will need direct access to that network folder in order to remove that attribute. So be careful and consider the implications before you set out to run OCR en masse.

  • Many thanks for that.

    Still, I remember dragging and dropping some documents into a vault and M-Files prompting me to perform OCR before check-in.

    Did I dream it?

  • I also think there was something like that. Could it be that the OCR module was included by default in older releases (M-Files 2015/2018)?

  • Ah, if you drag and drop image files, such as JPEGs or TIFFs, then M-Files will ask if you want to convert them to PDF.

  • Absolutely agree. We have been migrating from a legacy system, and for one of our document categories we had inherited thousands of non-OCR'd PDF scans. We did plenty of testing before we committed to this. We ran into lots of issues before we implemented the time delay; after that, M-Files happily churned out OCR'd PDFs.

  • We had better luck implementing a commercially available OCR engine, OmniPage, and writing our own program that uses the OmniPage API, the M-Files API and Workflow to automate the ingestion of documents that need OCR. The user identifies a source folder and a destination folder (the destination is in the Network Folder configuration). We then automatically copy all files that are not PDFs or TIFFs to the destination. Next we inspect the PDF files: if a PDF has already been OCR'd, we move it to the destination directory; if not, we perform the OCR task using the API. TIFFs are converted to PDF. We also track which documents fail OCR. Finally, we convert all the PDF documents to managed content and record the source of the OCR: OmniPage, External, or Failed. This allows us to use a robust OCR engine and avoid the problems found in the M-Files supplied tools.
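
The routing rules described in the post above can be sketched as a small decision function. This is only an illustration in Python of the logic as described: the function names and action labels are hypothetical, detection of an existing text layer is passed in as a flag, and no real OmniPage or M-Files API calls are shown.

```python
from pathlib import PurePath


def route_file(filename, has_text_layer=False):
    """Pick the ingestion action for one file, following the rules in the post.

    has_text_layer is supplied by the caller; how to detect it is out of scope here.
    """
    suffix = PurePath(filename).suffix.lower()
    if suffix in (".tif", ".tiff"):
        return "ocr_and_convert_to_pdf"   # TIFFs are OCR'd and converted to PDF
    if suffix == ".pdf":
        # Already-OCR'd PDFs go straight through; the rest need OCR first.
        return "move_to_destination" if has_text_layer else "ocr"
    return "copy_to_destination"          # everything else is copied unchanged


def ocr_source(action, ocr_succeeded=True):
    """Label the origin of the OCR text: OmniPage, External, or Failed."""
    if action == "move_to_destination":
        return "External"                 # OCR was done before the file reached us
    if action in ("ocr", "ocr_and_convert_to_pdf"):
        return "OmniPage" if ocr_succeeded else "Failed"
    return None                           # non-PDF, non-TIFF files carry no OCR label


for name, layered in [("scan.TIF", False), ("report.pdf", True),
                      ("invoice.pdf", False), ("notes.docx", False)]:
    action = route_file(name, layered)
    print(name, action, ocr_source(action))
```

Keeping the routing decision separate from the OCR call makes it straightforward to record failures: the caller runs the engine, then passes the outcome to `ocr_source` to get the label stored with the document.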