OCR mail attachment

Former Member over 6 years ago

In your opinion, what's the best/cleanest way to OCR mail attachments (TIF files from an MFC device) that are imported through an external mail source? In short: an e-mail with a TIF file that needs to be indexed.

While the External File Import source has an option to OCR the files being imported, the mail connector has no such option.

We can use a workflow action to OCR, but the result seems unpredictable: sometimes OCR isn't performed as it should be, and the action only creates a PDF, not a searchable one.

I want to avoid a solution where mail source imports have to be downloaded/extracted by an event handler to a file system folder, only to be imported again by a regular external file import job. That's a bit cumbersome.

godzilla over 6 years ago

Maybe your best bet would be to have some utility detach the TIFF to a folder and use that folder as the source for your import (that's how we solved this issue...)

Former Member over 6 years ago

Maybe your best bet would be to have some utility detach the TIFF to a folder and use that folder as the source for your import (that's how we solved this issue...)

That's a scenario I thought of, but to be honest, I don't understand why the OCR option is not a default option in the mail import source, as it is in the file import source.

Any scripts or hints for utils are very welcome.
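
As a starting point for such a utility, here is a minimal sketch in standalone Python (run outside M-Files, not an M-Files script). It assumes the mails are available as raw message bytes, for example saved .eml files, and that the scanner sends the TIFF as a regular attachment. The function name `detach_tiffs` and the folder layout are made up for illustration; the target folder would then be the source of a normal external file import with OCR enabled.

```python
import email
from email import policy
from pathlib import Path

# Attachment extensions to detach, compared in lower case.
TIFF_EXTENSIONS = {".tif", ".tiff"}


def detach_tiffs(raw_message: bytes, target_dir: str) -> list:
    """Save every TIFF attachment of one e-mail into target_dir.

    Returns the list of paths that were written. Hypothetical helper,
    not part of M-Files.
    """
    message = email.message_from_bytes(raw_message, policy=policy.default)
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    for part in message.iter_attachments():
        # Keep only the base name so the file is written inside target_dir.
        name = Path(part.get_filename() or "").name
        if Path(name).suffix.lower() not in TIFF_EXTENSIONS:
            continue
        destination = out_dir / name
        # Note: an existing file with the same name is overwritten.
        destination.write_bytes(part.get_payload(decode=True))
        saved.append(destination)
    return saved
```

Usage would be something like `detach_tiffs(Path("scan.eml").read_bytes(), "C:/import/ocr")`, where both paths are placeholders. Fetching the mails from the mailbox and avoiding name collisions are left out of the sketch.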

Joonas Linkola over 6 years ago

We can use a workflow action to OCR, but this seems to have an unpredictable result where OCR isn't performed as it should, it only creates a PDF, but not a searchable one.

Did you run the OCR operation in a workflow state action script, or did you just use the built-in PDF conversion option in the workflow state? As far as I know, the PDF conversion doesn't do OCR; you need to trigger OCR in a script. Here's an example:

Option Explicit

' Prepare the files of the object for modification by script.
Dim files
Set files = Vault.ObjectFileOperations.GetFilesForModificationInEventHandler( ObjVer )

' Prepare OCR options.
Dim opts
Set opts = CreateObject( "MFilesAPI.OCROptions" )
opts.PrimaryLanguage = MFOCRLanguageEnglishUS
opts.SecondaryLanguage = MFOCRLanguageFinnish

' Perform OCR on each of the convertible files.
Dim file
For Each file In files

	' Is the file in a convertible file format?
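	' Compare in lower case: an extension such as "TIF" would otherwise not match.
	If LCase( file.Extension ) = "tif" Or _
		LCase( file.Extension ) = "tiff" Or _
		LCase( file.Extension ) = "jpg" Or _
		LCase( file.Extension ) = "jpeg" Or _
		LCase( file.Extension ) = "pdf" Then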

		' Convert this file to searchable PDF.
		Vault.ObjectFileOperations.PerformOCROperation ObjVer, file.FileVer, _
				opts, MFOCRZoneRecognitionModeNoZoneRecognition, Nothing, True

	End If

Next

Former Member over 6 years ago

Hi Joonas, the script looks promising, I'll give it a go. Thanks!

Former Member over 6 years ago
Hi Joonas,

1:

The OCR basically works; however, I sometimes get event log errors about a timeout in the OCR action script when OCR'ing somewhat larger TIF files (4 MB+).

Vault.ObjectFileOperations.PerformOCROperation ObjVer, file.FileVer, opts, MFOCRZoneRecognitionModeNoZoneRecognition, Nothing, True

Is any tuning possible? These aren't excessive file sizes.

2:

The script sometimes crashes on PDF files that have a digital origin, such as a text document that was printed to PDF.

How would one check whether a PDF file is image only, image + text (already OCR'ed), or digital text only?

Joonas Linkola over 6 years ago

1. The OCR tuning options are listed here: Registry settings for scanning and OCR in M-Files

If it seems the workflow action scripts are taking too long to complete and tuning doesn't help, the next option would be to create a VAF background operation that searches the vault for objects to OCR (based on the workflow state, for instance): developer.m-files.com/.../

2. There is no function in the M-Files API to check whether a PDF is already searchable. What you could do programmatically is check whether the PDF file contains strings like "FontName", which typically indicates that a PDF is searchable.
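
The string check could look roughly like the sketch below, again in standalone Python rather than as an M-Files script. It is a heuristic only: it scans the raw bytes for "/Font", which also covers "/FontName". It can miss fonts that sit inside compressed object streams, and it cannot tell an already OCR'ed scan from a digitally created PDF, since both contain fonts. The function names are made up for illustration.

```python
from pathlib import Path

# "/Font" also matches /FontName, /FontDescriptor and /FontFile entries.
FONT_MARKER = b"/Font"


def pdf_probably_has_text(data: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention a font anywhere."""
    return FONT_MARKER in data


def file_probably_has_text(path: str) -> bool:
    """Same check for a PDF file on disk."""
    return pdf_probably_has_text(Path(path).read_bytes())
```

A PDF for which this returns False is a candidate for OCR; one for which it returns True could be skipped to avoid re-processing files that already carry text.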

godzilla over 6 years ago

There is no function in M-Files API to check if a PDF is already searchable. What you could do programmatically is to check if the PDF file contains strings like "FontName", that typically indicates that a PDF is searchable.

A bit of thread drift, and I'm sure this has already been discussed multiple times here, but it would be a very welcome addition (ideally a built-in property "is searchable").