Extract content from a PDF file, how?

The M-Files indexer and a number of IML services can read the text content of text-based file types, so the methods must exist. I just can't find a way to extract that content in VBScript (the API documentation is down at the moment, which makes it difficult to look for one!). I wonder if any of you clever folks remember a method that would let me get the text content of a PDF file and run it through a regex in order to extract metadata from that content?

Thank you, Karl

0 Craig Hawker over 1 year ago

The API documentation being down is frustrating... Trust me.

Doing what you want via VBScript will be painful. Have you exhausted using some of the off-the-shelf IML components for this?

0 bright-ideas.dk over 1 year ago in reply to Craig Hawker

The IML solutions can probably do this. I am not sure they have fully moved from making suggestions to actually setting metadata automatically, but it should happen around this time. However, the customer's current license would have to be upgraded, meaning triple payment compared to their current license. In that situation it is well worth my time to spend an afternoon or a day creating a script. The use case is standardized incoming purchase orders, where it is quite simple to extract the relevant data with regex and where file length is quite limited. So once I get the text content into a variable in the script, it is fairly straightforward. Obviously, it would fail if the incoming file format changes...!

I was able to find some old posts mentioning the GetTextContentForFile method. From the context it seems like it might be the way forward. I hope to get access to the documentation soon to confirm this.

0 godzilla over 1 year ago in reply to 𝕤𝕨𝕖𝕚𝕤𝕖

let me second that request ;)

0 bright-ideas.dk over 1 year ago in reply to 𝕤𝕨𝕖𝕚𝕤𝕖

Beware that computers do not always read the content of a PDF in the same order as humans!
To me the document has "Dato: " and then a date in the format DD-MM-YYYY.
However, in this particular instance the computer gets the date first and then the text string.
I discovered this by copying the content of the PDF into Notepad and examining it there. This procedure is helpful any time you need to create a regex!

'Extract metadata from file content using regex.
'Suited for documents with a fixed content format and limited file size.
'2022.06.27 Karl Lausten
Option Explicit

Dim objID
set objID = ObjVer.ObjID
Dim objVersion
set objVersion = Vault.ObjectOperations.GetLatestObjectVersionAndProperties(objID, true)
Dim myFiles : Set myFiles = objVersion.VersionData.Files
Dim myFile
Dim fileVer
For Each myFile in myFiles
	set fileVer = myFile.FileVer
Next
'This script was made to run on single-file objects. It is not prepared to handle multiple files!
'Running it on documents with multiple files will create problems, such as property values being overwritten in the process below.


Dim szFullText
szFullText = Vault.ObjectFileOperations.GetTextContentForFile(ObjVer, fileVer)
Dim szMatch
Dim oPropertyValue
Dim oPattern : Set oPattern = New RegExp

'get Document date, this section can be repeated with different properties and patterns as needed
Dim iPDDocDate : iPDDocDate = Vault.PropertyDefOperations.GetPropertyDefIDbyAlias("PD.DocumentDate")
With oPattern
	.Pattern = "(\d{2}\-\d{2}\-\d{4})(?=\s*Dato:)"
	.IgnoreCase = True
	.Global = True
End With
set szMatch = oPattern.Execute(szFullText)
Set oPropertyValue = PropertyValues.SearchForProperty(iPDDocDate)
oPropertyValue.TypedValue.SetValue MFDatatypeDate, szMatch.Item(0).SubMatches(0)
Vault.ObjectPropertyOperations.SetProperty ObjVer,oPropertyValue
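
The script above stops with an error at szMatch.Item(0) whenever the pattern finds nothing, and it hands the matched text to SetValue as a string, so the result may depend on the server's date settings. Below is a minimal sketch of one way to guard against both. It assumes a hypothetical helper named ExtractDocDate (not part of the original script or of the M-Files API) that returns Null when there is no match and builds the date explicitly with DateSerial. It uses only the built-in VBScript RegExp object, so it can be tried outside M-Files.

```vbscript
Option Explicit

' Hypothetical helper (an assumption, not part of the M-Files API).
' Returns the DD-MM-YYYY date captured by group iGroup of szPattern as a
' Date value, or Null when the pattern does not match szText.
Function ExtractDocDate(szText, szPattern, iGroup)
    Dim oRegex, oMatches, aParts
    ExtractDocDate = Null

    Set oRegex = New RegExp
    oRegex.Pattern = szPattern
    oRegex.IgnoreCase = True
    oRegex.Global = False          ' only the first match is needed

    Set oMatches = oRegex.Execute(szText)
    If oMatches.Count = 0 Then Exit Function   ' no match: leave the result as Null

    ' Split "DD-MM-YYYY" and build the date from its parts, so the result
    ' does not rely on how the machine parses date strings.
    aParts = Split(oMatches.Item(0).SubMatches(iGroup), "-")
    ExtractDocDate = DateSerial(CInt(aParts(2)), CInt(aParts(1)), CInt(aParts(0)))
End Function

' Example: the date comes first and "Dato:" follows on the next line.
Dim vDate
vDate = ExtractDocDate("22-01-2001" & vbCrLf & "Dato:", "(\d{2}-\d{2}-\d{4})(?=\s*Dato:)", 0)
' vDate now holds 22 January 2001 as a Date value.
```

Inside the M-Files script, the result could then be checked with IsNull before the property is set, for example: If Not IsNull(vDate) Then oPropertyValue.TypedValue.SetValue MFDatatypeDate, vDate.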

0 godzilla over 1 year ago in reply to bright-ideas.dk

MANY thanks !

0 𝕤𝕨𝕖𝕚𝕤𝕖 over 1 year ago in reply to bright-ideas.dk

What do you guys think about creating a GitHub repo for collecting such code snippets?

0 Craig Hawker over 1 year ago in reply to 𝕤𝕨𝕖𝕚𝕤𝕖

It's something I (at M-Files) have toyed with. The issue is maintaining all the random ones. If this were a community-oriented one, very explicitly saying that it's nothing to do with our official ones, then that may work.

I do very much worry about the maintenance though.

0 𝕤𝕨𝕖𝕚𝕤𝕖 over 1 year ago in reply to Craig Hawker

I guess a few others and I would surely help to maintain it. Maybe I'll create a repo if I have time and start uploading a few examples. If you have any suggestions or ideas, let me know.

Edit:

Maybe a gist would be more suitable than a traditional repo.

0 bright-ideas.dk over 1 year ago in reply to 𝕤𝕨𝕖𝕚𝕤𝕖

There are lots of people around (including myself) who are not familiar with GitHub, repos or gists. We wouldn't know the first thing about searching for stuff in those places. It needs to be on a platform that looks familiar to ordinary people without professional developer skills.

0 𝕤𝕨𝕖𝕚𝕤𝕖 over 1 year ago in reply to bright-ideas.dk

The most obvious and familiar solution would be to use an M-Files vault itself, but I cannot provide the resources for that. I don't think a gist would be complicated at all, though. To outline the terminology: GitHub is the platform, and it offers two repository variants. A gist is intended for code snippets, while I would create a classic repo for a whole project.

0 𝕤𝕨𝕖𝕚𝕤𝕖 over 1 year ago in reply to bright-ideas.dk

I created a PDF from Word containing only "Dato: 22-01-2001" and copied your script to a workflow state (of course I also created "PD.DocumentDate"), but it fails at line 35. szFullText contains the text, but szMatch.Count is 0, so I guess it hasn't found anything.

Update:

For me this expression works: "(Dato:\s*)(\d{2}\-\d{2}\-\d{4})"

0 bright-ideas.dk over 1 year ago in reply to 𝕤𝕨𝕖𝕚𝕤𝕖

Yes, as I mentioned in my comment to the code, you always have to adapt the regex to the way the computer sees the text. In my case the computer read the "Dato:" part as being on the line after the actual date, even though it appears to be in front of the date.
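
The two patterns in this thread differ in more than order: with "(Dato:\s*)(\d{2}\-\d{2}\-\d{4})" the date is captured by the second group, so it would be read from SubMatches(1) rather than SubMatches(0). As a minimal sketch, assuming the label is always "Dato:" and the date is always DD-MM-YYYY, a single pattern with two alternatives can accept either reading order. FindDatoDate is a hypothetical helper name, not something from the original script or the M-Files API.

```vbscript
Option Explicit

' Hypothetical helper (an assumption, not part of the M-Files API).
' Finds a DD-MM-YYYY date whether the extracted text places "Dato:"
' before the date or after it. Returns "" when nothing matches.
Function FindDatoDate(szText)
    Dim oRegex, oMatches, i, szGroup
    FindDatoDate = ""

    Set oRegex = New RegExp
    ' First alternative: label, then date. Second alternative: date, then label.
    oRegex.Pattern = "Dato:\s*(\d{2}-\d{2}-\d{4})|(\d{2}-\d{2}-\d{4})(?=\s*Dato:)"
    oRegex.IgnoreCase = True
    oRegex.Global = False

    Set oMatches = oRegex.Execute(szText)
    If oMatches.Count = 0 Then Exit Function

    ' Only one of the two groups takes part in a match, so return the
    ' first group that actually captured some text.
    For i = 0 To oMatches.Item(0).SubMatches.Count - 1
        szGroup = oMatches.Item(0).SubMatches(i)
        If Len(szGroup) > 0 Then
            FindDatoDate = szGroup
            Exit Function
        End If
    Next
End Function
```

This only covers the two orders seen in this thread. Any other layout still needs the Notepad check described earlier.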