Extract content from a PDF file, how?

The M-Files indexer and a number of IML services can read the text content of text-based file types, so methods for this evidently exist. I just can't find a way to extract content in VBScript (the API documentation is down at the moment, which makes it difficult to look up!). I wonder if any of you clever folks remember a method that would let me get the text content of a PDF file and run it through a regex to extract metadata from that content?

Thank you, Karl

  • The API documentation being down is frustrating... Trust me. ;)

    Doing what you want via VBScript will be painful. Have you ruled out the off-the-shelf IML components for this?

  • The IML solutions can probably do this. I'm not sure they have fully changed from making suggestions to actually setting metadata automatically, but that should happen around this time. However, the customer's current license would have to be upgraded, which means paying triple what they pay now. In that situation it is well worth my time to spend an afternoon or a day creating a script. The use case is standardized incoming purchase orders, where it is quite simple to extract the relevant data with regex and where file length is quite limited. So once I get the text content into a variable in the script, it is fairly straightforward. Obviously, it would fail if the incoming file format changes...!

    I was able to find some old postings mentioning GetTextContentForFile. From the context it seems like it might be the way forward. I hope to get access to the documentation soon to confirm this.
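
For the regex step described above, here is a minimal sketch of what the parsing could look like once the text content is in a variable. It is written in Python only to keep it short and runnable; the same patterns should carry over to a script's regex object. The sample text, field labels, and the `extract_fields` helper are all hypothetical, and the text-extraction call itself is deliberately left out since its signature is not confirmed in this thread.

```python
import re

# Hypothetical text of a standardized purchase order, standing in for
# whatever the text-extraction call returns.
SAMPLE_TEXT = """\
PURCHASE ORDER
Order No: PO-2024-00157
Order Date: 2024-03-18
Supplier: Example Supplies Ltd
Total: 1,250.00 EUR
"""

# One pattern per metadata field. The labels and value formats are
# assumptions about the purchase order layout, not a known standard.
FIELD_PATTERNS = {
    "order_number": re.compile(r"^Order No:\s*(\S+)", re.MULTILINE),
    "order_date": re.compile(r"^Order Date:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE),
    "total": re.compile(r"^Total:\s*([\d.,]+)\s*([A-Z]{3})", re.MULTILINE),
}


def extract_fields(text):
    """Return field name -> captured value, or None when a field is missing."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            result[name] = None  # layout changed or field absent
        elif len(match.groups()) == 1:
            result[name] = match.group(1)
        else:
            result[name] = match.groups()  # e.g. (amount, currency)
    return result
```

Returning None for a field that does not match gives the script a way to notice that the incoming format has changed and skip the object, rather than writing wrong metadata.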
