Text Analytics - PDF - Regex Issue

Hey

I'm new to Regex but working my way through.
I have the correct syntax I need to find certain words when testing with a text file, but when I use it on the PDF it will not work.
It must be an issue with columns, spacing, etc.

I was wondering if someone could point me in the right direction on how I could get it to find the value.
In the attachment below, it would be '0.27'.

{
    "targetName": " Total Ex GST",
    "documentContentPattern": "(Total Charges Excluding GST:\s*\$(?<value>(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?$))",
    "Comment": "",
    "enabled": true
}


Also, could I get this to map to a Property Field that is set to Number (real)?

  • Hi Pete
    Your PDF is probably designed using some sort of forms. To our eyes the line you are attempting to analyze looks like one continuous line, but it is actually two contiguous fields. Therefore the computer might read it differently. Try to mark the text in the PDF and copy it to e.g. Notepad. You will probably get the text part on one line and the amount on the next line. If so, you will have to adjust your regex accordingly.
    Last I checked it was not possible to map results to number fields (don't know why). You can map it to a text field.
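
A quick way to check what actually sits between the label and the amount is to print the copied text with its control characters made visible. This is only a rough Python sketch: the `pasted` string is a made-up stand-in for what a copy from the PDF might give, not the real attachment.

```python
# Hypothetical text as it might come out of a copy-paste from the PDF.
pasted = "Total Charges Excluding GST:\r\n$0.27"

# repr() shows control characters as escapes instead of rendering them.
visible = repr(pasted)
print(visible)

# List every white-space character together with its code point.
hidden = [(ch, hex(ord(ch))) for ch in pasted if ch.isspace()]
print(hidden)
```

If the output shows `\r\n` (or a tab) where an editor displays what looks like a single space, the pattern has to allow for that character.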
  • Hey

    Yeah, that's what I'm thinking is happening. In Notepad++ it's showing as just one space, but I'm sure that's wrong.

    Would you know the Regex needed to search across the box?

    cheers
  • If my assumption is correct, you can insert \r\n (Carriage Return & New Line) in the regex.
    I have successfully tested a simplified version of your regex in Regex Hero:
    (Total Charges Excluding GST:\s*\r\n\044(?<value>\d+\056\d{2}))

    Notice that I have replaced your \$ with \044 and your \. with \056.
    Those are the 3-digit octal codes for these characters ($ and .) in the ASCII table.

    Just had another go in my own vault. I can confirm that Analytics still won't deliver input to Number type properties; it only works on Text type properties.
    I also noticed that parts of my invoices are read column by column, so in my case I got 2 labels and then 2 amounts.
    And even though there were CR-LF characters in the text when copied to Notepad, it was not necessary to include \r\n in my regex as long as I had \s* instead. Apparently \r and \n are considered to be white-space characters.
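
The two observations in the reply above (octal escapes, and \s* covering CR-LF) can be tried out with a short Python sketch. Treat it as an illustration under assumptions: the sample strings are invented, and Python's `re` module writes the named group as `(?P<value>...)` rather than the `(?<value>...)` form used in the thread, so behaviour in M-Files itself is not shown here.

```python
import re

# Octal 044 is '$' and octal 056 is '.' in the ASCII table.
assert chr(0o044) == "$" and chr(0o056) == "."

# Hypothetical extracted text: label and amount on separate lines (CR-LF).
two_lines = "Total Charges Excluding GST:\r\n$0.27\r\nGST:\r\n$0.03"
# Hypothetical text-file version: label and amount on the same line.
one_line = "Total Charges Excluding GST: $0.27"

# Pattern with an explicit \r\n and octal escapes, as in the reply.
explicit = re.compile(
    r"Total Charges Excluding GST:\s*\r\n\044(?P<value>\d+\056\d{2})"
)
# Pattern that relies on \s* alone to absorb spaces and line breaks.
whitespace_only = re.compile(
    r"Total Charges Excluding GST:\s*\$(?P<value>\d+\.\d{2})"
)

explicit_two = explicit.search(two_lines)
explicit_one = explicit.search(one_line)
ws_two = whitespace_only.search(two_lines)
ws_one = whitespace_only.search(one_line)

print(explicit_two.group("value"), ws_two.group("value"), ws_one.group("value"))
print(explicit_one)  # no match: the explicit \r\n is required but absent
```

In this sketch the \s* version matches both layouts, while the explicit \r\n version only matches when the line break is really there.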
  • Thanks for that,
    I'm using Expresso to test and your \r\n was what I was needing.
    Thanks a lot.

    Quick question: it seems like M-Files doesn't need \s to represent a space between words. Is this correct?

    Expresso
    (Total\sCharges\sExcluding\sGST:\s*\r\n\044(?<value>\d+\056\d{2}))

    M-Files
    (Total Charges Excluding GST:\s*\r\n\044(?<value>\d+\056\d{2}))

    cheers
  • If, like in your case, you have a predefined text with spaces in it, then you can just include those spaces in the text as shown in your bottom example.
    \s will match any single "white-space" character, which can be a space, tab, carriage return, new line, etc.
    \s* will match zero or more white-space characters; \s+ will match one or more white-space characters.
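
The difference between a literal space, \s, \s* and \s+ described above can be seen in a small Python sketch. The test strings are made up for illustration.

```python
import re

literal = re.compile(r"Total Charges")    # a literal single space
any_ws = re.compile(r"Total\sCharges")    # exactly one white-space character
zero_or_more = re.compile(r"GST:\s*\$")   # white space is optional
one_or_more = re.compile(r"GST:\s+\$")    # at least one white-space character

# A literal space only matches a real space, not a tab.
literal_space = bool(literal.search("Total Charges"))
literal_tab = bool(literal.search("Total\tCharges"))
# \s matches a space, a tab, or a line-break character.
any_ws_tab = bool(any_ws.search("Total\tCharges"))

# \s* also matches when nothing separates the label and the '$'.
star_none = bool(zero_or_more.search("GST:$0.27"))
# \s+ needs at least one white-space character, such as a CR-LF pair.
plus_none = bool(one_or_more.search("GST:$0.27"))
plus_crlf = bool(one_or_more.search("GST:\r\n$0.27"))

print(literal_space, literal_tab, any_ws_tab, star_none, plus_none, plus_crlf)
```

So a literal space is fine for fixed wording, while \s* or \s+ is the safer choice at the spot where the PDF layout may insert a line break.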
  • You're a legend, mate. Thank you very much.