Slow XML import

Former Member
Hia,

We have a conversion mechanism running that converts legacy PostScript output to generic PDF with formatted XML (one XML file per PDF, same naming).
On a daily basis about 5000 files are generated and imported with an external file source job that uses the XML and XPaths for metadata.
It's all working nicely, however, the import into M-Files is so slo...o..w.....

Does anyone have experience speeding up this process, for example by increasing the number of files per import (say 500 instead of 100)?
The (virtual) hardware shouldn't be the issue, nor the SQL vault database.

We have a backlog of let's say a few million files, so any speed increase would be great.


PS: best wishes all!
  • Are you using text recognition (OCR) in the file source job? OCR processing uses a large share of the available CPU resources during import, which can slow it down. If you have OCR enabled, try running the import job with it disabled and see what effect that has. If the source PDFs are already text-searchable, OCR processing is unnecessary anyway.

    If you are not using OCR, could you please describe the current import speed? How long does it take to import, say, 100 documents?
  • Former Member
    Hia,

    OCR is not used in this case and can be ignored (as a test, importing flat TIFF images plus XML files shows the same performance).
    Importing 100 documents takes between 100 and 150 seconds, so basically 1 to 1.5 seconds per file, or about 100,000 per day when running full time.

    That is a lot, of course, but when you are talking about millions, it means we can import about 3 million files per month, so we need 4-5 months of full-time importing.
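
The estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: the backlog size of 3,000,000 files is a hypothetical stand-in for "a few million", and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def files_per_day(seconds_per_file: float) -> float:
    """Files imported per day if the job runs around the clock."""
    return SECONDS_PER_DAY / seconds_per_file


def days_for_backlog(backlog_files: int, seconds_per_file: float) -> float:
    """Days of continuous importing needed to clear the backlog."""
    return backlog_files / files_per_day(seconds_per_file)


if __name__ == "__main__":
    backlog = 3_000_000  # hypothetical backlog size, not from the thread
    for spf in (1.0, 1.5):  # measured range: 1 to 1.5 seconds per file
        print(f"{spf} s/file: {files_per_day(spf):,.0f} files/day, "
              f"{days_for_backlog(backlog, spf):.1f} days for {backlog:,} files")
```

At 1 second per file this works out to 86,400 files per day, and at 1.5 seconds per file to 57,600, so "about 100,000 per day" sits at the optimistic end of the measured range.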
  • Former Member
    Are you using the import tool?

    The GUI has a memory leak that causes it to slow down to a crawl. If you can use the CLI version, it is much faster.

  • Importing 100 documents takes between 100 and 150 seconds, so basically 1 to 1.5 seconds per file, or about 100,000 per day when running full time.


    That sounds about normal, we usually estimate importing speed to be ~1 document/second if we don't have details about the environment. The actual speed depends on a lot of variables such as the hardware specs, the network, the complexity of the metadata structure and any event handlers and other scripts that need to run when the document is added to the vault etc.
  • Former Member
    That sounds about normal, we usually estimate importing speed to be ~1 document/second if we don't have details about the environment. The actual speed depends on a lot of variables such as the hardware specs, the network, the complexity of the metadata structure and any event handlers and other scripts that need to run when the document is added to the vault etc.

    The environment is not the issue. All data is local to the server and there is nothing complex about importing plain-text properties. Server CPU usage doesn't go over 10% and the process barely dents the memory.

    What would be the impact if we duplicate the import job to a second or third, and do parallel imports? Is M-Files ready for concurrent threads in this manner and would that increase import speed?
  • Former Member
    Something strange.

    I manually duplicated the import job as job_2. Everything is exactly the same, except for the name (name_2) and the source folder (folder_2, which is local just like the first job's).

    So now I have the jobs: Import_Converted_Files & Import_Converted_Files_2

    The second job (job_2) imports about 5-10 times faster than the first job. Exactly the same files, metadata etc.
    How??

    Is M-Files doing something funny with the external source name property? It feels like there is some kind of ever-growing registry that makes the job increasingly slow.

    Any clue? Or is this a glitch in the software?
  • Former Member
    And the same again: the second job is slowing down.
    Creating a new, identical third job (job_3) imports much faster again, about 5x faster.
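
One way to put numbers on this kind of slowdown is to note the cumulative number of imported files at a few points in time and compute the rate per interval. The sketch below assumes you can get those counts somehow (for example by counting the files left in the source folder); the sample values in the usage example are invented purely to illustrate the calculation and are not measurements from this thread.

```python
def rates_per_interval(samples: list[tuple[float, int]]) -> list[float]:
    """Return files/second between consecutive samples.

    Each sample is (elapsed_seconds, cumulative_files_imported).
    """
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        rates.append((n1 - n0) / (t1 - t0))
    return rates


if __name__ == "__main__":
    # Invented example values: a job that starts fast and then slows down.
    samples = [(0.0, 0), (100.0, 500), (200.0, 600)]
    print(rates_per_interval(samples))  # [5.0, 1.0]
```

A steadily falling rate within one job, followed by a jump back up on a freshly created job, would support the pattern described above.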


  • What would be the impact if we duplicate the import job to a second or third, and do parallel imports? Is M-Files ready for concurrent threads in this manner and would that increase import speed?


    If there is a logical way to split the importing into several jobs, that might speed things up, since I think each import job runs in its own thread. However, I'm not that familiar with the technical details of importing, so you could check this with technical support if you have a maintenance subscription. Support may also have an explanation for why the importing slows down after a while.
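
If the split is done at the folder level, the converted files could be spread over one source folder per import job before the jobs run. The following is a minimal sketch, assuming each PDF has an XML file with the same base name in the same folder; the function name and folder names are hypothetical, and whether parallel jobs actually import faster is something to verify with support.

```python
from pathlib import Path
import shutil


def split_into_job_folders(source: Path, job_folders: list[Path]) -> dict[Path, int]:
    """Move PDF/XML pairs from `source` into the job folders, round-robin.

    Returns the number of pairs moved into each folder. PDFs without a
    matching XML file are left in `source` untouched.
    """
    for folder in job_folders:
        folder.mkdir(parents=True, exist_ok=True)
    counts = {folder: 0 for folder in job_folders}

    moved = 0
    for pdf in sorted(source.glob("*.pdf")):
        xml = pdf.with_suffix(".xml")
        if not xml.exists():
            continue  # incomplete pair: skip it rather than import without metadata
        target = job_folders[moved % len(job_folders)]
        shutil.move(str(xml), str(target / xml.name))
        shutil.move(str(pdf), str(target / pdf.name))
        counts[target] += 1
        moved += 1
    return counts
```

Each import job would then point at its own folder, for example `split_into_job_folders(Path("converted"), [Path("job_1"), Path("job_2"), Path("job_3")])`.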
  • Former Member
    I contacted support and will post any findings.
  • Former Member
    Have you tried using the non-GUI import?