Slow XML import

Former Member over 7 years ago

Hia,

We have a conversion mechanism running that converts legacy PostScript output to generic PDF with formatted XML (1 PDF = 1 XML, same naming).
On a daily basis about 5000 files are generated and imported with an external file source job, using the XML and XPaths for metadata.
It's all working nicely, however, the import into M-Files is so slo...o..w.....
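For readers unfamiliar with this kind of setup, here is a minimal sketch of how metadata might be read from a sidecar XML file with XPath-style expressions. The element names, the property names and the read_metadata helper are all hypothetical, since the thread does not show the real XML layout. This is plain Python for illustration, not how M-Files evaluates XPaths internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical sidecar XML; the real layout is not shown in this thread.
SAMPLE = """<document>
  <title>Invoice 1001</title>
  <customer><name>Example Ltd</name></customer>
</document>"""

# Hypothetical mapping of property names to XPath expressions.
XPATHS = {"Title": "./title", "Customer": "./customer/name"}


def read_metadata(xml_text, xpaths):
    """Return {property: text} for each XPath, or None where nothing matches."""
    root = ET.fromstring(xml_text)
    values = {}
    for prop, path in xpaths.items():
        node = root.find(path)
        values[prop] = node.text if node is not None else None
    return values


metadata = read_metadata(SAMPLE, XPATHS)
```

Running a check like this over the XML files before the import would flag files where an expected element is missing.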

Does anyone have experience with speeding up this process, for example by increasing the number of files per import (500 instead of 100)?
The (virtual) hardware shouldn't be the issue, nor the SQL vault database.

We have a backlog of let's say a few million files, so any speed increase would be great.

PS: best wishes all!

Joonas Linkola over 7 years ago

Are you using text recognition (OCR) in the file source job? OCR processing uses a large share of the available CPU resources during import, which can slow it down. If you have OCR enabled, try running the import job with it disabled and see what effect that has. If the source PDFs are already text searchable, OCR processing is unnecessary anyway.

If you are not using OCR, could you please describe the current import speed? How long does it take to import, say, 100 documents?

Former Member over 7 years ago

Hia,

OCR is not used in this case and can be ignored (as a test: importing flat TIF images + XML files has the same performance).
Importing 100 documents takes between 100 and 150 seconds, so basically 1 to 1.5 seconds per file, or about 100,000 per day when running full time.

That is a lot of course, but when you speak of millions, it means we can import about 3 million files per month, meaning we need 4-5 months of full-time importing.
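As a rough check on these figures, the arithmetic can be sketched as below. The function names are made up for illustration, and the 3,000,000-file backlog in the usage lines is an assumed round number, not a figure from this thread. At a steady 1 to 1.5 seconds per file, a full day works out to 57,600 to 86,400 files, a bit below the rounded 100,000.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(seconds_per_file):
    """Files imported in 24 hours of nonstop importing at a fixed rate."""
    return int(SECONDS_PER_DAY / seconds_per_file)


def days_for_backlog(backlog_files, seconds_per_file):
    """Days of nonstop importing needed to clear a backlog at a fixed rate."""
    return backlog_files * seconds_per_file / SECONDS_PER_DAY


print(files_per_day(1.0))  # 86400
print(files_per_day(1.5))  # 57600

# Assumed backlog size, for illustration only.
assumed_backlog = 3_000_000
print(round(days_for_backlog(assumed_backlog, 1.0), 1))
print(round(days_for_backlog(assumed_backlog, 1.5), 1))
```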

Former Member over 7 years ago

Are you using the import tool?

The GUI has a memory leak that causes it to slow down to a crawl. If you can use the CLI version, it is much faster.

Joonas Linkola over 7 years ago

"Importing 100 documents takes between 100 and 150 seconds, so basically 1 to 1.5 seconds per file, or about 100,000 per day when running full time."

That sounds about normal. We usually estimate import speed at ~1 document/second if we don't have details about the environment. The actual speed depends on many variables, such as the hardware specs, the network, the complexity of the metadata structure, and any event handlers and other scripts that need to run when the document is added to the vault.

Former Member over 7 years ago

"That sounds about normal. We usually estimate import speed at ~1 document/second if we don't have details about the environment. The actual speed depends on many variables, such as the hardware specs, the network, the complexity of the metadata structure, and any event handlers and other scripts that need to run when the document is added to the vault."

The environment is no issue. All data is local to the server and there's nothing complex about plain text property import. Server CPU isn't going over 10% and memory is barely dented by the process.

What would be the impact if we duplicate the import job into a second or third job and run the imports in parallel? Is M-Files ready for concurrent threads in this manner, and would that increase import speed?

Former Member over 7 years ago

Something strange.

I manually duplicated the import job to a job_2. Everything is exactly the same, except for the name (_2) and the source folder (_2), which is also local, just like for the first job.

So now I have the jobs: Import_Converted_Files & Import_Converted_Files_2

The second job (job_2) imports about 5-10 times faster than the first job, with exactly the same files, metadata, etc.
How??

Is M-Files doing something funny with the external source name property? It feels like there is some form of ever-growing registry causing the job to become increasingly slower.

Any clue? Or is this a glitch in the software?

Former Member over 7 years ago

And the same again: the 2nd job is slowing down.
Creating a new, identical third job (job_3) imports much faster again, about 5x faster.

Joonas Linkola over 7 years ago

"What would be the impact if we duplicate the import job into a second or third job and run the imports in parallel? Is M-Files ready for concurrent threads in this manner, and would that increase import speed?"

If there is a logical way to split the import into several jobs, that might speed things up, since I think each import job runs in its own thread. However, I'm not that familiar with the technical details of importing, so you could check this with technical support if you have a maintenance subscription. Support may also have an explanation for why importing slows down after a while.
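If the only split available is an arbitrary one, a small script could spread the PDF + XML pairs over one source folder per import job. The following is a minimal sketch under that assumption: the split_pairs helper and the folder names in the usage note are hypothetical, and whether parallel jobs actually import faster is not confirmed in this thread.

```python
import shutil
from pathlib import Path


def split_pairs(source_dir, job_dirs):
    """Move each PDF and its same-named XML into the job folders, round-robin.

    Returns a dict mapping each job folder to the number of pairs it received.
    PDFs without a matching XML file are left in the source folder.
    """
    source = Path(source_dir)
    targets = [Path(d) for d in job_dirs]
    for target in targets:
        target.mkdir(parents=True, exist_ok=True)

    counts = {str(target): 0 for target in targets}
    # Only handle complete pairs so a PDF never arrives without its metadata.
    complete = [
        pdf for pdf in sorted(source.glob("*.pdf"))
        if pdf.with_suffix(".xml").exists()
    ]
    for index, pdf in enumerate(complete):
        target = targets[index % len(targets)]
        xml = pdf.with_suffix(".xml")
        shutil.move(str(xml), str(target / xml.name))
        shutil.move(str(pdf), str(target / pdf.name))
        counts[str(target)] += 1
    return counts
```

Usage would look something like split_pairs("converted", ["job_1", "job_2", "job_3"]), with each import job pointed at one of the target folders.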

Former Member over 7 years ago

I contacted support and will post any findings.

Former Member over 7 years ago

Have you tried using the non-GUI import?