Need to speed up metadata extraction

I have a project where the client wants to extract the metadata from one of their M-Files classes. The vault currently holds 1.96 million files, and all I am trying to pull is the metadata for each file. My tool was running much faster a few days ago, but yesterday it slowed dramatically: the estimate for pulling all the metadata went from about 3 days to 16 days. So my questions are: what could cause this change in performance, and what best practices (and perhaps COM functions) would make this faster? The client will be doing this multiple times a year. I understand why the file download takes some time, but I am not sure whether there are better classes or objects for downloading metadata in bulk.

My metadata extraction uses the following code:

public string GetMetadataAsJson(ObjectVersion objectVersion, Vault vault, string filePath, string className)
{
    var metadataDict = new Dictionary<string, object>
    {
        ["MFilesFolderPath"] = filePath,
        ["ClassName"] = className
    };

    PropertyValues properties = vault.ObjectPropertyOperations.GetProperties(objectVersion.ObjVer);

    var propertyDefsCache = new Dictionary<int, string>();

    foreach (PropertyValue propertyValue in properties)
    {
        if (!propertyDefsCache.TryGetValue(propertyValue.PropertyDef, out string propertyDefName))
        {
            PropertyDef propertyDef = vault.PropertyDefOperations.GetPropertyDef(propertyValue.PropertyDef);
            propertyDefName = propertyDef.Name;
            propertyDefsCache[propertyValue.PropertyDef] = propertyDefName;
        }

        metadataDict[propertyDefName] = propertyValue.Value.DisplayValue;
    }

    return JsonConvert.SerializeObject(metadataDict, Formatting.Indented);
}

  • A couple of things:

    • I definitely agree with  : convert this to retrieve data in batches. For reading data, batches in the hundreds should work. It won't make it hundreds of times faster, but it will make it faster.
    • Your current code creates and populates propertyDefsCache from scratch for each object, because the dictionary is declared inside the method. That means tens of millions (at least, possibly hundreds of millions) of additional vault calls just to look up property definition names. My suggestion would be to retrieve all of the property definitions once, up front, populate a static dictionary with them, and then use that cache inside your GetMetadataAsJson method.
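The two suggestions above could be combined roughly as follows. This is a minimal sketch, not a tested implementation: it assumes the COM API calls `PropertyDefOperations.GetPropertyDefs` and `ObjectPropertyOperations.GetPropertiesOfMultipleObjects` behave as described in the comments, and it assumes the batch results come back in the same order as the requested `ObjVers`. The class name `MetadataBatchExporter` and the tuple-based batch parameter are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using MFilesAPI;
using Newtonsoft.Json;

public static class MetadataBatchExporter
{
    // Property definition ID -> name, filled once per run instead of once per object.
    private static Dictionary<int, string> propertyDefNames;

    // Call once before processing any objects.
    public static void LoadPropertyDefNames(Vault vault)
    {
        propertyDefNames = new Dictionary<int, string>();
        foreach (PropertyDef propertyDef in vault.PropertyDefOperations.GetPropertyDefs())
        {
            propertyDefNames[propertyDef.ID] = propertyDef.Name;
        }
    }

    // Fetches the properties of a whole batch (a few hundred objects) in one vault call.
    public static List<string> GetMetadataAsJson(
        Vault vault,
        IList<(ObjectVersion Version, string FilePath)> batch,
        string className)
    {
        var objVers = new ObjVers();
        foreach (var item in batch)
        {
            objVers.Add(-1, item.Version.ObjVer); // -1 appends to the collection
        }

        PropertyValuesOfMultipleObjects allProperties =
            vault.ObjectPropertyOperations.GetPropertiesOfMultipleObjects(objVers);

        var results = new List<string>(batch.Count);

        // COM collections are 1-based. Assumption: result i corresponds to request i.
        for (int i = 1; i <= allProperties.Count; i++)
        {
            var metadataDict = new Dictionary<string, object>
            {
                ["MFilesFolderPath"] = batch[i - 1].FilePath,
                ["ClassName"] = className
            };

            foreach (PropertyValue propertyValue in allProperties[i])
            {
                // Fall back to the numeric ID if a definition is missing from the cache.
                string name = propertyDefNames.TryGetValue(propertyValue.PropertyDef, out string cached)
                    ? cached
                    : propertyValue.PropertyDef.ToString();
                metadataDict[name] = propertyValue.Value.DisplayValue;
            }

            results.Add(JsonConvert.SerializeObject(metadataDict, Formatting.Indented));
        }

        return results;
    }
}
```

With this shape, the outer loop would call `LoadPropertyDefNames` once and then pass slices of a few hundred object versions at a time to `GetMetadataAsJson`, so the number of vault round trips drops from several per object to roughly one per batch.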
  • As a newbie, I appreciate all the comments and insight. I am also downloading each file in the same manner. Is there a batch download that would correspond to the metadata batch? I already retrieve the file names in batches in my outer loop, which contains this code, and then iterate through that list, but perhaps I should just do everything in batches.

  • The actual file download will need to be one-by-one, unfortunately. You may be able to build something which creates a queue of items to download, though, and download multiple in parallel...? 
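One way to sketch the "queue plus parallel download" idea is a small helper that runs a download action over a list of items with a bounded number of workers and collects the items that failed so they can be retried. This is only an illustration: `DownloadQueue`, `RunInParallel`, and the `download` delegate are hypothetical names, and the delegate is where the real per-file download call would go. Whether a single vault connection can safely be shared across threads is an assumption you would need to verify; using one connection per worker is the more cautious option.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class DownloadQueue
{
    // Runs 'download' for every item, at most 'maxParallel' at a time.
    // Returns the items whose download threw, so the caller can retry or log them.
    public static List<T> RunInParallel<T>(IEnumerable<T> items, Action<T> download, int maxParallel)
    {
        var failures = new ConcurrentBag<T>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(items, options, item =>
        {
            try
            {
                download(item);
            }
            catch (Exception)
            {
                // Keep going; one failed file should not abort a multi-day export.
                failures.Add(item);
            }
        });

        return new List<T>(failures);
    }
}
```

A caller would pass the list of files from the current batch and a lambda that downloads one file to disk. Keeping `maxParallel` small (for example 4) is a reasonable starting point, since the server, not the client, is likely to be the bottleneck.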

  • How about this: make a view that contains your files. Remove the view limit, remove the timeout (preferably on the server, otherwise on the client), then right-click and download everything in that view. It shouldn't take longer than 16 days, and you would have both the files and the metadata.

    Or you can segment the data: for example, make multiple views based on the creation date of the files and run the export operation on each one.
