DD – We had issues submitting our previous big dataset (the Industry Benchmark), which contained 70k molecules adding up to about 500k “entities”. Due to some technical debt in metadata handling, a large data blob needs to pass through several interfaces, and some of those interfaces have hit size limits. So until this process gets improved, it may make sense to split the dataset into separate blobs and stitch them together later.
PE – There will be conceptual divisions in the dataset, and we can add technical divisions as needed. Some datasets will only be ~10k conformations, but others will be based on Enamine and have >1 million conformations.
DD – For now the safe thing to do would be to stay under 250k entities per dataset.
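(For reference, a minimal sketch of how a large dataset could be split to stay under that limit, assuming an “entity” is one conformer under one compute spec; the names `make_chunks`, `molecules`, and `n_conformers` are hypothetical and not part of any existing submission tooling:)

```python
# Hypothetical sketch: greedily group molecules into submission chunks so that
# each chunk stays under the per-dataset entity ceiling discussed above.
from typing import Callable, Iterable, List

ENTITY_LIMIT = 250_000  # safe per-dataset ceiling


def make_chunks(
    molecules: Iterable,                 # any iterable of molecule objects
    n_conformers: Callable[[object], int],  # molecule -> number of conformers
    n_specs: int = 1,                    # number of compute specs applied to the dataset
) -> List[list]:
    """Group molecules into chunks whose total entity count stays under ENTITY_LIMIT."""
    chunks, current, current_entities = [], [], 0
    for mol in molecules:
        entities = n_conformers(mol) * n_specs
        if current and current_entities + entities > ENTITY_LIMIT:
            chunks.append(current)
            current, current_entities = [], 0
        current.append(mol)
        current_entities += entities
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be submitted as its own dataset and stitched back together downstream once the size limits are lifted.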
PE – Next will be DrugBank, likely on the same order of magnitude in size. Then there’s DS370k, which will also be large.
DD – SBoothroyd had suggested making a “test set” to test out the submission process without having a large resource requirement.
PE – My current dipeptide submission will be under a thousand molecules, just a few confs each.
DD – Sounds good. It’s just important to stay under 250k entities, and remember that different compute specs will multiply the number of entities.
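(A rough back-of-the-envelope check of how compute specs multiply the entity count; the numbers here are illustrative only:)

```python
# entities = molecules x conformers per molecule x compute specs
molecules = 1_000
conformers_per_molecule = 5
compute_specs = 2        # e.g. two different levels of theory

entities = molecules * conformers_per_molecule * compute_specs
print(entities)          # 10000 entities -> comfortably under the 250k limit
assert entities < 250_000
```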
PE – This submission should only be a few tens of thousands of entities.
PE – Most of my datasets moving forward will be in the ~10k range, some in the hundreds, and one or two in the millions. But I’m happy to split those up for technical reasons. What’s the timeline for the fix?
DD – The earliest would be deploying the fixes around the end of 2021, but the timeline is fairly uncertain, and this work is happening against the backdrop of a more significant refactor.
PE – Happy to start with smaller datasets and increase dataset size over time, so this may line up with that schedule anyway.