Discussion topics

Item

Notes

Notes from yesterday’s VU Collaboration meeting

unyt is virtually as fast as NumPy for pure array math; Pint is 10-100x slower across the board
Need to better clarify each group’s MUST, MAY, MUST NOT
- Serialization
  - Compression
  - Performance
- Raw array performance
- Import time
- Dependency load
- Interfaces with existing code(s)

Previous experiences

SB – Used Pint and simtk. Importing can be a bit slow. Pint was unmaintained for a bit, but it’s maintained again now.
- Downside is that it uses dynamic classes, so each registry has a new version of every class. So if you make two basically-identical registries, then things won’t be equal between them.
- Pint leaves serialization to you. It does to_string and from_string seamlessly. Lists and arrays are still hard. Not sure about time-performance.
- For defining your own units, it lets you add things like kilojoules per mole, and make them not get reduced by default
- Should be NEP-18 compliant (supports all numpy matrix operations)
  - https://github.com/hgrecco/pint/pull/905
- Building in support for dask, xarray, pandas, etc
- If you have two unit registries, you can’t convert between them as Python objects. Instead, you need to go through a string as an intermediary.
  - DD – Are there any cases where you’re dealing with more than one registry in a process?
  - SB – Not two Pint unit systems, but going to simtk requires going through a string.
- Maintainer seems pretty good and active.
- JW – kJ/mol units/dimension issue.
  - (General) – This isn’t a goalpost that we should set up, since it’s so badly mangled.
Nice things about simtk?
- Already pulled in with our dependencies
- Fast import time
- Allows use of multiple registries (so, like, one for AMBER, another for GROMACS)
JW – One thing I want to avoid is ambiguity in the serialized representation: it should be clear whether a value should be unpacked into a Quantity, versus a string that COULD be turned into a Quantity but should actually stay as a string
- SB – It’s better if you have that be position dependent – ie, the deserialization routine/class structure should indicate which fields MUST be deserialized into a Quantity, and everything else should be left in the form that it is in the dict.
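A minimal sketch of the position-dependent approach SB describes, using only the standard library. The `Quantity` class, the `QUANTITY_FIELDS` schema, and the field names are all hypothetical stand-ins, not the API of Pint, unyt, or simtk; the point is only that the schema, not the string content, decides what gets unpacked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Hypothetical stand-in for a real quantity class."""
    magnitude: float
    unit: str


# Hypothetical schema: only these fields are ever unpacked into a Quantity.
QUANTITY_FIELDS = {"bond_length", "temperature"}


def parse_quantity(text: str) -> Quantity:
    """Parse a '<number> <unit>' string; a real version would defer to the units library."""
    magnitude, unit = text.split(maxsplit=1)
    return Quantity(float(magnitude), unit)


def deserialize(raw: dict) -> dict:
    """Unpack only schema-declared fields; leave every other value untouched."""
    return {
        key: parse_quantity(value) if key in QUANTITY_FIELDS else value
        for key, value in raw.items()
    }


raw = {"bond_length": "1.09 angstrom", "comment": "1.09 angstrom", "name": "methane"}
record = deserialize(raw)
```

Here `bond_length` and `comment` hold the same string, but only `bond_length` becomes a `Quantity`, because the schema says so.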
MT – Nothing unique to say about simtk units. Same with Pint. I’ll talk about unyt
- At its core, base class is a thin wrapper around numpy arrays. Started as a submodule of “yt”, a big astrophysics package. Got spun out two years ago.
- Has the most efficient raw array performance – basically equivalent to numpy. About two orders of magnitude faster than pint. Though I’m not sure if this is important for us
- If we ever encounter a situation where we need to do million-element multiplication, then we can strip+reattach units before and after
- No concept of unit registry. Can’t define custom units
- Size is pretty small (few hundred kb, very fast imports once numpy’s already in)
- Maintainer status: Under umbrella of larger project. Though core maintainer recently moved to a different position, and seemed somewhat rude to certain new contributors.
- Not super familiar with actual usage, not nearly to the same level as SB+Pint.
- Fairly baked into other VU software. Though the only place where units are required in objects is in a Cassandra wrapper program that uses unyt. It’d be hard to get them over to Pint.
- SB - note that Pint has implemented NEP-18 (__array_function__)
  - Seems that unyt does not (yet?):
    https://github.com/yt-project/unyt/issues/139
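The "strip + reattach" idea MT mentions for very large multiplications could look roughly like the sketch below. `UnitArray` is a hypothetical stand-in for a unit-aware array, plain lists stand in for NumPy arrays, and the unit string is simply concatenated rather than reduced.

```python
from dataclasses import dataclass


@dataclass
class UnitArray:
    """Hypothetical stand-in for a unit-aware array: bare values plus one unit string."""
    values: list
    unit: str


def multiply_stripped(a: UnitArray, b: UnitArray) -> UnitArray:
    # Strip: pull the bare numbers out once, before the bulk operation.
    raw_a, raw_b = a.values, b.values
    # Bulk math on plain numbers, with no per-element unit bookkeeping.
    product = [x * y for x, y in zip(raw_a, raw_b)]
    # Reattach: combine the unit labels once, after the bulk operation.
    return UnitArray(product, f"{a.unit}*{b.unit}")


lengths = UnitArray([1.0, 2.0, 3.0], "nanometer")
forces = UnitArray([4.0, 5.0, 6.0], "kilojoule/mole/nanometer")
work = multiply_stripped(lengths, forces)
```

The unit handling happens twice in total, regardless of how many elements the arrays hold.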
How do we want to handle quantities in equality operations? Should it be equivalence or identity?
- (General) – Not necessary to handle this now

Goalpost-defining decisions

Consistent vs heterogeneous internal units?
- What do we expect serialized representation to look like? Implicit units vs. explicit?
- JW – Implicit will lose track of original units – Goes back to the question of whether a serialization roundtrip should yield an “exact” object or an “equivalent” object.
- (General) – Should objects like Molecule, ForceField, and System have implicit internal units that everything gets converted to, or support heterogeneous units?
  - Molecule doesn’t need to support heterogeneous units
  - FF should probably support heterogeneous internal units
  - System is undecided
  - MolSSI stack (probably) casts to one unit for each field
  - Evaluator stack casts to internal units
- (General) – Standard isn’t “infectious” – If one object enforces consistent internal units, then other objects can make a different choice, and vice versa
- (General) – Expectation of human readability affects how each object should behave. Where a serialized representation is expected to bear some relationship to a “primary source”, the object may need to support heterogeneous internal units so that it can be compared against a variety of primary sources.
- Each object can decide for itself whether its internal representation enforces consistent units or allows heterogeneity.
  - This means that deserialization performance isn’t super important. If it’s super painful, then an object could use implicit units and get deserialization much more cheaply.
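To make the “exact” vs “equivalent” roundtrip distinction concrete, here is a small standard-library sketch. The conversion table, the choice of nanometers as the internal unit, and the JSON layout are all assumptions made for illustration.

```python
import json
import math

# Hypothetical conversion factors to an assumed internal length unit (nanometer).
TO_NM = {"nanometer": 1.0, "angstrom": 0.1}


def dump_implicit(magnitude, unit):
    """Convert to the internal unit and store a bare number; the original unit is lost."""
    return json.dumps({"length": magnitude * TO_NM[unit]})


def dump_explicit(magnitude, unit):
    """Store the magnitude together with its original unit."""
    return json.dumps({"length": {"magnitude": magnitude, "unit": unit}})


def load_implicit(text):
    """The unit is not in the payload; it is implied by convention."""
    return json.loads(text)["length"], "nanometer"


def load_explicit(text):
    field = json.loads(text)["length"]
    return field["magnitude"], field["unit"]


implicit = load_implicit(dump_implicit(1.5, "angstrom"))
explicit = load_explicit(dump_explicit(1.5, "angstrom"))
```

The explicit roundtrip returns the value in angstroms, exactly as it went in. The implicit roundtrip returns an equivalent length in nanometers, with no record that the source was in angstroms.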
Array of structures vs structure-wrapped array?
- JW – In the worst case, we’ll need to support things equivalent to “array of structures” (eg, saving a million 3D molecules). Is it worth trying to distinguish ahead of time how frequently each case will come up?
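A rough illustration of the two layouts, using only the standard library. The dict layout and the `magnitude`/`unit` keys are hypothetical; the sketch only shows that one form repeats the unit label per element while the other stores it once.

```python
import json


def array_of_structures(coords, unit):
    """Every element carries its own unit label."""
    return [{"magnitude": xyz, "unit": unit} for xyz in coords]


def structure_wrapped_array(coords, unit):
    """A single unit label wraps the whole array."""
    return {"magnitude": coords, "unit": unit}


# 1000 made-up 3D coordinates.
coords = [[0.0, 0.0, float(i)] for i in range(1000)]

aos_text = json.dumps(array_of_structures(coords, "angstrom"))
soa_text = json.dumps(structure_wrapped_array(coords, "angstrom"))
```

With these inputs the array-of-structures payload is the longer of the two, and deserializing it means handling a unit for each of the 1000 elements instead of one for the whole array.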
Pydantic support? JSON-schema/friendliness
- Neither is natively pydantic-compatible
Closer to MolSSI/Pint or VU/Unyt?
- (General) – We’re closer to MolSSI
- Due to UnitRegistry differences, what operations will be allowed between OFF stack and MolSSI stack? How concerned should we be about CODATA differences?
- https://pint.readthedocs.io/en/stable/tutorial.html?highlight=global#using-pint-in-your-projects
- Are CODATA differences important?
  - MT – There’s already large uncertainties in chemical measurements. Are CODATA version differences actually significant on this scale?
  - JW – SB uses highly precise physical chemistry datasets that have eg. density defined out to lots of sig figs
- JW – Can CODATA differences be encoded in the serialized representation?
- DD – Organizationally, we should define our own registry, since that will reduce friction and make our development not depend on the actions of other groups.
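One possible shape for JW's question about encoding CODATA in the serialized representation, purely as a sketch: tag each payload with the CODATA release it was written against and check the tag on load. The `codata` key, the `REGISTRY_CODATA` constant, and the choice to raise on mismatch are all assumptions, not something the group decided.

```python
import json

# Hypothetical: the CODATA release our own registry is built against.
REGISTRY_CODATA = "2018"


def dump_quantity(magnitude, unit, codata=REGISTRY_CODATA):
    """Serialize a quantity along with the CODATA release it assumes."""
    return json.dumps({"magnitude": magnitude, "unit": unit, "codata": codata})


def load_quantity(text, expected_codata=REGISTRY_CODATA):
    """Refuse to load a payload written against a different CODATA release."""
    payload = json.loads(text)
    if payload["codata"] != expected_codata:
        raise ValueError(
            f"payload uses CODATA {payload['codata']}, "
            f"registry uses CODATA {expected_codata}"
        )
    return payload["magnitude"], payload["unit"]


same = load_quantity(dump_quantity(1.0, "hartree"))

try:
    load_quantity(dump_quantity(1.0, "hartree", codata="2014"))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

A softer policy (warn, or convert between releases) would fit the same payload layout.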

Setting up goalposts

Meet in 30 minutes (9:15 Pacific, 10:15 AZ, 11:15 central) to report on the following:

Maintainer activity/project status – JW
Serialization speed/fidelity/memory efficiency – JW
1. SB – This will only be significant if we do the “array of structures” approach. A “structure of array” approach won’t incur this cost.
2. unyt has a story on serialization: https://unyt.readthedocs.io/en/stable/usage.html?highlight=pint#writing-data-with-units-to-disk
Interoperability with other software stacks – DD
Unit systems/registries (eg, adding support for hartrees, kcal/mol) – DD
1. unyt is more astrophysics-oriented, missing units like hartree
2. pint appears more general, has e.g. hartree and rydberg_constant included
3. Both packages allow definition of custom units based on other units already defined
4. Both packages support registries; unyt has a concept of a global registry that it uses by default, pint does not
  1. SB – Pint may actually load a default registry. Not positive, should test.
  2. DD – Different instances of registries aren’t seen as equivalent
  3. SB – May be the same in unyt.
  4. JW – If we use pint, do we need to use the QCEl registry? Otherwise would equality comparisons fail?
  5. SB – Depends on whether QCEl carries things around as Quantity classes, and I’m not sure that they do. Also units named the same thing may not actually be the same depending on which CODATA is used
  6. DD – (did example showing that Pint units from different registries can’t be compared, even though Quantities can, and that there’s a default pint registry that quantities can be created from)
Dependency size/complexity (eg. number of deps and pins) – MT
1. unyt pulls in SymPy, Pint pulls in virtually nothing new
  1. https://gist.github.com/mattwthompson/8f558a9ecc04f20d6570f3c2de34a820
2. Unyt = small, but sympy dep is 10MB.
3. Pint = 100-200 kB, no deps
Performance during array operations – MT
1. Can more or less trust the unyt JOSS paper. Its raw performance is nearly identical to NumPy, Pint is significantly slower.
  1. https://joss.theoj.org/papers/10.21105/joss.00809
  2. DD – Doesn’t seem to be a significant difference for our use cases – This computational cost won’t be a significant factor in our overall runtime
  3. MT – Agree, this doesn’t seem to be a critical decision point.
~~Dependency load time – MT~~
1. SB – This is probably easy to optimize for pint, since we’ll need to ship our own units file to match with our chosen CODATA
