There's no lockfile or anything with this approach right? So in a year or two all of these scripts will be broken because people didn't pin their dependencies?
> So in a year or two all of these scripts will be broken because people didn't pin their dependencies?
People act like this happens all the time but in practice I haven't seen evidence that it's a serious problem. The Python ecosystem is not the JavaScript ecosystem.
I think it's because you don't maintain much python code, or use many third party libraries.
An easy way to prove that this is the norm is to take some existing code you have now, update its dependencies to their latest versions, and watch everything break. You don't see the problem because those dependencies use pinned/very restricted versions, which hides the frequency of the problem from you. You'll also see that, in their issue trackers, they've closed all sorts of version-related bugs.
Are you sure you’re reading what I wrote fully? Getting pip, or any of the other installers, to ignore all version requirements, including those listed by the dependencies themselves, required modifying its source, last I tried.
I’ve had to modify code this week due to changes in some popular libraries. Numpy 2.0 is a very recent example that broke most code that used numpy: they changed the C side (full interpreter crashes with trimesh) and removed/moved common functions, like array.ptp(). Scipy moved a bunch of stuff lately, and fully removed some image-related things.
If you think python libraries are somehow stable over time, you just don’t use many.
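To make the ptp() point concrete, here is a minimal sketch of the migration (the array values are arbitrary): NumPy 2.0 removed the ndarray method, but the module-level function remains.

    import numpy as np

    arr = np.array([3, 1, 7])

    # NumPy 1.x: peak-to-peak was available as a method (removed in 2.0)
    # span = arr.ptp()

    # NumPy 2.0+: use the module-level function instead
    span = np.ptp(arr)  # 6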
... So if the installer isn't going to ignore the version requirements, and thereby install an unsupported package that causes a breakage, then there isn't a problem with "scripts being broken because people didn't pin their dependencies". The packages listed in the PEP 723 metadata get installed by an installer, which resolves the listed (unpinned) dependencies to concrete ones (including transitive dependencies), following rules specified by the packages.
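For reference, the PEP 723 inline metadata in question looks roughly like this (the dependencies listed here are just placeholders):

    # /// script
    # requires-python = ">=3.11"
    # dependencies = [
    #     "requests",
    #     "rich>=13",
    # ]
    # ///
    import requests  # the installer resolves the block above before running the script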
I thought we were talking about situations in which following those rules still leads to a runtime fault. Which is certainly possible, but in my experience a highly overstated risk. Packages that say they will work with `foolib >= 3` will very often continue to work with foolib 4.0, and the risk that they don't is commonly-in-the-Python-world considered worth it to avoid other problems caused by specifying `foolib >=3, <4` (as described in e.g. https://iscinumpy.dev/post/bound-version-constraints/ ).
The real problem is that there isn't a good way (from the perspective of the intermediate dependency's maintainer) to update the metadata after you find out that a new version of a (further-on) dependency is incompatible. You can really only upload a new patch version (or one with a post-release segment in the version number) and hope that people haven't pinned their dependencies so strictly as to exclude the fix. (Although they shouldn't be doing that unless they also pin transitive dependencies!)
That said, the end user can add constraints to Pip's dependency resolution by just creating a constraints file and specifying it on the command line. (This was suggested as a workaround when Setuptools caused a bunch of legacy dependencies to explode - not really the same situation, though, because that's a build-time dependency for some packages that were only made available as sdists, even pure-Python ones. Ideally everyone would follow modern practice as described at https://pradyunsg.me/blog/2022/12/31/wheels-are-faster-pure-... , but sometimes the maintainers are entirely MIA.)
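A minimal sketch of that workaround (the pin shown is hypothetical, not a recommendation):

    # constraints.txt - applied on top of normal resolution; it never adds
    # packages, it only restricts versions of packages that do get installed
    foolib<4

    # usage:
    #   pip install -c constraints.txt some-application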
> Numpy 2.0 is a very recent example that broke most code that used numpy.
This is fair to note, although I haven't seen anything like a source that would objectively establish the "most" part. The ABI changes in particular are only relevant for packages that were building their own C or Fortran code against Numpy.
> `foolib >= 3` will very often continue to work with foolib 4.0,
Absolute nonsense. It's industry standard that major versions are widely accepted as/reserved for breaking changes. This is why you never see >= in any sane requirements list; you see `foolib == 3.*`. For anything you want to keep working for a reasonable amount of time, you see `foolib == 3.4.*`, because deprecations often still happen within major versions, breaking all code that used those functions.
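In requirements-file terms, the styles being contrasted (foolib is a stand-in name):

    foolib >= 3        # accepts any future major release, including 4.0
    foolib == 3.*      # stays within major version 3
    foolib == 3.4.*    # stays within the 3.4 minor series
    foolib == 3.4.2    # exact pin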
Breaking changes don't break everyone. For many projects, only a small fraction of users are broken at any given time. Firefox is on version 139 (similarly Chrome and other web browsers); how many times have you had to reinstall your plugins and extensions?
For that matter, have you seen any Python unit tests written before the Pytest 8 release that were broken by it? I think even ones that I wrote in the 6.x era would still run.
For that matter, the Python 3.x bytecode changes with every minor revision and things get removed from the standard library following a deprecation schedule, etc., and there's a tendency in the ecosystem to drop support for EOL Python versions, just to not have to think about it - but tons of (non-async) new code would likely work as far back as 3.6. It's not hard to avoid the := operator or the match statement (f-strings are definitely more endemic than that).
Agreed, this is a big problem, and exactly why people pin their dependencies, rather than leaving them wide open: pinning a dependency guarantees continued functionality.
If you don't pin your dependencies, you will get breakage because your dependencies can have breaking changes from version bumps. If your dependencies don't fully pin, then they will get breaking changes from what they rely on. That's why exact version numbers are almost always pinned for anything that gets distributed: it's a frequent problem that you don't want the end user to have to deal with.
Again, you don't see this problem often because you're lucky: you've installed at a time when the dependencies have already resolved all the breakage or, in the more common case, the dependencies were pinned tightly enough that those breaking changes were never an issue. In other words, everyone pinning their dependencies strictly enough is already the solution to the problem. The tighter the restriction, the stronger the guarantee of continued functionality.
1. Great Python support. Piping something from a structured data catalog into Python is trivial, and so is persisting results. With materialization, you never need to recompute something in Python twice if you don’t want to — you can store it in your data catalog forever.
Also, you can request any Python package you want, and even have different Python versions and packages in different workflow steps.
2. Catalog integration. Safely make changes and run experiments in branches.
3. Efficient caching and data re-use. We do a ton of tricks behind the scenes to avoid recomputing or rescanning things that have already been done, and pass data between steps as zero-copy Arrow tables. This means your DAGs run a lot faster because the amount of time spent shuffling bytes around is minimal.
To me they seem like the pythonic version of dbt! Instead of yaml, you write Python code. That, and a lot of on-the-fly computations to generate an optimized workflow plan.
Plenty of stuff in common with dbt's philosophy. One big difference though: dbt does not run your compute or manage your lake. It orchestrates your code and pushes it down to a runtime (e.g. 90% of the time Snowflake).
This IS a runtime.
You import bauplan, write your functions and run them straight in the cloud - you don't need anything more. When you want to make a pipeline you chain the functions together, and the system manages the dependencies, the containerization, the runtime, and gives you git-like abstractions over runs, tables and pipelines.
You technically just need storage (files in a bucket you own and control forever).
We bring you the compute as ephemeral functions, vertically integrated with your S3: table management, containerization, read / write optimizations, permissions etc. are all done by the platform, plus obvious (at least to us ;-)) stuff like preventing you from running a DAG that is syntactically incorrect.
Since we manage your code (compute) and data (lake state through git-for-data), we can also provide full auditing with one-liners: e.g. "which specific run changed this specific table on this data branch?" -> bauplan commit ...
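A rough sketch of what chaining functions into a pipeline could look like from the user's side (the decorators and arguments here are illustrative placeholders, not necessarily the real bauplan SDK):

    import bauplan

    @bauplan.model()                                # hypothetical: marks a DAG step whose output is a table
    @bauplan.python("3.11", pip={"pandas": "2.2"})  # hypothetical: per-step Python version and packages
    def cleaned_orders(raw_orders=bauplan.Model("raw_orders")):
        # receives the parent table (Arrow under the hood) and returns a new one
        return raw_orders.drop_null()

    @bauplan.model()                                # chained step: depends on cleaned_orders above
    def top_customers(orders=bauplan.Model("cleaned_orders")):
        return orders.group_by("customer_id").aggregate([("amount", "sum")])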
I have worked with poetry professionally for about 5 years now and I am not looking back. It is exceptionally good. Dependency resolution speed is not an issue beyond the first run since all that hard to acquire metadata is actually cached in a local index.
And even that first run is not particularly slow - _unless_ you depend on packages that are not available as wheels, which last I checked is not nearly as common nowadays as it was 10 years ago. However it can still happen: for example, if you are working with python 3.8 and you are using the latest version of some fancy library, they may have already stopped building wheels for that version of python. That means the package manager has to fall back to the sdist, and actually run the build scripts to acquire the metadata.
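A quick way to see whether that sdist fallback would apply to a given package (a sketch using the public PyPI JSON API; the package name is a placeholder):

    import json
    import urllib.request

    package = "somefancylib"  # placeholder name
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        release_files = json.load(resp)["urls"]  # files for the latest release

    has_wheel = any(f["packagetype"] == "bdist_wheel" for f in release_files)
    print("wheels available" if has_wheel else "sdist only: metadata requires a build")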
On top of all this, private package feeds (like the one provided by azure devops) sometimes don't provide a metadata API at all, meaning the package manager has to download every single package just to get the metadata.
The important bit of my little wall of text here though is that this is all true for all the other package managers as well. You can't necessarily attribute slow dependency resolution to a solver being written in C++ or pure python, given all of these other compounding factors which are often overlooked.
I will! I'm sure it's faster when the data is available. But when it's not, in the common circumstances described above, network and disk IO are still the same unchanged bottlenecks, for any package manager.
In conversations like this, we are all too quick to project our experiences onto the package managers without sharing the circumstances in which we are using them.
Doesn't this potentially create security problems if process lifetime is very long? Changes to the certificate store on the system will potentially not be picked up?
Yes. And not just a security problem but an operational problem, since if you have to rotate a trust anchor you might have a hard time finding and restarting all such long-lived processes.
IMO SSL_CTX_load_verify_locations() should reload the trust store when it changes, though not more often than once a minute. IMO all TLS libraries should work that way, at least when the trust anchors are stored in external systems that can be re-read (e.g., files, directories, registries, etc.).
Apps can do something like that by re-creating an SSL_CTX when the current one is older than some number of minutes.
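In Python terms, a minimal sketch of that app-level workaround (the refresh interval is arbitrary):

    import ssl
    import time

    _REFRESH_SECONDS = 60  # arbitrary: rebuild at most once a minute
    _ctx = None
    _ctx_created = 0.0

    def get_tls_context() -> ssl.SSLContext:
        """Return a TLS context, rebuilt periodically so trust store changes get picked up."""
        global _ctx, _ctx_created
        now = time.monotonic()
        if _ctx is None or now - _ctx_created > _REFRESH_SECONDS:
            _ctx = ssl.create_default_context()  # re-reads the system trust store
            _ctx_created = now
        return _ctx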
In practice, we are talking about the root certificate store - the thing that organizations update every 10 or 20 years. Every other year there's an update there, because there are a few of them, but your "very long" there uses a strong "very".
Well, it doesn't necessarily have to be 10 or 20 years long, all it takes is for the timeframe to overlap with a certificate being revoked, I guess. Process lifetimes of a few months are definitely not uncommon. Anyway, I can see the tradeoff. There just needs to be a mechanism to disable this performance optimization, or to invalidate the cache (e.g. periodically).