Stavros from TileDB here. Here is a more verbose explanation. Before 2.0, TileDB was already powerful for the main applications we targeted: geospatial and genomics. Its support for both dense and sparse arrays and the way it handles data versioning made it quite unique vs. HDF5 and Zarr. But we noticed that most of the data scientists we were working with had a lot of data beyond genomic variants, LiDAR points and rasters. They had tons of dataframes. And they were using at least two storage engines: TileDB for arrays, and Parquet or a relational database for dataframes. If you are in a large organization, this is a big pain.
In TileDB 2.0 we did a huge refactoring to support something seemingly simple: dimensions in sparse arrays that can have different types and can even be strings. This allowed us to model any dataframe as a sparse array, with TileDB effectively acting as a primary multi-dimensional index. In relational database terms, this means that your data is sorted on disk in an order that strongly favors multi-column slicing, so range searches become fast.
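To make the "dataframe as a sparse array" idea concrete, here is a minimal pure-Python sketch. It is not TileDB's API or its actual on-disk layout; the rows, column names, and the `slice_cells` helper are all made up for illustration, and a plain lexicographic sort stands in for whatever cell order the engine really uses. It only shows why keeping cells sorted by their (string, integer) coordinates turns a multi-column slice into a contiguous range lookup.

```python
from bisect import bisect_left, bisect_right

# Hypothetical dataframe rows: (symbol, day, price).
# The first two columns act as sparse-array dimensions (one string, one
# integer); price acts as an attribute stored in the cell.
rows = [
    ("MSFT", 3, 184.9),
    ("AAPL", 1, 300.4),
    ("GOOG", 2, 1410.0),
    ("AAPL", 3, 297.4),
    ("MSFT", 1, 160.6),
    ("AAPL", 2, 298.0),
]

# Keep the cells sorted by their dimension coordinates, the way a primary
# (clustered) index keeps data sorted on disk.
cells = sorted(rows, key=lambda r: (r[0], r[1]))
coords = [(r[0], r[1]) for r in cells]


def slice_cells(symbol, day_lo, day_hi):
    """Return cells for `symbol` with day_lo <= day <= day_hi."""
    # Two binary searches find the contiguous run of matching cells;
    # no full scan of the data is needed.
    start = bisect_left(coords, (symbol, day_lo))
    stop = bisect_right(coords, (symbol, day_hi))
    return cells[start:stop]


result = slice_cells("AAPL", 2, 3)
print(result)
```

A single lexicographic sort like this mostly favors the leading dimension; a real multi-dimensional index has to do more work to keep slices on the other dimensions fast as well.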
Therefore, what we are telling the community with this release is that you can have dense arrays, sparse arrays, and dataframes in a single embeddable library that integrates with pretty much every data science tool out there, so that data scientists never have to worry about backends, files, updates, or anything other than their scientific analysis. In other words, we believe the future of data science is more science.