This is an amazing quote - thank you. This is also my argument for why I can't use LLMs for writing (proofreading is OK): what I write is not produced as a side-effect of thinking through a problem; writing is how I think through a problem.
Counterpoint (more devil's advocate): I'd argue it's better that an LLM writes something (e.g. the solution or thinking-through of a problem) than nothing at all.
Counterpoint to my own counterpoint, will anyone actually (want to) read it?
Counterpoint to the third degree, to loop it back around: an LLM might, and I'd even argue an LLM is better at reading and ingesting long text (I'm thinking architectural documentation etc.) than humans are. Speaking for myself, I struggle to read attentively through e.g. a document; I quickly lose interest and scan-read or just focus on what I need instead.
I kinda saw this happen in realtime on reddit yesterday. Someone asked for advice on how to deal with a team that was in over their heads shipping slop. The crux of their question was fair, but they used a different LLM to translate their original thoughts from their native language into English. The prompt was "translate this to english for a reddit post" - nothing else.
The LLM added a bunch of extra formatting for emphasis and structure to what was probably originally a bit of a ramble, but obviously human-written. The comments absolutely lambasted the OP for being a hypocrite: complaining about their team using AI, but then seeing little problem with posting an obviously AI-generated question because the OP didn't deem their English skills good enough to ask the question directly.
I'm not going to pass judgement on this scenario, but I did think the entire encounter was a "fun" anecdote in addition to your comments.
I saw the same post and was a bit saddened that all the comments seemed to be focused on the implied hypocrisy of the OP instead of addressing the original concern.
As someone that’s a bit of a fence-sitter on the matter of AI, I feel that using it in the way that OP did is one of the less harmful or intrusive uses.
I see it as worse because you could have put just as much effort in - less even - and gotten a better result just sticking it in a machine translator and pasting that.
Writing is how I think through a problem too, but that also applies to writing and communicating with an AI coding agent. I don't need to write the code per se to do the thinking.
You could write pseudocode as well. But for someone who is familiar with a programming language, it's just faster to use the language itself. And if you're really familiar with the language, you start thinking in it.
I once modeled user journeys on a website using fancy ML models that honored sequence information, i.e., the order of page visits, only to be beaten by a bag-of-words decision tree model (i.e., each page URL becomes a vector dimension, but order is lost), which was supposed to be my baseline.
What I had overlooked was that journeys on that particular website were fairly constrained by design, i.e., if you landed on the home page, did a bunch of stuff, put product X in the cart - there was pretty much one sequence of pages (or in the worst case, a small handful) that you'd traverse for the journey. Which means the bag-of-words (BoW) representation was more or less as expressive as the sequence model; certain pages showing up in the BoW vector corresponded to a single sequence (mostly). But the DT could learn faster with less data.
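For anyone curious, a toy sketch of that baseline idea (sklearn, with made-up journeys and labels - not my actual setup):

# Bag-of-words over page URLs: one dimension per page, visit order thrown away.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

journeys = [
    "/home /search /product-x /cart /checkout",
    "/home /product-x /cart",
    "/home /search /support",
]
converted = [1, 0, 0]  # toy labels: did the journey end in a purchase?

vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(journeys)

clf = DecisionTreeClassifier(max_depth=3).fit(X, converted)
print(clf.predict(vectorizer.transform(["/home /search /product-x /cart /checkout"])))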
I was not aware this existed and it looks cool! I am definitely going to take out some time to explore it further.
I have a couple of questions for now:
(1) I am confused by your last sentence. It seems you're saying embeddings are a substitute for clustering. My understanding is that you usually apply a clustering algorithm over embeddings - good embeddings just ensure that the grouping produced by the clustering algo "makes sense" (a tiny sketch of what I mean follows below).
(2) Have you tried PaCMAP? I found it to produce high-quality results quickly when I tried it. I haven't tried it in a while though - and I vaguely remember it wouldn't install properly on my machine (a Mac) the last time I reached for it. Their group has some new stuff coming out too (on the linked page).
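To make (1) concrete, here is roughly what I have in mind - the embedding model (sentence-transformers) is just a stand-in, and the grouping itself comes from the clustering step:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = ["refund request", "money back please", "app crashes on login", "login error"]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

# good embeddings -> the refund texts and the login texts land in the same clusters
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
print(labels)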
We generally run UMAP on regular semi-structured data like database query results. We automatically feature encode that for dates, bools, low-cardinality vals, etc. If there is text, and the right libs are available, we may also use text embeddings for those columns. (cucat is our GPU port of dirtycat/skrub, and pygraphistry's .featurize() wraps around that).
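For readers following along, a hand-rolled approximation of that featurize-then-UMAP flow (pandas + umap-learn on synthetic data; our actual path goes through skrub/cucat and .featurize()):

import numpy as np
import pandas as pd
import umap

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "created_at": pd.to_datetime("2024-01-01") + pd.to_timedelta(rng.integers(0, 90, n), unit="D"),
    "is_admin":   rng.random(n) < 0.1,
    "country":    rng.choice(["US", "DE", "IN"], n),   # low-cardinality categorical
    "amount":     rng.lognormal(3, 1, n),
})

# crude feature encoding: dates -> epoch seconds, bools -> 0/1, categoricals -> one-hot
X = pd.concat([
    (df["created_at"].astype("int64") // 10**9).to_frame("created_ts"),
    df["is_admin"].astype(int).to_frame(),
    pd.get_dummies(df["country"], prefix="country"),
    df["amount"].to_frame(),
], axis=1)

xy = umap.UMAP(n_components=2).fit_transform(X)   # 2D coordinates for plotting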
My last sentence was about more valuable problems: we are finding it makes sense to go straight to GNNs, LLMs, etc. and embed multidimensional data that way, vs via UMAP dim reductions. We can still use UMAP as a generic hammer to control further dimensionality reductions, but the 'hard' part would be handled by the model. With neural graph layouts, we can potentially even skip UMAP for that too.
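I.e., something like the below - the model output is faked with random vectors here so the snippet stands alone; in practice it would be whatever the GNN / text-embedding model produced:

import numpy as np
import umap

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5_000, 768))   # stand-in for model embeddings, one row per node/record

# UMAP as the "generic hammer": squash the model's space down to 2D for viz
xy = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(embeddings)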
Re: PaCMAP, we have been eyeing several new tools here, but so far haven't felt the need internally to go from UMAP to them. We'd need to see significant improvements, given the quality engineering in UMAP has set the bar high. In theory I can imagine some tools doing better in the future, but the creators haven't made the engineering investment, so internally we'd rather stay with UMAP. We make our API pluggable, so you can pass in results from other tools, but we haven't heard much from others going down that path.
Thank you. Your comment about LLMs to semantically parse diverse data, as a first step, makes sense. In fact come to think of it, in the area of prompt optimization too - such as MIPROv2 [1] - the LLM is used to create initial prompt guesses based on its understanding of data. And I agree that UMAP still works well out of the box and has been pretty much like this since its introduction.
- parity or improvement on perf, for both CPU & GPU mode
- better support for learning (fit->transform) so we can embed billion+ scale data
- expose inferred similarity edges so we can do interactive and human-optimized graph viz, vs overplotted scatterplots
New frontiers:
- alignment tooling is fascinating, as we increasingly want to re-fit->embed over time as our envs change and compare, e.g., day-over-day analysis. This area is not well-defined yet, but it's a common need for anyone operational, so it seems ripe for innovation
- maybe better support for mixing input embeddings. This seems increasingly common in practice, and seems worth examining as special cases
Always happy to pair with folks in getting new plugins into the pygraphistry / graphistry community, so if/when ready, happy to help push a PR & demo through!
> alignment tooling is fascinating, as we increasingly want to re-fit->embed over time as our envs change and compare, e.g., day-over-day analysis. This area is not well-defined yet, but it's a common need for anyone operational, so it seems ripe for innovation
Training the parametric UMAP is a little more expensive, but the new landmark-based updating really does allow you to steadily update with new data and have new clusters appear as required. Happy to chat as always, so reach out if you haven't already looked at this and it seems interesting.
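A minimal sketch of the basic flow, leaving out the landmark machinery (see the docs for that part):

import numpy as np
from umap.parametric_umap import ParametricUMAP   # requires TensorFlow/Keras

X_day1 = np.random.normal(size=(5000, 50))
mapper = ParametricUMAP(n_components=2).fit(X_day1)

# later batches get embedded into the same learned space
X_day2 = np.random.normal(size=(5000, 50))
emb_day2 = mapper.transform(X_day2)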
I haven't; from a quick reading, InfoBax is for when you have an expensive function and want to do limited evaluations.
Timefold works with cheap functions and does many evaluations.
Timefold does this via Constraint Streams, so a function like:
var score = 0;
for (var shiftA : solution.getShifts()) {
    for (var shiftB : solution.getShifts()) {
        // penalize every pair of distinct overlapping shifts for the same employee
        if (shiftA != shiftB && shiftA.getEmployee() == shiftB.getEmployee() && shiftA.overlaps(shiftB)) {
            score -= 1;
        }
    }
}
return score;
usually takes shift * shift evaluations of overlaps; with Constraint Streams, we only check the shifts affected by the change (changing it from O(N^2) to O(1), usually).
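To illustrate the incremental idea outside of Timefold (a toy Python sketch, not Timefold's API): keep shifts bucketed per employee and only re-score the buckets a move touches.

from collections import defaultdict

def overlaps(a, b):
    return a["start"] < b["end"] and b["start"] < a["end"]

class IncrementalOverlapScore:
    """Maintains score = -(number of overlapping same-employee shift pairs)."""

    def __init__(self, shifts):
        self.by_employee = defaultdict(list)
        self.score = 0
        for s in shifts:
            self.assign(s, s["employee"])

    def _delta(self, shift, employee):
        # penalty between `shift` and the other shifts of `employee` only
        return -sum(1 for other in self.by_employee[employee]
                    if other is not shift and overlaps(shift, other))

    def assign(self, shift, employee):
        self.by_employee[employee].append(shift)
        shift["employee"] = employee
        self.score += self._delta(shift, employee)

    def unassign(self, shift):
        employee = shift["employee"]
        self.score -= self._delta(shift, employee)
        self.by_employee[employee].remove(shift)

    def move(self, shift, new_employee):
        # O(size of two buckets) per move, instead of re-scanning all N^2 pairs
        self.unassign(shift)
        self.assign(shift, new_employee)

shifts = [
    {"employee": "ann", "start": 9,  "end": 17},
    {"employee": "ann", "start": 16, "end": 22},   # overlaps the first shift
    {"employee": "bob", "start": 9,  "end": 17},
]
scorer = IncrementalOverlapScore(shifts)
scorer.move(shifts[1], "bob")   # now overlaps bob's shift instead
print(scorer.score)             # -1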
That being said, it might be useful for a move selector. I need to give it a more in depth reading.
Thanks for the example. Yes, true, this is for expensive functions - to be precise, functions that depend on data that is hard to gather, so you interleave computing the value of the function with strategically gathering just as much data as is needed to compute it. The video on their page [1] is quite illustrative: calculate the shortest path on a graph where the edge weights are expensive to obtain. Note how the edge weights they end up obtaining form a narrow band around the shortest path they find.
You don't - the way I use LLMs for explanations is that I keep going back and forth between the LLM explanation and Google search/Wikipedia. And of course asking the LLM to cite sources helps.
This might sound cumbersome but without the LLM I wouldn't have (1) known what to search for, in a way (2) that lets me incrementally build a mental model. So it's a net win for me. The only gap I see is coverage/recall: when asked for different techniques to accomplish something, the LLM might miss some techniques - and what is missed depends upon the specific LLM. My solution here is asking multiple LLMs and going back to Google search.
Love awk. In the early days of my career, I used to write ETL pipelines, and awk helped me condense a lot of stuff into a small number of LOC. I particularly prided myself on writing terse one-liners (some probably undecipherable, ha!), but did occasionally write scripts. Now I mostly reach for Python.
Wouldn't this be an optimization problem - that's to say, something z3 should be able to do [1], [2]?
I was about to suggest probabilistic programming, e.g., PyMC [3], as well, but it looks like you want the optimization to occur autonomously after you've specified the problem - which is different from the program drawing insights from organically accumulated data.
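For example, a toy z3 sketch (the objective and constraints are made up, since I don't know the exact shape of your problem):

from z3 import Ints, Optimize, sat

x, y = Ints("x y")
opt = Optimize()
opt.add(x >= 0, y >= 0)
opt.add(2 * x + 3 * y <= 12)      # e.g. a budget constraint
opt.maximize(5 * x + 4 * y)       # e.g. total value to maximize

if opt.check() == sat:
    print(opt.model())            # best integer assignment of x and y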