A port of Andrej Karpathy's llama2.c to pure Julia. It works with Llama2 and Andrej's tinyllamas models, and a REPL mode allows interaction from the Julia console.
This question keeps me up at night, as I presume it does for my colleagues at Julia Computing. Not that switching to a fully commercial model is necessarily a bad thing, but Julia Labs, like any academic group, always has to worry about where funding will come from.
I think that the problem is that the old model (proprietary development) is pretty unworkable too - but the effects are distributed less starkly. In the old days we had compilers from various vendors at various stages of obsolescence; source code would work under a particular compiler, and the compiler license was associated with that component. There was no budget for buying a new compiler for a particular project, and maintenance gradually got worse and worse, meaning things had to be rebuilt eventually. Good for developers, good for vendors, bad for the bottom line and wider economic development. I wonder what the right size of operation for implementing, innovating and supporting a project like Julia is (not what people would dream of, but the operation that would just about do the job effectively), and I wonder what models could be created to sustain that kind of operation over the right kind of timescale.
See Slide 4 for why it's important to be open source.
TL;DR: Researcher A finds things he wants improved in Magma (closed source) but can't. Researcher B tries to write improved FOSS implementation, but lost his job, likely in part by spending too much time writing said code and not doing other things like writing papers. Researcher A moves on and has a successful academic career. Moral: writing FOSS can cost you your academic job; it's safer to find something else to do.
I have tried for a while now, and I thought I could do both. But... (1) It is difficult on a personal level--for example, last month SageMathCloud got hit by a major DDOS attack 15 minutes before I had to teach a class. I have family, and though I love to work, there are only so many hours in a day. (2) I am at a big old state university, and there are many complicated, byzantine conflict-of-interest and IP rules, which have been a pain to navigate, and our university commercialization office isn't the best. (3) Investors greatly prefer that the person/company they are investing in is not just a side project for the person running it. All that said, the mathematics department at the University of Washington is full of supportive faculty; I'm doing what I'm doing more for the people I want to hire than just for myself.
Why not take a leave of absence to at least get the company started and acquire some funding? After that, you could just have a consulting role with the company.
I did during my 2014-2015 sabbatical. Building a successful company is vastly more difficult and demanding of attention than I could have imagined. Maybe I'm just not as good at doing multiple difficult things at once as other people.
He has a higher risk tolerance than me. I would work on this 80% and do the 20% required stuff as a tenured professor, but what do I know. Maybe I am overly enamored with becoming a tenured professor. If he was already spending his time as a professor working on this, I don't understand the difference. Still, all the best and good luck to him.
The work of a tenured professor is more than 20% time. A normal teaching load is 4-5 classes a year, plus significant committee work, student advising, etc. It takes an enormous amount of time. And it can be awesome, fun, and many of my colleagues love doing it. But it doesn't result in creating a free open source alternative to Mathematica.
I think people underestimate how much work a professor does. Everybody in my department who started a company either did it before starting at the university or they did it while on sabbatical.
If you do the bare minimum work as a tenured professor you're going to get an awful lot of people extremely mad at you at all levels of the academic hierarchy.
If you do the bare minimum work and you're spending a lot of time on your own company, at my university, you will almost certainly get a pink slip. Doesn't matter if you have tenure. We are not allowed to work more than one day a week on such things, and that requires approval, which probably won't happen if we're not publishing, have terrible teaching evaluations, and aren't doing a full share of service and advising.
The title "This guy’s arrogance takes your breath away" is taken directly from Backus's own description of this collection of letters. I've changed the title to make clearer that it is a direct quotation.
Unfortunately, the Library of Congress does not allow scanners without prior approval, so this was the only way I could make my own copies. It did not help that all these letters were written or typed on very thin mimeograph paper.
Thank you for going to the trouble of sharing this amazing material! I'd also like to hear about the historical paper you're working on.
We've taken the "arrogance" quote out of the HN title and replaced it with a neutral description of the letters—perhaps a bit too neutral, given how fabulous the post is. But HN readers can figure that out for themselves, especially once a post gets so high on the front page.
The paper is about, among other things, the history of the array data structure. It's far too early to advertise, but you can see a very early version on my GitHub account. :)
I was surprised to discover recently that prior to 1950 the word 'array' was used exclusively to describe two-dimensional tables of numbers of the kind one might find in a matrix or determinant. But by the advent of FORTRAN I in 1957 and ALGOL 58, 'array' by default referred to a one-dimensional entity, with higher-dimensional ones explicitly qualified as 'n-dimensional arrays'. I was interested in digging through John Backus's papers from this era to see if I could find any clues.
I was able to narrow the window down to 1952-1954, since the FORTRAN preliminary report of 1954 uses the word 'array' casually in the modern one-dimensional sense as interchangeable with 'subscripted variables', the latter being the more common terminology at the time. By comparison, a virtually unknown paper by Rutishauser in 1952 describing the "for" loop did not use the word 'array' at all, only 'subscripted variables'. (Rutishauser was an accomplished mathematician and quite possibly the world's first computational scientist.) A paper by Laning and Zierler at MIT in 1954 describing a formula compiler also used only the term 'subscripted variables'.
Backus's papers also contain evidence that FORTRAN I was clearly written specifically to take advantage of the IBM 704's machine capabilities. Not only was the IBM 704 the world's first commercially successful computer, it also improved on the preceding IBM 701 by providing index registers (3 of them) and floating point instructions that were fast for their era. Backus's papers describe how providing hardware support for indexing and floating point was revolutionary, as programmers up to that time had to write all these instructions in by hand (and for many programs that was pretty much all they did).
So it is clear to me now that the changeover in the implied dimensionality of the word 'array' must be related to how the array developed as a data structure abstracting away indexing operations. By the time IAL (pre-ALGOL) came on the scene in 1958, the idea of indexable homogeneous containers was already well established. But I still haven't found any strong smoking gun evidence introducing the one-dimensional sense of the word. I suspect further digging into the description of the IBM 704 may be necessary. The 704 was not the first to provide index registers, but it may have been the first to call them as such. (The Manchester Mark I computer of 1948 appears to be the first computer with an index register, but it was called a B line. The [patent](https://www.google.com/patents/US3012724) claiming to cover index registers uses the term "control instruction" - no arrays mentioned - but it very cutely describes numbers as residing in known locations or "addresses" in quotes.)
Priority is tricky to nail down e.g. the EDSAC was operational a year before the Mark 1 (which actually was not operational until 1949). Because of the "B" Williams Tube which held the two index registers of the Mark 1, many other manufacturers -- e.g. Burroughs -- later called their index registers "B registers". (Also, I think the Univac I and II were successful commercially, and earlier than the 704.)
I started programming in earnest and for money in 1961 mostly in machine code on a variety of machines, and I'm pretty sure that the notions of single dimensional sequential locations of memory as structures (most often called arrays, but sometimes called lists), predated the idea of index registers. This is because I distinctly remember being first taught how to move through arrays sequentially just by direct address modification -- this was called indexing in the Air Force -- and later being introduced to the index register or registers -- depending on the machine.
My strong guess (from above) is that the array as sequential memory accessed by address modification of some kind was a primary idea in the very first useful computers in the late 40s, and that this idea pre-dated index registers.
It would be well worth your time to look at the instruction set of EDSAC, and try writing some programs.
Yes, there are at least two separate stages of historical development here. The first is when people realized it was useful to repeat the same operation on different data in memory and viewed the collection of data as a variable in its own right. The earliest term I can find for this concept is "subscripted variable" (many examples prior to 1954, e.g. Rutishauser, 1952 in the "for" loop paper; Laning and Zierler, 1954)
but the idea appears to go all the way back to Burks, Goldstine and von Neumann in 1946. Quoting p. 9, paras. 3.3-4:
"In transferring information from the arithmetic organ back into the memory there are two types we must distinguish: Transfers of numbers as such and transfers of numbers which are parts of orders. The first case is quite obvious and needs no further explication. The second case is more subtle and serves to illustrate the generality and simplicity of the system. Consider, by way of illustration, the problem of interpolation in the system. Let us suppose that we have formulated the necessary instructions for performing an interpolation of order n in a sequence of data. The exact location in the memory of the (n + 1) quantities that bracket the desired functional value is, of course, a function of the argument. This argument probably is found as the result of a computation in the machine. We thus need an order which can substitute a number into a given order-in the case of interpolation the location of the argument or the group of arguments that is nearest in our table to the desired value. By means of such an order the results of a computation can be introduced into the instructions governing that or a different computation. This makes it possible for a sequence of instructions to be used with different sets of numbers located in different parts of the memory.
"To summarize, transfers into the memory will be of two sorts:
"Total substitutions, whereby the quantity previously stored is cleared out and replaced by a new number. Partial substitutions in which that part of an order containing a _memory location-number_-we assume the various positions in the memory are enumerated serially by memory location-numbers-is replaced by a new _memory location-number_.
"3.4. It is clear that one must be able to get numbers from any part of the memory at any time. The treatment in the case of orders can, however, be more methodical since one can at least partially arrange the control instructions in a linear sequence. Consequently the control will be so constructed that it will normally proceed from place n in the memory to place (n + 1) for its next instruction."
The language is of course archaic, but the idea described clearly is that of indexing in 3.3 and arrays in 3.4. They use the word "sequence" but arguably this usage is in its ordinary mathematical sense.
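To make the archaic language concrete, here is a toy Python sketch (mine, not historical code) of what the report calls "partial substitution": the running program overwrites the memory location-number inside one of its own orders, which is how a table was traversed before index registers existed. The machine model and names here are invented for illustration.

```python
# A toy machine: memory holds a data table; "program" holds one LOAD
# order whose operand field the loop itself rewrites on every pass.
memory = [0] * 20
memory[10:15] = [3, 1, 4, 1, 5]        # the data table at addresses 10..14

program = [("LOAD", 0)]                # one order; its address field is mutable

def sum_table(start, length):
    program[0] = ("LOAD", start)       # total substitution: install the order
    total = 0
    for _ in range(length):
        op, addr = program[0]          # fetch the order
        total += memory[addr]          # execute it: LOAD from `addr`
        program[0] = (op, addr + 1)    # partial substitution: bump the
                                       # location-number inside the order
    return total

print(sum_table(10, 5))                # 3+1+4+1+5 = 14
```

An index register replaces exactly this rewrite step: instead of mutating the order, the hardware adds the register's contents to the address at execution time.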
The written historical evidence, at least, would confirm your strong guess that the idea of arrays itself is older than index registers.
There's a missing etymological link though: when did a sequence of data stored consecutively in memory become associated with the word "array"? Still, the earliest written reference I can find for this second stage of historical development is the 1954 preliminary report on FORTRAN.
Maybe the word "array" is somehow derived from the advent of RAM, which even in its earliest form in Williams tubes had memory locations arranged physically in two dimensions. So right from the start we have two dimensions physically, but only one dimension logically, since the earliest computer instructions only dealt with (one-dimensional) offsets, if at all. Furthermore, popular science accounts of magnetic core memory describe them in terms of arrays. To give one example,
the June 1955 issue of Scientific American (vol. 192, pp. 92–100) writes about "magnetic core arrays".
I missed that the quote was Backus's, it makes a lot more sense now.
The quality of your photos is excellent and I doubt a scanner would make them much better. I wish there was a transcript though.
As a start, here is the first letter transcribed:
to John Backus
International Business Machines Corporation
5600 Cattle Road
SAN JOSE CA 95193
U.S.A
Monday 29th of May, 1978
“To interrupt one’s own researches in order to
follow those of another is a scientific pleasure
which most experts delegate to their assistants.”
(Eric Temple Bell in
“Development of Mathematics”)
Dear John,
I do not claim to be more noble or sensible than “most
experts”; perhaps it is only because I have only one
assistant to whom I can delegate no more than one
man can do. But when you open your letter with:
“I am quite sure that you have never read
any paper I’ve sent you before”
it is my pleasure to inform you that - although
“quite sure” - you were very wrong. I very well
remember that you mailed me a sizeable paper
on reduction languages to which I owe my introduction
to that subject. I didn’t only read it, parts of it
were even studied. I also remember that it left me
with mixed feelings.
I can very well understand your excitement, although,
for lack of experience, I still cannot share it. I am
far from delighted with the state of the art of programming
today, and anyone suggesting an alternative has in
in [sic] principle my sympathy - until, of course, he loses it again,
like Terry Winograd did when he suggested natural language
programming - “natural descriptions”! - as an alternative-.
In the long run I have more faith in any rephrasing of the
programming task that makes it more amenable to mathematical
treatment. But you must have lots of patience, for the
run will be very long. It isn’t just mental inertia - that
problem can be solved generally by educating a new
generation of youngsters and waiting until the old ones
have died - . It is the development of a new set of
techniques needed to achieve a comparable degree of
proficiency.
Could you get me a copy of G.A. Mago’s (not
yet published) “A network of microprocessors to execute
reduction languages”? That might whet my appetite!
From various angles I have looked into such networks
and I am not entirely pleased with what I have
seen. Firstly I suspect our techniques for proving the
correctness of such designs: each next proof seems to
differ so much from all the previous ones. I suspect
I discovered that all I could design were special
purpose networks, which, of course, made me suspect
the programming language in the Von Neumann style
which, already before you have chosen your problem,
seems to have set you on the wrong track.
Semantically speaking the semicolon is, of course,
only a way of expressing functional composition: it
imposes the same order that can also be expressed
with brackets - innermost brackets first -. In combination
with the distribution you can generate many innermost
bracket pairs, thus expressing very honestly that it is
really only a partial order that matters. I like that,
it is honest.
When you write “one can transform programs [....]
by the use of laws [...] which are _part of the program-
ming language_” etc. I am somewhat hesitant. I am
not convinced (yet?) that the traditional separation in
fixed program, variable data and assertions is a
mistake. The first and the last - program and assertions -
are somewhat merged in LUCID, in which correctness
proofs are carried out in (nearly) the same formalism
as the program is expressed in. On the one hand that
presents a tempting unification, on the other hand I
thought that mathematicians separated carefully and
for good reasons the language they talk about and
the metalanguage in which they do so. To put it in
another way: given a functional program, I feel only
confident if I can do enough other significant things
to it besides easy [struck out three times] carrying it out. And those I don’t
see yet. The almost total absence of redundancy
is another aspect of the same worry. In the case
of a traditional program we know how to make it
redundant: by inserting assertions, combination
of text and assertions makes it into a logically
tightly knit whole, that gives me confidence. How
do we do this with functional programs? By supplying
two of them and an argument that demonstrates
their equivalence?
What about the following example? (My notation
because I lack the fluency in yours.)
(1) Consider the function f defined by:
f(0) = 0, f(1) = 1, f(2n) = f(n), f(2n+1) = f(n) + f(n+1)
(2) Consider the following program (“peven” = “positive and even”,
so for “podd”)
{N>=0} n, a, b := N, 1, 0;
_do_ peven(n) -> a, n := a + b, n/2
[] podd(n) -> b, n := b + a, (n-1)/2
_od_ {b=f(N)}
Here (1) gives a recursive definition of f, (2) gives
a repetitive program. Both definitions can be translated
in a straightforward manner into functional programs.
What would be involved in the demonstration of their
equivalence?
The above seems to me a little example of the appropriate
degree of sophistication to try your hands on. (I am
not going to try it myself, as I fear that my past
experience would misguide me: I am afraid that I
wouldn’t rise above the level of translating (2)’s
traditional correctness proof - with which I am very
familiar - in an unfamiliar notation. Good luck!)
With my greetings and warmest regards,
yours ever
_Edsger_
P.S. Please note my new postal code: 5671 AL
(capitals obligatory) _in front of_ the village name
EWD.
Maybe someone with OCR software at hand can give the typed ones a try?
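For anyone curious about the example in the letter, the two definitions can be checked against each other mechanically. Here is a short Python sketch (my rendering, not part of the correspondence): `f` implements Dijkstra's recursive definition (1), and `f_iter` transcribes the guarded-command program (2).

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def f(n):
    """Recursive definition (1) from the letter."""
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n % 2 == 0:
        return f(n // 2)                 # f(2n) = f(n)
    return f(n // 2) + f(n // 2 + 1)     # f(2n+1) = f(n) + f(n+1)

def f_iter(N):
    """Transcription of the guarded-command program (2)."""
    assert N >= 0
    n, a, b = N, 1, 0
    while n > 0:
        if n % 2 == 0:                   # peven(n)
            a, n = a + b, n // 2
        else:                            # podd(n)
            b, n = b + a, (n - 1) // 2
    return b                             # postcondition: b = f(N)

assert all(f(N) == f_iter(N) for N in range(200))
```

The equivalence argument the letter asks about rests on the loop invariant a*f(n) + b*f(n+1) == f(N): both guarded commands preserve it, and when n reaches 0 it reduces to b == f(N). (This f is the function Dijkstra elsewhere called "fusc", Stern's diatomic sequence.)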
As an addition to your transcription, here is the last letter transcribed:
to John Backus
91 Saint Germain Avenue
SAN FRANCISCO, California 94114
U.S.A.
12th July 1979
Dear John,
My schedule is very tight, and I am afraid that I
must disappoint John Williams; Monday 20 or Tuesday 21
August seem the only two slots that are really available.
I shall arrive in the USA on Sunday 29th July, and
had already committed myself for the week of Monday
30 July to Burroughs in Mission Viejo. My wife and
the two youngest children will join me for the first
three weeks of the trip, and the week of 6th August
was planned as "the real holiday." Since then Bill
McKeenan has twisted my arm: the programming
course to be given that week was so heavily over-
booked that he asked me to give a second course,
in parallel to David's. Under those circumstances
I would like to avoid further commitments in
the week starting on Monday 13: it is their last
week in the USA, and I have then been working
almost all the time!
Immediately after WG2.3 I shall go to Austin,
Texas (Burroughs and University), and then home!
If I have a completely free choice, I think I
would prefer Monday 20 slightly over Tuesday 21.
New title and abstract:
Title: Termination Detection for Diffusing Computations
Abstract: The activity of a finite computation may
propagate over a network of machines, when machines
may delegate (repeatedly) subtasks to their
neighbours; when such a computation is fired
from a single machine, we call it "a diffusing
computation." We present a signalling scheme
--to be superimposed on the diffusing computation--
enabling the machine that fired it to detect its
termination. Our signalling scheme is perfectly
general in the sense that, independently of the
topology of the network, it is applicable to
any diffusing computation. (End of abstract.)
Please give my regards to John Williams and tell
him how sorry I am that I shall miss him. I
am not familiar with distance and transport facilities
between Santa Cruz and San Jose. If it is better
when I come to San Jose the day before that
is fine with me; may I leave the logistics to
you? (Wife and children leave on Friday 17th of August.)
With my greetings and best wishes,
yours ever
_Edsger_
I'm genuinely surprised that one would say that Julia is "slowing down in development". Perhaps it's because less press is being generated about Julia? Or that the commit rate has gone down slightly, now that the easier issues have been picked off and the remaining work will take longer for the next round of incremental developments? I'm not sure what the OP meant, but from the inside, we are busier than ever.
- Both Julia Computing and the Julia Lab have grown sizably over the past two years. The Lab now houses ten full-time researchers (up from four last year), with five new students coming online over the summer and fall. We also maintain more active research collaborations with more research groups at MIT and off-campus.
- Julia is a grateful recipient of 12 Google Summer of Code slots this year, compared to 8 for 2015's Julia Summer of Code program (sponsored by the Moore Foundation) and 4 for GSoC 2014.
- JuliaCon grew from 72 attendees in 2014 to 225 in 2015 and we are on track to meet or exceed last year's ticket sales for 2016.
- New packages continue to be registered on the central METADATA repository at roughly the same rate since June 2014. http://pkg.julialang.org/pulse.html
By some measures we are still a relatively small project, but I don't see any serious evidence for the imminent heat death of the Julia universe.