Having client state just be a replica of server state solves so many problems I don't understand why the concept never caught on. PouchDB/CouchDB are still the only ones doing it, afaik.
Instead we have a bajillion layers of CRUD all in slightly different protocols just to do the same read or write to the database.
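A minimal sketch of the idea (invented names, not PouchDB's actual API): the client keeps a local replica and folds the server's change feed into it, so reads are always local and a stale replica is just an older version of the truth.

```javascript
// Toy sketch: client state as a replica of server state.
// The client applies the server's change feed to a local copy;
// reads hit the replica, writes are just "changes to sync later".
// (Illustrative only -- not PouchDB's actual API.)
function applyChanges(replica, changes) {
  for (const change of changes) {
    if (change.deleted) {
      delete replica[change.id];
    } else if (!replica[change.id] || replica[change.id].rev < change.rev) {
      // newer revision wins; stale changes are ignored
      replica[change.id] = { rev: change.rev, doc: change.doc };
    }
  }
  return replica;
}

const local = {};
applyChanges(local, [
  { id: 'post:1', rev: 1, doc: { title: 'hello' } },
  { id: 'post:1', rev: 2, doc: { title: 'hello, world' } },
]);
console.log(local['post:1'].doc.title); // hello, world
```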
When your data is public and immutable, this approach is very pleasant. The client becomes just another caching layer and worst case it's presenting a historical version of the truth. You can even extend this across tabs with things like local storage.
This breaks down quickly once you have data that could become private or mutate rather than append.
Couchbase (with its Sync Gateway) uses "channels" to sync data. It even lets you change the channel of a doc; to the sync client, the doc appears deleted if the user is not part of the new channel.
Yeah, CouchDB more or less requires you to replicate data for individual users if you need complex permissions and want the user to access the CouchDB directly.
Permissions in general need to be handled by custom reconciliation functions (dropping unauthorized changes) or some kind of nanny system that can react to changes.
For example, imagine blog posts as documents, and a list of comments inside that document. Instead of the user adding/changing the comment list, the user would add a record to a comment request list, and either the reconciliation process or a nanny service checks the requests and updates the comment list.
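In CouchDB that check can live in a design document's validate_doc_update function; here's a rough sketch (the comments/commentRequests field names are invented for the example):

```javascript
// Sketch of a CouchDB validate_doc_update function enforcing the
// "comment request" pattern: ordinary users may only append to
// newDoc.commentRequests; only an admin (the nanny/reconciliation
// service) may touch the real comments list.
// Field names (comments, commentRequests) are invented for this example.
function validate_doc_update(newDoc, oldDoc, userCtx) {
  const isAdmin = userCtx.roles.indexOf('_admin') !== -1;
  if (isAdmin) return; // the reconciliation service may do anything

  const oldComments = JSON.stringify((oldDoc && oldDoc.comments) || []);
  if (JSON.stringify(newDoc.comments || []) !== oldComments) {
    throw { forbidden: 'comments may only be changed by the nanny service' };
  }

  const oldReqs = (oldDoc && oldDoc.commentRequests) || [];
  const newReqs = newDoc.commentRequests || [];
  // requests are append-only: the old list must be a prefix of the new one
  if (JSON.stringify(newReqs.slice(0, oldReqs.length)) !== JSON.stringify(oldReqs)) {
    throw { forbidden: 'comment requests are append-only' };
  }
}
```

An admin-role nanny service would then consume commentRequests and write the real comments list.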
The much simpler solution, of course, is to not give users any write access to the CouchDB and just use a REST API. But then you lose much of the benefit of CouchDB...
I think the Firebase Realtime Database and Firestore have a good model for offline and being able to have private and mutable data. It does get complex but the Firebase SDKs do the heavy lifting for you here.
Can you share more about the private and mutable data functionality for Firebase? I used them a few years back, and never really understood how to do private data without building my own ACL inside Firebase.
Thanks for the response. I suppose my biggest question is not how I can store private data (I can just make it inaccessible via the right rules). But, it seems like I am then layering my own ACL system onto those rules. And, I never got a sense there was an easy way to write a test that simulated my rules against my data and made sure I was not accidentally creating a leaky rule.
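For example, the kind of rule I mean, written from memory so the exact syntax may be off:

```
rules_version = '2';
service cloud.firestore {
  match /databases/{database}/documents {
    // only the signed-in owner can read or create their own notes
    match /notes/{noteId} {
      allow read: if request.auth != null
                  && request.auth.uid == resource.data.ownerId;
      allow create: if request.auth != null
                    && request.auth.uid == request.resource.data.ownerId;
    }
  }
}
```

My worry was never writing a rule like that; it was proving that no other path through the rules leaks the same data.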
In so many ways it is SO much easier to use Firebase because all the pieces are right there as compared to a DB + Server + Front End + Tooling. But, I still always worried that I would somehow leave a gaping hole in my data and not know about it.
And, I was never really sure how I could easily do joins across data without writing my own bespoke metalanguage inside Firebase. A link posted today on HN talked about how XML turns out to be good for nested data (hence its use for UIs), and it feels like Firebase, being more or less JSON, loses in this respect.
That's just my experiences, and I say that loving Firebase.
Those two examples made me think: Firebase removes a lot of complexity for me, but it forces me to write my own layer of complex access and DB logic which I never felt fully qualified to do, and as such, just went back to using databases with an ORM and a backend server.
In reality, data can't become private again after being made available. You may try to contact all users to delete their copies, but they may not comply.
That's in theory. In actual reality, if you're making an app that has private data and is not crawled by bots, most of the time users don't save everything that they see.
I think the parent means clients (user agents), not actual human users. As soon as any state that needs to be hidden is exposed to clients, that's a security breach regardless of whether any human eyeballs have seen it.
This is true, too, but I meant human users as well. It's all too common for users to screenshot things, etc. I see it a lot on Twitter, for example: you can delete tweets, but oftentimes it's pointless. I also regularly see my gf taking photos with her phone of various apps she uses on her notebook, just to have the info around on her phone. I suspect it's pretty common, because it's much more low-tech than saving pages or API scraping.
In short, server data is more normalized than the data the client needs. As you get closer to the view layer, the data gets denormalized further and further. Client-server interaction sits somewhere in the middle, to minimize both the bytes over the wire and the round-trips to the backend needed to stay up to date.
Take a look at GraphQL: its central promise is to let the client choose the optimal shape of the data it needs (often denormalized, via nested GraphQL queries), and receive it in one batch.
This is not to say there shouldn't be a simple replica. It's just that if we want a simple replica, it should be a server-side, somewhat-denormalized mirrored representation rather than the raw server data models.
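For example, a single query can ask for exactly the denormalized shape a post view needs in one round-trip (the schema fields here are hypothetical):

```graphql
# Hypothetical schema: one query returns the denormalized view --
# the post, its author's display name, and the latest comments --
# in a single round-trip.
query PostView($id: ID!) {
  post(id: $id) {
    title
    body
    author { displayName avatarUrl }
    comments(last: 10) {
      body
      author { displayName }
    }
  }
}
```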
Data security is a huge issue. Facebook.com has very specific whitelisted access patterns encoded as CRUD endpoints. 1) Users want to make sure their data is used in non-creepy or non-stalky ways; 2) Facebook's business needs to control the access point so they can serve you ads or otherwise monetize. So the API exposes only limited access patterns, the API TOS disallows caching, and they go to great lengths to prevent scrapers.
If immutable fact/datom streams with idealized cache infrastructure become a thing (and architecturally I hope they do), they're going to need DRM to be accepted by both users and businesses.
I'm assuming that the client state would be a lazy representation of the back end state, that only pulls data as needed. The result being that local and server state must both be treated only as asynchronously accessible.
There is a limit on what data can be replicated with the client.
For a simple chat-app you can replicate all messages of a user. But you would never replicate the whole state of wikipedia to make it searchable.
Wrong. This is popular anti-Meteor FUD spread by people who don't know how to use its features properly, or who lack the engineering/computer science background to design a system that can manage computational complexity or scale.
In 2015, my business implemented a Meteor-based real-time vehicle tracking app utilising Blaze, Iron Router, DDP, and Pub/Sub.
Our Meteor app runs 24 hrs/day and handles hundreds of drivers checking in every few seconds whilst publishing real-time updates and reports to many connected clients. Yes, this means Pub/Sub and DDP.
This is easily being handled by a single Node.js process on a commodity Linux server consuming a fraction of a single core’s available CPU power during peak periods, using only several hundred megabytes of RAM.
How was this achieved?
We chose to use Meteor with MySQL instead of MongoDB. When using the Meteor MySQL package, reactivity is triggered by the MySQL binary log instead of the MongoDB oplog. The MySQL package provides finer-grained control over reactivity by allowing you to provide your own custom trigger functions.
Accordingly, we put a lot of thought into our MySQL schema design and coded our custom trigger functions to be as selective as possible, preventing SQL queries from being needlessly executed and wasting CPU, IO and network bandwidth by publishing redundant updates to the client.
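Roughly, such a trigger is just a predicate over changed rows deciding whether a published query must re-run; a sketch of the idea (schema and call shape invented here, the actual Meteor MySQL package wires this up differently):

```javascript
// Sketch: decide whether a binlog row change should re-run a
// published query. Being selective here is what prevents wasted
// re-queries and redundant updates to clients.
// (Table, fields, and API shape are invented for illustration.)
function positionsTrigger(subscribedVehicleIds) {
  return {
    table: 'vehicle_positions',
    // only re-publish when a subscribed vehicle actually moved
    condition: (oldRow, newRow) =>
      subscribedVehicleIds.has(newRow.vehicle_id) &&
      (oldRow === null ||
        oldRow.lat !== newRow.lat ||
        oldRow.lng !== newRow.lng),
  };
}

const trigger = positionsTrigger(new Set([7, 9]));
console.log(trigger.condition(null, { vehicle_id: 7, lat: 1, lng: 2 })); // true
```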
In terms of scalability in general, are we limited to a single Node.js process? Absolutely not - we use Nginx to terminate the connection from the client and spread the load across multiple Node.js processes. Similarly, MySQL master-slave replication allows us to spread the load across a cluster of servers.
For those using MongoDB, a Meteor package named RedisOplog provides improved scalability with the assistance of Redis's pub/sub functionality.
That's exactly the model the Apollo Client library uses (a GraphQL-based data store for React), and teams I've spoken to that tried it are quite enthusiastic for this reason.
Same way you scale replication for any server, by sharding and only replicating the shards you care about.
The "shard" could just be the user's own feed in this case. Then you get offline for free: the user adds a tweet, it appears immediately, and it replicates back to the server when they go back online. The server side of the replica will need to be a lot more complicated to deal with broadcasting, but I don't see why it wouldn't work.
I attempted something similar on a current project; the problem is with initial data loading. If you are hitting the URL/page for the first time, you are waiting minutes or more for non-trivial data sets.
Why not just load what is needed and hydrate the data over time? What about datasets where you need pagination/ordering etc., and the only way to guarantee order is to pull the whole set?
In twitter-like applications, the default ordering is usually just ORDER BY timestamp DESC. You could rely on this default ordering to load the first few dozen items on first visit, and load the remainder asynchronously. Sort of like automatic infinite scrolling.
Of course, users with limited RAM and metered connections won't like that. Which is another reason why it didn't catch on.
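A sketch of that first-visit strategy (plain JS, names invented): sort by timestamp descending, render one page, and let the rest of the replica hydrate in the background.

```javascript
// Sketch: first paint loads only the newest page by timestamp
// (the equivalent of ORDER BY timestamp DESC LIMIT pageSize);
// the rest of the replica hydrates asynchronously.
function firstPage(tweets, pageSize = 50) {
  return [...tweets]
    .sort((a, b) => b.timestamp - a.timestamp)
    .slice(0, pageSize);
}

const page = firstPage(
  [{ id: 1, timestamp: 10 }, { id: 2, timestamp: 30 }, { id: 3, timestamp: 20 }],
  2
);
console.log(page.map(t => t.id)); // [ 2, 3 ]
```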
I've heard somewhere that Twitter maintains a copy of all tweets in a user's timeline for that user.
If I had to make a Twitter clone with CouchDB, I would probably have one timeline document per user, and maybe one per day to limit the syncing bandwidth.
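A sketch of that ID scheme (invented, just to illustrate the per-user-per-day granularity):

```javascript
// Sketch: one timeline doc per user per day, so a client only
// syncs the days it cares about. The ID scheme is invented.
function timelineDocId(userId, date) {
  return `timeline:${userId}:${date.toISOString().slice(0, 10)}`;
}

console.log(timelineDocId('alice', new Date('2019-10-24T12:00:00Z')));
// timeline:alice:2019-10-24
```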
> Now that I think about it, languages that aren't subject-verb-object probably find OOP insane.
I'm not sure that's true in this generality. After all, "agent.doSomeThingWith(object)" is not a declarative sentence, it's a command. So you'd have to look for languages where imperatives don't have subject-verb-object as an acceptable order.
I'm sure there exist some, but at least I would expect them to start with the agent you are commanding.
The agent still needs to hear the complete command to carry it out, so there's no strong pressure to prefer a particular word order.
In German, the difference between the imperative "Machen Sie das!" [Do that, you (polite form of address).] and the declarative "Sie machen das." [You (polite form of address) are doing that.] is that the imperative does not put the agent first.
Interesting point about German, though du/Sie isn't the kind of "agent" I was thinking of. I meant something where you name the agent by name, like "Frau Huber, machen Sie das!". And yes, I know you can put the name at the end as well.
The du/Sie in such sentences is something that's specific to German compared to the other languages I know, which don't need (or even allow) the pronoun: "do this", "faites ça", "gjør det" are all complete sentences. In fact even in German it's specific to Sie, for while you can say "mach Du das", "mach das" is fine as well. (If you speak a different dialect from mine, you might insist on "mache" instead of "mach", but around here that ship has sailed.)
Interesting how German doesn't allow dropping the addressing pronoun due to how the sentence structure works, but Scandinavian readily allows even the number of addressees to be derived from context. I guess I prefer English over German as the language for programming because it shoehorns a lot of structure onto a later, more boiled-down incarnation of Germanic, and notably (therefore?) ships many short and flashy words, such as "if", "on", "not", "do".
Possibly dumb question: How do you ensure there's no data leakage when benchmarking transfer learning techniques? Is that even a problem anymore when the whole point is to learn "common sense" knowledge?
For example, their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume are also scraped from the web.
Hi, one of the paper authors here. Indeed this is a good question. A couple of comments:
- Common Crawl overall is a sparse web dump; it is unlikely that the month we used includes any of the data that are in any of the test sets.
- In order for the data to be useful to our model, it would have to be in the correct preprocessed format ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format, so I don't believe this would ever happen.
- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.
- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once, ever, over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
However note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of training for GPT-2, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data) but I would be happy to be proven otherwise.
I think that this is a good question that I would also like to know the answer to. Additionally, are there other benchmarks or tests where this issue (possibly) presents itself?
> If your app is architected to have any rendering operation take long enough that it's perceptible to the user, STOP DOING THAT THING DURING RENDERING.
As is tradition, frontend developers are still slowly reinventing the techniques from other fields.
The neat thing about UI is that your end user is a human and a human can't visually process too many items at once on a screen. This means you should never really need to consider more than O(thousands) elements to render either. That should always fit under 16ms.
This has been known to game developers since forever. Don't spend resources on things that won't be displayed. If you are doing too much work that means you aren't culling invisible objects or approximating things at a lower level of detail.
In practice for frontend developers, this just means using stuff like react-virtualized so that even if you have a list of a bajillion objects, only the stuff in the viewport is rendered.
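The windowing arithmetic behind such libraries is small; a sketch (fixed row height assumed, names invented):

```javascript
// Sketch of the windowing arithmetic behind libraries like
// react-virtualized: given scroll position and a fixed row height,
// only O(viewport) rows out of a bajillion are ever rendered.
function visibleRange(scrollTop, viewportHeight, rowHeight, totalRows, overscan = 3) {
  // overscan renders a few extra rows above/below to hide scroll gaps
  const start = Math.max(0, Math.floor(scrollTop / rowHeight) - overscan);
  const end = Math.min(
    totalRows,
    Math.ceil((scrollTop + viewportHeight) / rowHeight) + overscan
  );
  return { start, end }; // render rows [start, end)
}

console.log(visibleRange(10000, 600, 20, 1e6));
// { start: 497, end: 533 }
```

Variable-height rows and reflow are what make the real libraries (and the assumptions game engines get to make) much hairier than this.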
I don't think there's any slow reinvention going on in the way you're describing -- the adoption of ideas from (for example) game development has been a total non-secret since the dawn of the React age. There's definitely slow reimplementation going on though, and unfortunately it's having to be done at the scripting level, rather than the engine level.
Yea you're right, I guess there is a distinction between reinventing and reimplementing. The authors even mentioned that concurrent mode is inspired by "double buffering" from graphics.
I am just trying to support the grandparent comment that the current optimization is at the wrong level of abstraction. React is at the level of a "scene graph" (dom). If it's slow, the first technique to reach for should be stuff like "occlusion culling".
That's largely analogous to what people attempt to do with virtualisation, but the ability of things to have unpredictable resizing and reflow behaviours throws a few spanners into the works. My understanding is that (at least until recently; things might be different now) most game engine optimisations depend on being able to make a lot of assumptions that usually hold true, or on being able to calculate most of your scene graph optimisations during a build step.
It's easy to gain physical intuition because you can often explain one physical phenomenon in terms of another physical phenomenon that you have much more real life experience with.
But with mathematics, "intuitive" analogies are all in terms of other mathematical objects! You can't build intuition if you don't even know what they're trying to abstract over.
In that regard, The Princeton Companion to Mathematics is fantastic because it maps out how the different fields of mathematics are interrelated.
Throwing promises to emulate algebraic effects seems really hacky.
I wonder if React would've needed to go down this path if they were built by a different company.
For example, Google/Apple/Microsoft all have huge influence over browser implementations (Chrome/Safari/Edge respectively). They might've been able to push an ECMAScript proposal for proper language support instead.
It's a very clever hack regardless but it only seems necessary because Facebook doesn't have control over the whole stack.
> push a ECMAScript proposal for proper language support instead
I don't think this is the way it does (or should) work, though.
No company should be able to just come up with essentially an application layer pattern and immediately push that pattern down the stack, just because it's their idea and they think it's good.
A pattern needs to get a lot of practical usage, be iterated on a lot, and have strong consensus behind it before it's pushed down the stack to become a more fundamental building block in ECMAScript / browser APIs.
It's a form of make it work -> make it good -> make it fast. The process should take years. It's what happened with jQuery which eventually inspired the native DOM selector APIs, and I think the same is happening with React.
I honestly wouldn't be surprised at all to see JSX standardised as part of ECMAScript within the next 5 years. The rest of the stuff they're doing with the fibre architecture etc. may follow too, eventually, if it turns out to be a really good idea after years of people using it in practice and agreeing it really ought to be part of the stack.
I agree. JavaScript obviously has problems but it’s also the most used development platform of all time.
Part of that is because the runtime API really doesn’t change much. And for many years the only changes allowed were very tightly scoped: a single new function call that takes two strings, stuff like that. The attitude was “what is the smallest thing we can do to make it technically possible to get the job done”, leaving developer ergonomics to the web frameworks.
It forces people to build real platform layers on top of the browser. That causes fragmentation but it is also a hot crucible for competition. It has led to arguably the most competitive developer tool landscape of all time.
It’s easy to say “browsers should just standardize on X” and be right.
However, it’s exactly the very slow acceptance of standards that has led to such a strong ecosystem of ideas. So in the aggregate, as you indicate, it’s not good to land on standards too fast.
Members of the React team are active in TC39. There are some features of React that have been replaced by language features (like spreading props into another component).
I was really hoping this would be about "Philosophy of Science"[0]
For example, in "The Problem of Induction" you can see a bajillion sunrises but can't infer that there will be another. Likewise, you can conduct a bajillion controlled experiments and can't really tell whether the result will generalize. Worse yet, the more conditions you control, the less likely it is to generalize. Infamously, clinical trials wanted to control for the variation caused by periods, so they only tested on men, and it turns out some drugs affect women differently! In general, if you control for everything, you can't say anything about the real world outside of those precise lab-controlled conditions.
> For example in "The Problem of Induction" you can see a bajillion sunrises but can't infer that there will be another.
Which is why scientists, like everyone else, use statistical reasoning.
> In general, if you control for everything, you can't say anything about the real world outside of those precise lab controlled conditions.
This is one of those statements which is only true if you ignore all the times it's false. It's like Zeno's Paradox: If this really troubles you, it's a sign your model's incomplete, not that there's something irreconcilably wrong with the universe. Some of what you're getting at can be fixed with better experimental design, but the rest can only be fixed by, as I said, statistical reasoning, which is universally used in practice.
It's possible to be somewhat uncertain. Some people don't get that, or refuse to acknowledge it so they can sow FUD.
This is a classic philosopher vs. pragmatic discussion.
Zeno: "How can movement exist? A flying arrow, at any point in its trajectory, is static, but at the next moment, it-"
Pragmatic: "Hey, I shot ten arrows. They took 4 seconds on average to reach the target at 200 m, so they move at 50 m/s; therefore movement does exist. Can we eat something? I'm hungry."