RX – a new random-access JSON alternative

(github.com)

120 points | by creationix 14 hours ago

22 comments

  • NoSalt 2 minutes ago
    Why do we need an "alternative" when JSON, itself, is so fantastic?
  • btown 10 hours ago
    This is really interesting. At first glance, I was tempted to say "why not just use sqlite with JSON fields as the transfer format?" But everything about that would be heavier-weight in every possible way - and if I'm reading things right, this handles nested data that might itself be massive. This is really elegant.

    My one eyebrow raise is - is there no binary format specification? https://github.com/creationix/rx/blob/main/rx.ts#L1109 is pretty well commented, but you can't call it a JSON alternative without having some kind of equivalent to https://www.json.org/ in all its flowchart glory!

    • creationix 6 minutes ago
      Thanks. I had this for older versions, but forgot to write it up again for the latest version.

      One old version that is meant to be more human readable/writable is jsonito

      https://github.com/creationix/jsonito

      I'll add similar diagrams and docs for the format itself here.

  • Levitating 11 hours ago
    JSON is human-readable, so why even compare it with this? Is any serialization format now just a "JSON alternative"?
    • jy14898 4 hours ago
      Came to the same conclusion the moment I had to hunt to see the outputs https://github.com/creationix/rx/tree/main/samples
    • creationix 10 hours ago
      - this encodes to ASCII text (unless your strings contain Unicode themselves) - that means you can copy-paste it (good luck doing that with compressed JSON, CBOR, or SQLite). Also, there is a scale where JSON isn't human readable anymore. I've seen files that are 100+ MB of minified JSON all on a single very long line. No human is reading that without using some tooling.
      • bawolff 9 hours ago
        That kind of feels like the worst of both worlds: none of the space savings/efficiency of binary, but also no human readability.

        Being able to copy/paste a serialization format is not really a feature I think I would care about.

        • creationix 12 minutes ago
          It's a gradient. I did design several binary formats first, but for my use cases, this is actually better. There is nuance to various use cases.

          > None of the space savings/efficiency of binary

          For string heavy datasets, it's nearly the same encoding size as binary. I get 18x smaller sizes compared to JSON for my production datasets. This was originally designed as a binary format years ago (https://github.com/creationix/nibs) and then later after several iterations, converted to text.

          > Being able to copy/paste a serialization format is not really a feature i think i would care about

          Imagine being paged at 3am because some cache on some remote server got poisoned with a bad value (unrelated to the format itself). You load the value in a dashboard, but it's encoded as CBOR or some binary format, so you have to download it in a binary-safe way and upload that binary file to some tooling, or install a CBOR reader in your CLI. But then you realize that you don't have exec access to the k8s pods for security reasons, though you do have access to a web-based terminal. Again, to extract a binary value you would need to create a shell, hexdump the file, somehow copy-paste that huge hexdump from the web-based terminal to your local machine, un-hexdump it, and finally load it into some CBOR reader.

          A text format, however, is as simple as copy-pasting the value from the dashboard into some online tool like https://rx.run/ to view the contents.

      • mpeg 2 hours ago
        If one of the advantages is being copy-pastable, then I would suggest the REXC viewer give you the option to copy the REXC output; currently I have no way of knowing this by looking at your GitHub or demo viewer.

        Another thing: I put in a 400 KB JSON and the REXC is 250 KB, cool, but ideally the viewer should also tell me the compressed sizes, because that same JSON is 65 KB after zstd, and I have no idea how well your REXC will compress.

        edit: I think I figured out you can right-click "copy as REXC" on the top object in the viewer to get an output. I compressed it, and the same document as my JSON compressed to 110 KB, so this is not great... 2x the size of JSON after compression.

        • creationix 10 minutes ago
          Thanks for testing it out! Yes, the website could use some love to make everything more discoverable.

          The primary use case is not compression, it's just a nice side effect of the deduplication. This will never beat something like zstd, brotli, or even gzip.

          My production use cases are unique in that I can't afford the CPU to decompress to JSON and then parse to native objects. But with this format, I can use the text as-is with zero preprocessing and as a bonus my datasets are 18x smaller.

      • rendaw 5 hours ago
        Are there any examples? If it's ASCII I'd expect to see some of the actual data in the readme, not just API.

        Or, if I'm reading that correctly, does it only have a text encoding as long as you can guarantee you don't have any Unicode?

      • kukkamario 3 hours ago
        You don't want to copy-paste anything like that as text anyway. Just copy and paste files.

        No human is reading much data regardless of the format.

        What is the benefit over using for example BSON?

      • soco 2 hours ago
    I have an idea: why don't we all go back to using XML at this point, since every initial selling point / differentiator has been slowly eroded away?
    • Gormo 1 hour ago
      It's also quite odd to create a serialization format optimized for random access.
      • creationix 4 minutes ago
        Serialized just means encoded as a stream of bytes so that it can be transferred between systems. There are absolutely cases where you want to be able to query a value directly, like a database, instead of parsing the entire thing into memory before you can read it. Think of this as NoSQL SQLite.
      • j16sdiz 32 minutes ago
        Many serialization formats are just a memory structure dump.
      • IshKebab 48 minutes ago
        Not at all. What makes you say that?
    • dietr1ch 11 hours ago
      cat file.whatever | whatever2json | jq ?

      (Or to avoid using cat to read, whatever2json file.whatever | jq)

      • Gormo 1 hour ago
        That's not really random access, though. You're effectively just searching through the entire dataset for every targeted read you're after.

        What might be interesting is to have a tool that processes full JSON data and creates a b-tree index on specified keys. Then you could run searches against the index that return byte offsets you can use for actual random access on the original JSON.

        OTOH, this is basically just recreating a database, just using raw JSON as its storage format.

      • creationix 10 hours ago
        Or in this case, just do `rx file.rx`. It has jq-like queries built in and supports either rx or JSON as input. Also, if you prefer jq, you can do `rx file.rx | jq`
  • dtech 8 hours ago
    It's not quite clear to me why you'd use this over something more established such as protobuf, Thrift, FlatBuffers, Cap'n Proto, etc.
    • maxmcd 7 hours ago
      Those care about quickly sending compact messages over the network, but most of them do not create a sparse in-memory representation that you can read on the fly, especially in JavaScript.

      This lib keeps the compact representation at runtime and lets you read it without putting all the entities on the heap.

      Cool!

      • creationix 3 minutes ago
        Exactly. Low heap allocations when reading values is one of the main driving factors in this design!
      • IshKebab 43 minutes ago
        Amazon Ion has some support for this - items are length-prefixed so you can skip over them easily.

        It falls down if you have e.g. an array of 1 million small items, because you still need to skip over 999999 items to get to the last one. It looks like RX adds some support for indexes to improve that.

        I was in this situation where we needed to sparsely read huge JSON files. In the end we just switched to SQLite which handles all that perfectly. I'd probably still use it over RX, even though there's a somewhat awkward impedance mismatch between SQL and structs.
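The skip-over behavior described above can be illustrated with a toy length-prefixed encoding (invented here for illustration; it is neither Ion's nor RX's actual wire format):

```javascript
// Each item is encoded as "<byteLength>:<payload>", so a reader can
// hop from item to item without decoding any payloads along the way.
function encode(items) {
  return items.map((s) => `${Buffer.byteLength(s)}:${s}`).join("");
}

// Skip to item `n` by reading only the length headers before it.
function itemAt(encoded, n) {
  let pos = 0;
  for (let i = 0; ; i++) {
    const colon = encoded.indexOf(":", pos);
    const len = Number(encoded.slice(pos, colon));
    if (i === n) return encoded.slice(colon + 1, colon + 1 + len);
    pos = colon + 1 + len; // hop over the payload entirely
  }
}

const blob = encode(["alpha", "beta", "gamma"]);
console.log(itemAt(blob, 2)); // "gamma"
```

Each hop is O(1) in the item's size, but reaching item n still costs n hops, which is exactly the million-item failure mode described above and the reason an index on top helps.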

    • konart 3 hours ago
      What if you are reading from a service which already has an established API?

      It's not like you can just tell them to move to protobuf.

  • garrettjoecox 11 hours ago
    Very cool stuff!

    This did catch my eye, however: https://github.com/creationix/rx?tab=readme-ov-file#proxy-be...

    While this is a neat feature, it means this is not in fact a drop-in replacement for JSON.parse, as you will break any code that relies on the result being a mutable object.

    • creationix 10 hours ago
      True, the particular use case where this really shines is large datasets where typical usage is to read a tiny part of them. Also, there is no reason you couldn't write an rx parser that creates normal mutable objects. It could even be a hybrid one that stays lazily parsed until you want to make it mutable, and then does a normal parse to plain objects from that point on.
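That hybrid could be sketched with a Proxy (a purely hypothetical `lazyDoc` helper, not part of the rx API; `JSON.parse` stands in for a real lazy reader that would resolve keys directly against the encoded bytes):

```javascript
// Serve reads lazily from the raw text, and fall back to one full
// mutable parse the first time someone writes to the document.
function lazyDoc(jsonText) {
  let materialized = null; // full mutable parse, created on demand
  const materialize = () =>
    materialized ?? (materialized = JSON.parse(jsonText));
  return new Proxy({}, {
    get(_, key) {
      if (materialized) return materialized[key];
      // Stand-in for a lazy reader: a format like rx could resolve
      // `key` against the encoded bytes here without a full parse.
      return JSON.parse(jsonText)[key];
    },
    set(_, key, value) {
      materialize()[key] = value; // first write triggers the full parse
      return true;
    },
  });
}

const doc = lazyDoc('{"a":1,"b":2}');
console.log(doc.a); // 1 (read path, no mutable copy yet)
doc.a = 42;         // materializes a plain mutable object
console.log(doc.a); // 42
```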
  • barishnamazov 11 hours ago
    You shouldn't be using JSON for things that'd have performance implications.
    • creationix 10 hours ago
      As with most things in engineering, it depends. There are real logistical costs to using binary formats. This format is almost as compact as a binary format while still retaining all the nice qualities of an ASCII-friendly encoding (you can embed it anywhere strings are allowed, including copy-paste workflows).

      Think of it as a hybrid between JSON, SQLite, and generic compression. This format really excels for use cases where large read-only build artifacts are queried by worker nodes like an embedded database.

      • Asmod4n 5 hours ago
        The cost of using a textual format is that floats become so slow to parse that it's a factor of over 14x slower than parsing a normal integer, even with the fastest SIMD algorithms we have right now.
        • creationix 25 minutes ago
          If your data is lots and lots of arrays of floats, this is likely not the format for you. Use float arrays.

          Also note it stores decimals in a very compact encoding (two varints for the base and the power of 10)

          That said, while this is a text format, it is also technically binary safe and could be extended with a new type tag to contain binary data if desired.
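The base/power decomposition mentioned above can be sketched like so (encoding details here are guesses for illustration, not the actual rx wire format; the string input avoids float rounding issues):

```javascript
// Decompose a decimal literal into (base, power) with
// value = base * 10^power, the two integers a varint pair would store.
function toDecimalParts(s) {
  // Operate on the decimal string so 3.14 doesn't pick up float noise.
  const [int, frac = ""] = s.split(".");
  let base = BigInt(int + frac);
  let power = -frac.length;
  // Normalize away trailing zeros so the base varint stays small.
  while (base !== 0n && base % 10n === 0n) {
    base /= 10n;
    power += 1;
  }
  return { base, power };
}

console.log(toDecimalParts("3.14")); // { base: 314n, power: -2 }
console.log(toDecimalParts("1500")); // { base: 15n, power: 2 }
```

Both integers are small for typical human-scale decimals, which is what makes a varint pair compact compared to eight bytes of IEEE 754.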

        • HelloNurse 3 hours ago
          So it depends. Float parsing performance is only a problem if you parse many floats, and lazy access might reduce work significantly (or add overhead: it depends).
          • creationix 22 minutes ago
            Exactly. For my use cases, this format is amazing. I have very few floats, but lots and lots of objects, arrays, and strings with moderate levels of duplication and substring duplication. My data is produced in a build and then read in thousands or millions of tiny queries that each look up a single value deep inside the structure.

            rx works very well as a kind of embedded database like sqlite, but completely unstructured like JSON.

            Also I'm working on an extension that makes it mutable using append-only persistent data structures with a fixed-block caching level that is actually a pretty good database.

        • meehai 5 hours ago
          And with little data (i.e. <10 MB), this matters much less than accessibility and being able to easily understand the data with a simple text editor, or jq in the terminal plus some filters.
          • xxs 4 hours ago
            What do you mean by little data? Most communication protocols are not one-off.
    • hrmtst93837 3 hours ago
      That rule sounds clean until the DB dump, API trace, or language boundary lands in your lap. Binary formats are fine for tight inner loops, but once the data leaks into logs, tooling, support, or a second codebase, the bytes you saved tend to come back as time lost decoding some bespoke mess.
    • squirrellous 9 hours ago
      I agree in principle. However JSON tooling has also got so good that other formats, when not optimized and held correctly, can be worse than JSON. For example IME stock protocol buffers can be worse than a well optimized JSON library (as much as it pains me to say this).
      • tabwidth 7 hours ago
        Yeah the raw parse speed comparison is almost a red herring at this point. The real cost with JSON is when you have a 200MB manifest or build artifact and you need exactly two fields out of it. You're still loading the whole thing into memory, building the full object graph, and GC gets to clean all of it up after. That's the part where something like RX with selective access actually matters. Parse speed benchmarks don't capture that at all.
        • magicalhippo 1 hour ago
          > The real cost with JSON is when you have a 200MB manifest or build artifact and you need exactly two fields out of it.

          There are SAX-like JSON libraries out there, and several of them work with a preallocated buffer or similar streaming interface, so you could stream the file and pick out the two fields as they come along.

          • IshKebab 41 minutes ago
            You still have to parse half the entire file on average. Much slower than formats that support skipping to the relevant information directly.
        • xxs 4 hours ago
          As a parser: keep only indexes into the original file (input); don't copy strings or parse numbers at all (unless the strings fit in the index width, e.g. 32-bit).

          That would make parsing faster, and there would be very little in terms of a tree (JSON can't really contain full-blown graphs), but it's rather complicated, and it will require hashing to allow navigation, though.
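A heavily simplified sketch of that index-only idea (all names hypothetical; flat object, string values only, no escape handling, and keys are still copied where a real implementation would hash in place):

```javascript
// Tokenize JSON into (start, end) slices of the input string and defer
// the string copy until a value is actually requested.
function indexObject(text) {
  const index = new Map();
  // The /d flag exposes match offsets so we can store slices, not copies.
  const re = /"([^"\\]+)"\s*:\s*"([^"\\]*)"/dg;
  for (const m of text.matchAll(re)) {
    index.set(m[1], m.indices[2]); // key -> [start, end) of the raw value
  }
  return {
    get: (key) => {
      const span = index.get(key);
      return span && text.slice(span[0], span[1]); // copy only on demand
    },
  };
}

const doc = indexObject('{"name":"rx","kind":"format"}');
console.log(doc.get("kind")); // "format"
```

Values that are never read never leave the input buffer, which is the allocation win being described.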

    • Spivak 11 hours ago
      Can you imagine if a service as chatty and performance sensitive as Discord used JSON for their entire API surface?
  • jbverschoor 5 hours ago
    So this is two things? A BSON-like encoding + something similar to implementing random access / tree walker using streaming JSON?

    Docs are super unclear.

  • bsimpson 2 hours ago
    It feels petty to show up with a naming nit, but the name is unfortunately/confusingly similar to the already well-known RxJS.

    Why is it called RX?

  • gritzko 3 hours ago
    I recently created my own low-overhead binary JSON because I did not like Mongo's BSON (too hacky, not mergeable). It took me half a day maybe, including the spec, thanks to Claude. First I implemented the critical feature I actually need, then made all the other decisions in the least surprising way.

    At this point we probably have to think about how to classify all the "JSON alternatives", because it gets difficult to remember them all.

    Is RX a subset, a superset or bijective to JSON?

    https://github.com/gritzko/librdx/tree/master/json

  • _flux 6 hours ago
    It doesn't seem like the actual serialization format is specified anywhere, other than in the code, that is.

    Is it versioned? Or does it need to be?

  • 50lo 5 hours ago
    The biggest challenge for formats like this is usually tooling. JSON won largely because: every language supports it, every tool understands it.

    Even a technically superior format struggles without that ecosystem.

    • latexr 4 hours ago
      And that in turn affects tool adoption. I have dabbled in Lua for interacting with other software such as mpv, but never got much into the weeds with it because it lacks native JSON support, and I need to interact with JSON all the time.
  • WatchDog 8 hours ago
    Cool project.

    The viewer is cool, took me a while to find the link to it though, maybe add a link in the readme next to the screenshot.

  • creationix 14 hours ago
    A new random-access JSON alternative from the creator of nvm.sh, luvit.io, and js-git.
  • benatkin 10 hours ago
    Interesting. I've heard about cursors in reference to a Rust library that was mentioned as being similar to protobuf and cap'n proto.

    Does this duplicate the name of keys? Say if you have a thousand plain objects in an array, each with a "version" key, would the string "version" be duplicated a thousand times?

    Another project a lot of people aren't aware of even though they've benefitted from it indirectly is the binary format for OpenStreetMap. It allows reading the data without loading a lot of it into memory, and is a lot faster than using sqlite would be.

    Edit: the rust library I remember may have been https://rkyv.org/

  • Spivak 11 hours ago
    I love these projects, and I hope one of them someday emerges as the winner, because (as motivates all these libraries' authors) there's so much low-hanging fruit and so many free wins in changing the wire format of JSON while keeping the "Good Parts" like the dead-simple generic typing.

    XML has EXI (Efficient XML Interchange) for precisely the reason of getting wins over the wire but keeping the nice human readable format at the ends.

    • snthpy 7 hours ago
      TIL.

      EXI looks useful. Now I just wish there was a renderer in the pugjs format, as I find that terse format much more readable than verbose XML. I also find indentation-based syntax easier for visually parsing hierarchical structure.

  • transfire 6 hours ago
    I am a little confused. Is this still JSON? Is it “binary“ JSON?