NHacker Next
Writing a C compiler in 500 lines of Python (2023) (vgel.me)
Buttons840 14 hours ago [-]
After many years of programming in other languages, I finally learned C, and came to realize that there aren't actually any compilers that implement all of the C spec. Even GCC and Clang have their grey areas and their bugs.

Before this, I had thought that C was a simple language. An idea propped up by articles like this, as well as the oft-touted fact that nearly every embedded system has a C compiler; no matter what, you'll always have a C compiler.

This point was driven home by part of a blog post that simply states "you can't actually parse a C header"[0]. The blog makes a good supporting case for their claim. They link to a paper that says[1]:

> There exist many commercial and academic tools that can parse C.... Unfortunately, these parsers are often either designed for an older version of the language (such as C89) or plain incorrect. The C11 parsers found in popular compilers, such as GCC and Clang, are very likely correct, but their size is in the tens of thousands of lines.

And sure enough, in the OP linked blog post, they state they are only implementing a subset of the language. Of course, it still has value as a teaching tool; this is just a tangential fact about C I wanted to discuss.

[0]: https://faultlore.com/blah/c-isnt-a-language/#you-cant-actua...

[1]: https://hal.science/hal-01633123/document

flohofwoe 8 hours ago [-]
> I finally learned C, and came to realize that there aren't actually any compilers that implement all of the C spec.

I think the main reason for this is that the C spec was always just an attempt to somewhat harmonize the already existing features of different C compilers: implementations come first, and then after one or two decades the C committee tries to standardize the features that have survived. That doesn't mean that all C compiler vendors immediately hop on board.

But then there's of course MSVC, which after a few hopeful years of modernizing their C frontend now seems to have abandoned it again (I wonder though if MSVC as a whole has been abandoned, looking at the abundance of red for MSVC in the C++23 and C++26 tables: https://en.cppreference.com/w/cpp/compiler_support.html)

While looking like a weird approach at first, it has definitely worked better than the C++ committee's approach to accept 'ideas' into the standard without even a proof-of-concept implementation - e.g. the one good thing about C++ is that it acts as a filter to stupid ideas before they can make it into C ;)

pjmlp 5 hours ago [-]
I think it is a side effect of SFI (Secure Future Initiative) at Microsoft: Azure and Windows development guidelines say to use managed safe languages or Rust, leaving C and C++ for existing code bases.

Even though Microsoft employees tend to dismiss this in Reddit discussions, the lack of resources is quite visible.

flohofwoe 3 hours ago [-]
In that case they should really just deprecate MSVC and point C and C++ devs to Clang. Would make life a lot easier for library authors.
pjmlp 3 hours ago [-]
Clang has been included in the Visual Studio installer for ages.

It was the official answer back when they decided only to do C++ and leave C behind, until the change of heart with C11/C17 support.

Also note that Apple and Google are no longer the generous sponsors of Clang they once were, hence it has also lost steam, as the other contributors aren't at the same level as they used to be.

1vuio0pswjnm7 1 hours ago [-]
"Before this, I had thought that C was a simple language."

It was a simple language. It can still be used that way

As a hobbyist I write simple programs that can be compiled with -std=c89

I use these programs every day. They are faster than their equivalents in python, smaller than their equivalents in go, and require fewer resources and dependencies to compile than their equivalents in rust

It is easy to take something simple and make it complex

Software developers do this consistently; software/language "by committee" facilitates it

Generally developers commenting publicly do not like "simple", they prefer "easy"

C89 is still useful and there are lots of things that rely on it

dlcarrier 11 hours ago [-]
C has a lot of feature creep, and C++ is just C with extra feature creep.

The original C compiler ran on a PDP-11, which usually had just kilobytes of RAM. The syntax was written around compiling with such limited resources, hence the need for headers, primitives, semicolons, linkers, and so on.

It has changed a lot over time, but seems to be adding baggage, not removing it.

Joker_vD 8 hours ago [-]
The original C compiler had no need for headers or function prototypes/forward declarations. Of course, it also was not a single-pass compiler: it had two (and a half) passes and generated assembly that would then be assembled by a two-pass assembler.
fuzztester 6 hours ago [-]
Yes, function prototypes were introduced in the first ANSI version of C, IIRC, which came some years after the original C. The prototype feature was described in the second version of the classic K&R C book, The C Programming Language.
pjmlp 5 hours ago [-]
Function prototypes came to ANSI/ISO C via the ongoing work on ISO C++.
pjmlp 5 hours ago [-]
Only for C89 versus C++98, in current days of C2y versus C++23, they are two worlds apart.
pjmlp 13 hours ago [-]
That is a myth often spread by folks who think the K&R C book is everything there is to know, and who never opened the ISO C draft PDF, learned the differences between POSIX and the standard library, tried to make their code portable outside GCC (or nowadays Clang), or even checked the extensions chapter of their compiler's manual.
anta40 13 hours ago [-]
Who says that? I still hear K&R being recommended as introduction material.

If you want to write portable/production-grade C code, you'll definitely need to study other references.

Arch-TK 2 hours ago [-]
There are very few resources for learning C which aren't themselves full of terrible C.

If you want a short introduction, with the caveats that it only covers C89, only covers parts of it, and doesn't cover e.g. POSIX or anything outside of standard C, then K&R2 + errata is fine.

If you want a longer book on C with a more modern approach, there is K. N. King's C Programming: A Modern Approach.

Jens' book is at least vouched for by https://iso-9899.info/wiki/Books . So I have to assume it's also okay.

pjmlp 8 hours ago [-]
Because most people know no better, and recommend their UNIX heroes.

This is a more useful book for modern days,

https://www.manning.com/books/modern-c

flohofwoe already put it clearly in his comment.

adamors 10 hours ago [-]
Exactly, I was looking into refreshing my C knowledge recently and K&R is still heavily recommended.
flohofwoe 9 hours ago [-]
The problem with K&R is that there never was a third edition that covered C99, which compared to C89 is almost a new language.

K&R is an interesting historical artifact about the basic design decisions of the C language and definitely recommended reading material, but you're not doing yourself a favour using it as a reference or for learning the language. For that it is vastly outdated; C has been a much more enjoyable and powerful language since C99.

pjmlp 8 hours ago [-]
Go with such a book instead,

https://www.manning.com/books/modern-c

adamors 7 hours ago [-]
pjmlp 7 hours ago [-]
I guess it is also a good one, both authors are still active on WG14, if I am not mistaken.
4gotunameagain 12 hours ago [-]
The first link is full of youthful anger (and a bit of cringe), not recognising the immense technical debt C is carrying. It is a 50-year-old language written on a PDP.

For example, the author rages about things like integer sizes, while no serious C programmer uses ambiguously sized types.

Sure, C has issues. A lot. But it is the cornerstone of our technological marvel. Everything around you runs C, everything in space runs C as well.

Do we have much better options for some applications nowadays ? Of course.

Will C go away ? No.

sylware 9 hours ago [-]
More real-life C compilers are always a good thing. But remember: to be able to build a Linux kernel, you will need zillions of GCC extensions... (I have a sweet spot for the alignment attribute on stack variables in printk).

That said, "C" (C99 with benign bits of c11 required for modern hardware programming) is only the "less worse" compromise for a computer language: its syntax is already way too rich (I let you think about the beyond sanity computer language syntax out there, yep... even for currently hyped ones).

For instance: C should have only one loop keyword loop{}, finally a real hard compile time constant definition, no integer promotion, no implicit cast (unless void* or some generic number literals), probably explicit static/dynamic casts (but certainly not with the c++ syntax), sized types should be primitives and not the other way around (s32,u32,f64...), no typedef/typeof/generic/etc, but should include inline support for modern hardware architecture programming aka memory barriers/"atomics"/explicit memory unaligned access, etc.

The benchmark is: a small team of average devs should be able to develop a real-life compiler in a reasonable amount of time.

Ofc, the more the merrier.

1718627440 11 hours ago [-]
> [0]: https://faultlore.com/blah/c-isnt-a-language/#you-cant-actua...

This blog post is full of misconceptions.

It starts by asserting that C defines an ABI and then complains that everything is so complicated because actually C doesn't define it. C is defined in terms of the behaviour of the abstract C machine. As far as C is concerned, there is no ABI. C only prescribes meaning to the language; it does not prescribe how you actually implement it: the compiler is free to do as it pleases, including choosing the ABI within some limits.

What defines the ABI is the platform, consisting of the machine and the OS. The OS here includes at least the kernel (or some bare-metal primitives) and the compiler. And the standard C library really IS part of the compiler. That's why GCC vs Clang or glibc vs musl always comes with incompatibilities: these ARE different OSs. They can choose to do things the same, but that is because of some formal (POSIX, the platform vendor) or some informal (GCC and Clang) standards.

Yes, a lot of ABIs are defined with C syntax, but this is because C has terms for these things and isn't too new. You could specify the same thing in the language of your choice and it would describe the same ABI. Yes, int doesn't have a size independent of the platform. But if the specification didn't use C as a syntax, it would just write "this argument has the same size as described in 'Appendix X Definitions' under 'Platform's default integer size'". Writing "int" is just a shorter notation for exactly this.

> You Can’t Actually Parse A C Header

I don't know why the choice to use the compiler to implement parsing a C header is framed as a bad thing. Is relying on write(2) from the kernel a bad thing, compared to trying to bypass the kernel? The compiler is what defines the meaning of a header, so why not ask it for the result? If you don't feel like reimplementing the C preprocessor, you can also just parse preprocessed headers. These are self-contained, i.e. they don't require knowing the include path. But of course this approach comes with the caveat that when the user updates the C compiler, your knowledge has become outdated or wrong. I don't know why it is framed as weird that you need a C parser to parse C code. That is the definition of a C parser. You can't write code that parses C and is somehow not a C parser.

> 176 triples. I was originally going to include them all for the visual gag/impact but it’s literally too many for even that.

No, those are ONLY the 176 target triples (this is the LLVM term; other terms are "gnu tuple" or "gnu type") that your tool supports. There is also no definitive list; it's a syntax for describing the major components of a platform. There are decades of vendors improving their platforms in incompatible ways, so of course the description of this is messy.

See for example: https://news.ycombinator.com/item?id=43698363

And this is the test data for the source of GNU types: https://cgit.git.savannah.gnu.org/cgit/config.git/tree/tests... Note that it contains 1180 types, but of course that's also not definitive.

> pub type intmax_t = i64;

> A lot of code has completely given up on keeping C in the loop and has started hardcoding the definitions of core types. After all, they’re clearly just part of the platform’s ABI! What are they going to do, change the size of intmax_t!? That’s obviously an ABI-breaking change!

There is a reason it is called intMAX_t! It does not HAVE a definite size; it is the MAXimal size of an integer on that platform. Yes, there are problems nowadays due to ossification, but they come exactly from people like that blog author. When you want your program to have a stable ABI that doesn't change when your platform supports larger integer types, you just don't use intmax_t!

> And even then you have the x64 int problem: it’s such a fundamental type, and has been that size for so long, that countless applications may have weird undetectable assumptions about it. This is why int is 32-bit on x64 even though it was “supposed” to be 64-bit: int was 32-bit for so long that it was completely hopeless to update software to the new size even though it was a whole new architecture and target triple!

That is called ossification. When you program C you are not supposed to care about the sizes. When your program does, your program is broken/non-portable. Yes, this limits the compilers, because they don't want programs to be broken. But it is really the same as e.g. MS Windows catering to a specific program's bugs. This is not a design mistake of C:

> sometimes you make a mistake so bad that you just don’t get to undo it.

aw1621107 6 hours ago [-]
> I don't know why the choice to use the compiler to implement parsing a C header is framed as a bad thing.

Not sure I agree with this interpretation, though maybe I'm focusing on a different part of the article than you are. Where are you getting the negative sense from?

That being said, I don't think it's too hard to imagine why someone might be hesitant to use a C/C++ compiler to parse C/C++ headers: for example, it can be a pretty big dependency to take on, it may add friction for devs/users, and integration with your own tool may be awkward and/or an ongoing time sink, especially if you're crossing an FFI boundary or if the API you're using isn't stable (as I believe is the case for LLVM).

> There is a reason it is called intMAX_t! It does not HAVE a definite size, it is the MAXimal size of an integer on that platform.

I think this somewhat misses the point of the bit you quoted. In context, it's basically saying that grabbing "real" C type info for interop is so painful that people will hard-code "reasonable" assumptions instead.

> When you want your program to have a stable ABI, that doesn't change when your platform supports larger integer types, you just don't use intMAX_t!

My impression is that the problem is less intmax_t changing and more that intmax_t can change out of sync. Even if you assume every use of intmax_t in a public API corresponds to an intentional desire for the bit width to evolve over time, you can still run into nasty issues if you can't recompile everything at once (which is a pretty strong constraint on the C/C++ committees if history is any indication).

tomhow 24 hours ago [-]
Previously:

Writing a C compiler in 500 lines of Python - https://news.ycombinator.com/item?id=37383913 - Sept 2023 (165 comments)

MarsIronPI 21 hours ago [-]
I find it surprising that a single-pass compiler is easier to implement than a traditional lexer->parser->AST->emitter. (I'm not a compiler expert, though.) I'd have expected that generating an AST would be at least as simple, if not simpler. Plus by generating an AST, doing some simple optimization is a lot easier: one can pattern-match parts of the AST and replace them with more efficient equivalents. Maybe I'm overthinking this, though. I tend to like extensible program designs, even when they don't necessarily make sense for the scale of the program…

Still a really cool article and an impressive project, though. I especially like the StringPool technique; I'll have to keep it in mind if I ever write a compiler!

marssaxman 18 hours ago [-]
A single-pass compiler is easier to implement in part because you're not going to do any of that optimization. You're writing a single-pass compiler either because you're banging out a quick sketch of an idea, and you don't care about production use, or because you've time-traveled back to the '70s or the '80s, where processors were so painfully slow and memory so eye-wateringly expensive that you might not even be able to read the entire source file into RAM at once, much less convert it all into some intermediate representation before starting to write out the machine code.
dlcarrier 11 hours ago [-]
I used to write software on my HP calculator, when I was bored in classes, before smartphones existed. It used a tokenized language called RPL, and the editor would read and parse just enough tokens to fill up the screen.

I miss how fast it was, compared to modern computers. VS Code is much laggier in comparison.

fuzztester 6 hours ago [-]
Which model of HP calculator?

And what kinds of programs could you write with it?

comonoid 11 hours ago [-]
C was designed to be compiled with a single-pass compiler.
Joker_vD 8 hours ago [-]
It wasn't; the original C compiler had two passes, and built expression trees in the first pass which the second pass would turn into assembly (and the original as on UNIX also had two passes).
whobre 7 hours ago [-]
C was not designed at all. It evolved from B, by adding types to it.
fuzztester 6 hours ago [-]
Yes.

And B evolved from BCPL:

https://en.m.wikipedia.org/wiki/BCPL

I read the book about BCPL by Martin Richards, the creator of the language. It was quite interesting. IIRC, after describing the language, he gave some examples of small system software routines or utilities written in it.

https://en.m.wikipedia.org/wiki/Martin_Richards_(computer_sc...

kragen 20 hours ago [-]
I think this might depend on the language you're writing in.

Historically, at least, it's pretty verbose to define a data type in Python compared to languages that are more designed for writing compilers. Consider these definitions from my prototype Bicicleta interpreter, which is written in ML, specifically OCaml:

    type methods = NoDefs
                               (* name, body, is_positional ... *)
                   | Definition of string * bicexpr * bool * methods
    and bicexpr = Name of string
                  | Call of bicexpr * string
                  | Literal of string option * methods
                  | Derivation of bicexpr * string option * methods
                  | StringConstant of string
                  | Integer of int
                  | Float of float
                  | NativeMethod of (lookup -> bicobj)
Those ten lines of code would be ten classes in Python with an average of 1.6 attributes each. Using dataclasses or attrs, that would be 36 lines of code, and then (if you're doing it the OO way) every function that I defined on one of these OCaml types becomes a method implemented in each class implementing a particular protocol, with a copy of its argument signature in every class. (If you used namedtuple instead, it's no less code, but you write it on fewer lines.) So, for example, this function on bicexprs

    let rec freevars = function
        Name n -> stringset [n]
      | Integer _ | StringConstant _ | Float _ -> stringset ["prog"]
      | NativeMethod _ -> stringset []
      | Literal (Some selfname, methods) -> 
            StringSet.diff (freevars_methods methods) (stringset [selfname])
      | Literal (None, methods) -> freevars_methods methods
      | Derivation(object_, self, methods) ->
            StringSet.union (freevars object_) (freevars (Literal(self, methods)))
      | Call(object_, _) -> freevars object_
becomes six to eight method definitions in the different classes. (You can cut it down to six if you define an abstract base class for the constant classes.) And Literal.freevars needs an if-then-else. So that's another 20 lines of code.

Python does support pattern-matching now, so functions like this might not be any more verbose than the ML version if you program them the same way instead of in the OO fashion. I haven't tried using Python pattern-matching, so I don't really know.

In general, though, Python is more verbose than ML-family languages for this kind of thing by a factor of about 2–4, and that's before you count the test code you need in Python to get the kind of confidence in correctness that ML's type-checking gives you with no extra code. To my knowledge, Mypy doesn't do the kinds of pattern-matching-exhaustiveness checks that ML compilers do.

I've sometimes "cheated" by trying to write code like this in Python using regular tuples rather than named tuples. You can definitely make it work, but it's a real pain to debug.

Quoting Andy Chu from https://andychu.net/projects/:

> Python is not the right language for [implementing] languages. I will use OCaml for subsequent projects like this.

Python does have GC and dynamic dispatch, though, and those count for a lot.

vidarh 14 hours ago [-]
GC doesn't matter much for a simple compiler, as you either don't need to allocate much (single pass, Wirth-style compilers that generate code inline) or most of what you allocate becomes garbage all at once at the end (AST).

In my half-finished Ruby compiler prototype, even before I added type tagging (and so allocated every integer on the heap), I just didn't add a GC for a long time, because it was fine to just leak: the compiler isn't generally long-running.

kragen 14 hours ago [-]
Yeah, I never got around to writing a GC for Ur-Scheme either—you could argue that it's also "half-finished" but it does successfully compile itself correctly. It's not so much the GC that matters as it is the ability to allocate on the heap without much ceremony. I should have thought of that before I posted, and I appreciate the correction.
MarsIronPI 15 hours ago [-]
My impression is that in general the traditional approach of methods as members of a class is more verbose and less extensible than the ML/Lisp generic function approach. I know I certainly prefer generic functions when I have to design polymorphic interfaces.
kragen 13 hours ago [-]
"Generic functions" is the Common Lisp name for writing a separate method for each class, the same as in Python except that you also have to define the generic function itself before you can define the methods. I'm not sure if that's what you meant; the ML approach is quite different.

This is Common Lisp, which I am not an expert in:

    ;;; Stupid CLOS example.

    (defgeneric x (point))                  ; make X a method
    (defgeneric y (point))                  ; make Y a method

    (defclass rect-point ()
      ((x :accessor x :initarg :x)
       (y :accessor y :initarg :y)))

    (defclass polar-point ()
      ((radius :accessor radius :initarg :radius)
       (angle  :accessor angle  :initarg :angle)))

    (defmethod x ((p polar-point))
      (* (radius p) (cos (angle p))))

    (defmethod y ((p polar-point))
      (* (radius p) (sin (angle p))))

    (defgeneric move-by (point Δx Δy))

    (defmethod move-by ((p rect-point) Δx Δy)
      (incf (x p) Δx)
      (incf (y p) Δy))

    (defmethod move-by ((p polar-point) Δx Δy)
      (let ((x (+ (x p) Δx)) (y (+ (y p) Δy)))
        (setf (radius p) (sqrt (+ (* x x) (* y y)))
              (angle p) (atan y x))))

    (defmethod print-object ((p polar-point) stream) ; standard method print-object
      (format stream "@~a<~a" (radius p) (angle p)))

    (defvar p1 (make-instance 'rect-point :x 3 :y 4))
    (defvar p2 (make-instance 'polar-point :radius 1 :angle 1.047))

    ;; prints (3, 4) → (4, 5)
    (format t "(~a, ~a) → " (x p1) (y p1))
    (move-by p1 1 1)
    (format t "(~a, ~a)~%" (x p1) (y p1))

    ;; prints @1<1.047 (0.500171, 0.8659266) → @1.9318848<0.7853087 (1.366171, 1.3659266)
    (format t "~a (~a, ~a) → " p2 (x p2) (y p2))
    (move-by p2 .866 .5)
    (format t "~a (~a, ~a)~%" p2 (x p2) (y p2))
Here's a similar program in OCaml, which I am also not an expert in, using pattern-matching functions instead of methods, and avoiding mutation:

    (* Stupid OCaml example. *)

    type point = Rect of float * float | Polar of float * float

    let x = function
      | Rect(x, y) -> x
      | Polar(r, theta) -> r *. Float.cos theta

    let y = function
      | Rect(x, y) -> y
      | Polar(r, theta) -> r *. Float.sin theta

    let moved_by = fun dx dy ->
      function
      | Rect(x, y) -> Rect(x +. dx, y +. dy)
      | p ->
         let x = dx +. x p and y = dy +. y p in
         Polar(Float.sqrt(x *. x +. y *. y),
               Float.atan2 y x)

    let string_of_point = function
      | Rect(x, y) -> Printf.sprintf "Rect(%f, %f)" x y
      | Polar(r, theta) -> Printf.sprintf "Polar(%f, %f)" r theta

    ;;

    print_endline(string_of_point(moved_by 1. 2. (Rect(3., 4.)))) ;
    print_endline(string_of_point(moved_by 0.866 0.5 (Polar(1., 1.047))))
MarsIronPI 4 hours ago [-]
> I'm not sure if that's what you meant; the ML approach is quite different.

There is a difference in approach because in Common Lisp each method is a separate function definition (though macros can alleviate this), but my point is that both CL and ML are more function-oriented, if you will; i.e. "methods" (or whatever you want to call ML pattern-matched functions) aren't defined in a class body and are just ordinary functions.

I think this more function-focused approach is not only more elegant, but also more extensible and possibly less verbose when dealing with multiple classes that share the same interface.

> the same as in Python except that you also have to define the generic function itself before you can define the methods.

As a side note, though it's not terribly important, the "defgeneric" can be omitted if you don't care to specify a docstring or any special behavior.

arjvik 21 hours ago [-]
Not sure if fewer LoC necessarily implies easier!
markus_zhang 19 hours ago [-]
I think it depends on the language. I heard Turbo Pascal was pretty fast because of 1) Pascal’s language features, and 2) no optimization, in TP 1.0 at least.
MarsIronPI 17 hours ago [-]
Fast to run or fast to compile?
markus_zhang 15 hours ago [-]
Fast to compile. It was very fast compared to the others at the time. Since computers were slow (8086/80286), compiling speed actually mattered.

https://www.pcengines.ch/tp3.htm

MarsIronPI 15 hours ago [-]
As an ex-Gentoo user I can confirm that compiling speed still matters on certain machines.
markus_zhang 14 hours ago [-]
I have never worked as a systems programmer so I never had the need to compile something big. I guess it could be an issue with very large software like Oracle Database, the Linux kernel and other similar stuff.
ceronman 11 hours ago [-]
Very cool. I think Wasm is a nice instruction set, but I agree that its structured control flow is a bit weird, as is the lack of instructions for handling the memory stack. But it's much cleaner than something like x86_64.

If you are interested in learning in more detail how to write a C compiler, I highly recommend the book "Writing a C Compiler" by Nora Sandler [0]. This is a super detailed, incremental guide on how to write a C compiler. It also uses the traditional architecture of multiple passes. It uses its own IR called Tacky and it even includes some optimization passes such as constant folding, copy propagation, dead code elimination, register allocation, etc. The book also implements many more features, including arrays, pointers, structs/unions, static variables, floating point, strings, linking to the stdlib via the System V ABI, and much more.

[0] https://norasandler.com/book/

alienbaby 3 hours ago [-]
I thought I had learned a new word reading this, but instead I just have something that seems like it should be a word given the context it was discovered in. Perhaps that in itself should be considered cremement. A word that looks like it should be a word but isn't.
Liftyee 23 hours ago [-]
This article breaks it down well enough to make me feel like I could write my own C compiler targeting AVR. (I probably could... but it would not be easy.)

Never actually looked into how compilers work before, it's surprisingly similar/related to linguistics.

measurablefunc 22 hours ago [-]
It's because when Chomsky invented the theory of formal grammars he was studying natural languages and the universality of abstract grammar¹. Computer scientists later realized that they could use the same theory as a foundation for formalizing the grammatical structures of programming languages.

¹https://en.wikipedia.org/wiki/Chomsky_hierarchy

userbinator 14 hours ago [-]
You should study C4, which is a C(subset) compiler in ~500 lines, but more interestingly, it can compile itself: https://news.ycombinator.com/item?id=8558822
dekhn 20 hours ago [-]
Similar experience in DNA/genome analysis. A large part of DNA analysis was based on parser theory.

This paper was my introduction to DNA analysis as well as Chomsky hierarchy: https://www.jstor.org/stable/29774782 (I wasn't able to find a free copy).

IIRC, pseudoknots in RNA require more than context-free grammars to parse.

lukan 20 hours ago [-]
"compilers work before, it's surprisingly similar/related to linguistics."

Since compilers transform languages with a clearly defined grammar ... the connection to linguistics is maybe not so surprising after all.

keyle 14 hours ago [-]
I love that graphic, so many nuggets in there, a very cute depiction of a compiler in general.
weregiraffe 1 days ago [-]
Now write a Python compiler in 500 lines of C.
bluGill 21 hours ago [-]
I could probably do it, but you wouldn't like it. My dictionaries would be a linked list; looking up a key becomes a linear search... (if you gave me C++ I'd use std::map). I'm assuming you will allow me to use the C standard library; if I have to implement strlen or malloc in that 500 lines of C I'm not sure I can pull it off. 500 lines is aggressive, but the IOCCC gives me plenty of tricks to get the line count down, and the language isn't that big. I'm also going to assume 100% valid Python code is fed in; if there is a bug or error of any sort, that is undefined behavior.

Note that most of what makes Python great isn't the language, it is the library. I believe that large parts of the Python library are also written in C (for speed), and thus you won't be able to use my 500-line Python compiler for anything useful because you won't have any useful libraries.

jlokier 5 hours ago [-]
If you can search (and do other operations) on a linked list, and the lists are long enough to matter, you can trivially speed it up to a fixed-size hash table with hardly any code changes.

This:

    entry = list_search(key, list);
becomes:

    entry = list_search(key, lists[hash(key) % (sizeof(lists)/sizeof(lists[0]))]);
This:

    list_add(key, entry, list);
becomes:

    list_add(key, entry, lists[hash(key) % (sizeof(lists)/sizeof(lists[0]))]);
etc.

A simple hash(), not high quality but good enough for non-adversarial inputs, can be a one-liner.

A good dictionary does more than this of course, especially dynamic resizing, but this simple change can radically speed up simple linked-list code when the lists are long, you don't have a hash table implementation you can use, and you want to keep the code change very small.

The same principle can be used with other things than lists too.
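Put together, the transformation described above might look like this in C (a toy sketch; `Entry`, `hash`, and the bucket count are illustrative, not code from any real project):

```c
#include <string.h>

#define NBUCKETS 64

typedef struct Entry {
    const char *key;
    int value;
    struct Entry *next;
} Entry;

// A one-liner-grade hash: djb2-style, fine for non-adversarial input.
static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

// The linked-list search itself is completely unchanged...
static Entry *list_search(const char *key, Entry *list) {
    for (; list; list = list->next)
        if (strcmp(list->key, key) == 0) return list;
    return NULL;
}

// ...but each key now hashes to one short list instead of one long one.
static Entry *buckets[NBUCKETS];

static Entry *table_search(const char *key) {
    return list_search(key, buckets[hash(key) % NBUCKETS]);
}

static void table_add(Entry *e) {
    Entry **slot = &buckets[hash(e->key) % NBUCKETS];
    e->next = *slot;
    *slot = e;
}
```

Every lookup still does a linear scan, but over roughly 1/NBUCKETS of the entries.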

bluGill 39 minutes ago [-]
Those would be useful optimizations once the simplest thing works. However, the goal here is 500 lines - not fast, correct, maintainable code. If I were to write this for real-world use I wouldn't start with C in the first place.
dekhn 20 hours ago [-]
A hash table in C is about 30 lines of code, so I don't think you have to stick to linked lists for dictionaries.
threeducks 8 hours ago [-]
9 lines seem to be sufficient (assuming string keys and int values).

    // Hashtable definition: open addressing with linear probing.
    // Note: stores key pointers directly, so keys must outlive the table.
    // Returns a pointer to the value slot, or NULL if the table is full.
    #include <string.h>
    #define N 1024
    int* map_ptr(const char **keys, int *values, const char *key){
        size_t h = 0, c = 0, i;
        for (const char *p = key; *p; p++) h = h * 33 + (unsigned char)*p;
        for (i = h % N; c < N && keys[i] && 0 != strcmp(keys[i], key); i = (i + 1) % N, c++);
        if (!keys[i]) keys[i] = key;
        return 0 == strcmp(keys[i], key) ? &values[i] : NULL;
    }

    // Example usage
    const char *keys[N];
    int values[N];

    #include <stdio.h>

    int main(){
        // Set some values
        *map_ptr(keys, values, "one") = 1;
        *map_ptr(keys, values, "two") = 2;
        *map_ptr(keys, values, "three") = 3;

        // Retrieve values
        printf("one: %i\n", *map_ptr(keys, values, "one"));
        printf("two: %i\n", *map_ptr(keys, values, "two"));
        printf("three: %i\n", *map_ptr(keys, values, "three"));

        return 0;
    }
ludocode 19 hours ago [-]
Indeed, a decent closed-addressing (chained) hash table is maybe 30 lines. An open-addressing hash table with linear probing is even less, especially if you don't need to remove entries. It's almost identical to a linear search through an array; you just change where you start iterating.

In my first stage Onramp linker [1], converting linear search to an open hash table adds a grand total of 24 bytecode instructions, including the FNV-1a hash function. There's no reason to ever linear search a symbol table.

[1]: https://github.com/ludocode/onramp/blob/develop/core/ld/0-gl...
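For reference, the FNV-1a hash mentioned above really is tiny. A 32-bit version in C (the constants are the published FNV offset basis and prime):

```c
#include <stdint.h>

// 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;       /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}
```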

bluGill 19 hours ago [-]
A linear search may be faster because it is cache- and branch-prediction friendly. Benchmarks on real-world data are needed to make a final call.
bluGill 19 hours ago [-]
Sure, but a linear search is ~5 lines; when my limit is 500 lines of C I don't dare spare those lines.
wyldfire 23 hours ago [-]
A Python VM that consumes bytecode might be doable in a not-ludicrous amount of C. Not 500 lines, I suppose, but something manageable, I think? Especially if you targeted the older releases.
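The heart of such a VM is just a dispatch loop over opcodes. A toy stack machine in C (made-up opcodes, nothing like CPython's actual bytecode) gives the flavor:

```c
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

// Execute a toy bytecode program; returns the last value printed.
int run(const int *code) {
    int stack[64], sp = 0, last = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH:  stack[sp++] = *code++; break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_MUL:   sp--; stack[sp-1] *= stack[sp]; break;
        case OP_PRINT: last = stack[--sp]; printf("%d\n", last); break;
        case OP_HALT:  return last;
        }
    }
}
```

For example, `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT}` computes (2 + 3) * 4 and prints 20. A real Python VM layers objects, frames, and exception handling on top of exactly this skeleton.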
jonjacky 14 hours ago [-]
In the CPython reference interpreter, that VM can be found at

https://github.com/python/cpython/blob/main/Python/ceval.c

It's 3619 lines. It's explained in this 515 line file:

https://github.com/python/cpython/blob/main/InternalDocs/

For comparison, there is a pure Python bytecode interpreter, its VM is here:

https://github.com/nedbat/byterun/blob/master/byterun/pyvm2....

It's 1043 lines.

nickpsecurity 21 hours ago [-]
Maybe 500 lines of Pythonic, macro-heavy C. If the macros' LOC don't count. Maybe.
TZubiri 21 hours ago [-]
Not to be that guy, but Python is an interpreted language.

That said, I guess technically you could make something that compiles Python to an executable? This is Hacker News, after all.

vidarh 14 hours ago [-]
A language is not inherently interpreted or compiled.

Some languages are more or less easy to compile efficiently and without embedding a JIT compiler, but any language can be compiled.

For Python in particular, there are already compilers.

If you want a nightmarish language to compile, look at Ruby. There are compilers even for Ruby.

nurettin 13 hours ago [-]
Python is just as much of a nightmare, maybe even more. You can add class and instance accessors at runtime, and it supports full monkey patching just like Ruby does. You can metaprogram modules, classes, and objects; you can decorate classes and functions, and declare functions and lambdas anywhere. "Compilers" usually disallow this kind of monkey business and compile only a subset.
vidarh 8 hours ago [-]
While my (half finished, buggy) Ruby compiler is definitely a subset, it allows the monkey business. There are ways to do it reasonably, but making it fast gets a lot harder when you do...
nurettin 4 hours ago [-]
Best in Class seems to be JRuby, but it needs a JVM in between.
zahlman 12 hours ago [-]
This isn't even correct for the one specific reference implementation you're presumably thinking of. It is just as much "compiled" as Java or C#, just that the compilation is often done on the fly. (Although IIRC C# does some additional trickery to pretend that its bytecode is "real" executable code, wrapped in a standard .exe format.) Presumably you've noticed the __pycache__ folders containing .pyc files; those are compiled bytecode. When you install Python, typically the standard library will all be precompiled (at least in part, so that those bytecode files can be created by an admin user now, and used by a standard user later). There is an interpretive REPL environment, but that works by doing the same bytecode-compilation each time.
tvickery 24 hours ago [-]
[flagged]
rossant 22 hours ago [-]
Now do it without imports.
emilbratt 22 hours ago [-]
That is not a compiler. That is called a wrapper script. But funny nonetheless.
amszmidt 22 hours ago [-]
The original cc was, like this Python example, just a wrapper around a bunch of external programs, calling c00, c01, and so on until something could be fed to as and then linked using ld.

GCC does basically the same thing even today.

01HNNWZ0MV43FF 21 hours ago [-]
yeah but c00 and c01 actually do stuff
TZubiri 21 hours ago [-]
So does gcc
TZubiri 21 hours ago [-]
We've come full circle