100M-Row Challenge with PHP (github.com)
Xeoncross 3 minutes ago [-]
This is why I jumped from PHP to Go, and then from Go to Rust.

Go is the most batteries-included language I've ever used. Instant compile times mean I can run tests bound to ctrl/cmd+s every time I save a file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages), and it ships a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it, just like Python.

Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, its syntax doesn't make sense to an outsider and you end up with third-party packages for a lot of things, but you can't beat its performance and safety. It removes the need for a whole lot of tests, because those failure cases just aren't possible.

If Rust scares you, use Go. If Go scares you, use Rust.

brentroose 5 hours ago [-]
A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. With the help of many talented developers, I eventually got it to run in under 30 seconds. The optimization process was so much fun, and so many people pitched in with their ideas, that I eventually decided I wanted to do something more.

That's why I built a performance challenge for the PHP community.

The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (among the prizes is the much-sought-after PhpStorm Elephpant, of which we only have a handful left).
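
For a rough sense of what "as efficiently as possible" is up against, here's a minimal sketch of streaming the multi-gigabyte input rather than loading it into memory. The `measurements.txt` filename and the `station;value` line format are assumptions for illustration, not the challenge's actual spec:

    <?php

    // Stream the file line by line; pulling the whole file in with
    // file() or file_get_contents() would exhaust memory on most machines.
    $handle = fopen('measurements.txt', 'rb');
    if ($handle === false) {
        exit(1);
    }

    $counts = [];

    while (($line = fgets($handle)) !== false) {
        // strpos/substr avoids the per-line array allocation of explode()
        $sep = strpos($line, ';');
        if ($sep === false) {
            continue;
        }
        $station = substr($line, 0, $sep);
        $counts[$station] = ($counts[$station] ?? 0) + 1;
    }

    fclose($handle);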

I hope people will have fun with it :)

Tade0 1 hour ago [-]
Pitch this to whoever is in charge of performance at WordPress.

A WordPress instance will happily take over 20 seconds to fully load if you disable caching.

embedding-shape 1 hour ago [-]
Microbenchmarks are very different from optimizing performance in real applications in wide use, though. They could do great on this specific benchmark and still have no clue how to make something as large as WordPress perform OK out of the box.
monkey_monkey 29 minutes ago [-]
That's often a skill issue.
user3939382 2 hours ago [-]
exec('c program that does the parsing');

Where do I get my prize? ;)

brentroose 2 hours ago [-]
The FAQ states that solutions like FFI are not allowed because the goal is to solve it with PHP :)
kpcyrd 45 minutes ago [-]
What about using the filesystem as an optimized dict implementation?
olmo23 12 minutes ago [-]
This is never going to be faster, because it requires syscalls.
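
A rough way to see that cost, as a sketch (timings are machine-dependent and purely illustrative):

    <?php

    // Compare a filesystem "dict" (one file per key) against a plain
    // in-memory array. Each file read costs open/read/close syscalls;
    // the array lookup never leaves userspace.
    $dir = sys_get_temp_dir() . '/fs-dict';
    @mkdir($dir);
    file_put_contents("$dir/hello", '1');

    $start = hrtime(true);
    for ($i = 0; $i < 100000; $i++) {
        $hit = file_get_contents("$dir/hello"); // syscalls on every iteration
    }
    printf("filesystem: %.1f ms\n", (hrtime(true) - $start) / 1e6);

    $dict = ['hello' => '1'];
    $start = hrtime(true);
    for ($i = 0; $i < 100000; $i++) {
        $hit = $dict['hello'] ?? null; // pure hash-table lookup
    }
    printf("array: %.1f ms\n", (hrtime(true) - $start) / 1e6);
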
gib444 2 hours ago [-]
> A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. With the help of many talented developers, I eventually got it to run in under 30 seconds

That's a huge improvement! How much of it was low-hanging fruit unrelated to the PHP interpreter itself, out of curiosity? (E.g. parallelism, faster SQL queries, etc.)

brentroose 2 hours ago [-]
Almost all, actually. I wrote about it here: https://stitcher.io/blog/11-million-rows-in-seconds

A couple of things I did (the first three are sketched below):

- Cursor-based pagination
- Combining insert statements
- Using database transactions to prevent fsync calls
- Moving calculations from the database to PHP
- Avoiding serialization where possible
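
As a sketch of those first three ideas, with a hypothetical `measurements` table and SQLite standing in for the real database:

    <?php

    // Hypothetical schema and driver; the real setup will differ.
    $pdo = new PDO('sqlite:measurements.db');
    $pdo->exec('CREATE TABLE IF NOT EXISTS measurements
                (id INTEGER PRIMARY KEY, station TEXT, value REAL)');

    $rows = [['a', 1.0], ['b', 2.5], ['c', 3.7]]; // stand-in data

    // One transaction around the whole batch: the database fsyncs once
    // at commit instead of once per statement.
    $pdo->beginTransaction();

    // One multi-row INSERT instead of one statement per row.
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?)'));
    $stmt = $pdo->prepare(
        "INSERT INTO measurements (station, value) VALUES $placeholders"
    );
    $stmt->execute(array_merge(...$rows));

    $pdo->commit();

    // Cursor-based pagination: filter on the last seen id instead of
    // using OFFSET, which rescans every skipped row.
    $lastId = 0;
    do {
        $stmt = $pdo->prepare(
            'SELECT id, station, value FROM measurements
             WHERE id > ? ORDER BY id LIMIT 1000'
        );
        $stmt->execute([$lastId]);
        $batch = $stmt->fetchAll(PDO::FETCH_ASSOC);
        foreach ($batch as $row) {
            $lastId = $row['id'];
            // ... process $row here ...
        }
    } while ($batch !== []);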

tiffanyh 1 hour ago [-]
Aren’t these optimizations less about PHP and more about optimizing how you’re using the database?
hu3 56 minutes ago [-]
It's still valid as an example to the language community of how to apply these optimizations.
swasheck 52 minutes ago [-]
in all my years doing database tuning/admin/reliability/etc, performance issues have overwhelmingly been in the bad-query/bad-data-pattern categories. the data platform is rarely the issue
pxtail 2 hours ago [-]
Side note: I wasn't aware that there's an active collectors' scene for Elephpants, awesome!

https://elephpant.me/

t1234s 1 hour ago [-]
Elephpants should be for second and third place. First place should be the double-clawed hammer.
thih9 1 hour ago [-]
Excellent project. My favorites: the joker, php storm, phplashy, Molly.
tveita 1 hour ago [-]
> Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others

Except that the generator script generates dates relative to time()?
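
As a sketch of the reproducibility concern (illustrative only, not the actual generator code): seeding makes the random draws repeat across runs, but anything derived from time() still differs between runs.

    <?php

    mt_srand(42); // seeded: the draws below are identical on every run

    $value = mt_rand(0, 999);
    $daysAgo = mt_rand(0, 365);

    // Anchored to time(), so the output shifts from one day to the next
    // even though $daysAgo itself is deterministic.
    $date = date('Y-m-d', time() - $daysAgo * 86400);

    echo "$value,$date\n";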

Retr0id 2 hours ago [-]
How large is a sample 100M-row file in bytes? (I tried to run the generator locally, but my PHP is not bleeding-edge enough.)
brentroose 2 hours ago [-]
Around 7GB
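For scale, that works out to roughly 70 bytes per row (7×10⁹ bytes / 10⁸ rows).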
spiderfarmer 2 hours ago [-]
Awesome. I’ll be following this. I’ll probably learn a ton.