Wednesday, February 21, 2018

Two small notes on the "malicious use of AI" report

After a long hiatus on this blog, a new post! Well, not really - but a whitepaper titled "The Malicious Use of Artificial Intelligence" was published today, and I decided to cut, paste, and publish two notes from an email I wrote a while ago that apply to the paper.

Perhaps they are useful to someone:
1) On the ill-definedness of AI: AI is a diffuse and ill-defined term. Pretty much *anything* where a parameter is inferred from data is called "AI" today. Yes, clothing sizes are determined by "AI", because mean measurements are inferred from real data.

To test whether one has fallen into the trap of viewing AI as something structurally different from other mathematics or computer science (it is not!), one should try to battle-test documents about AI policy, and check them for proportionality, by doing the following:

Take the existing text and search/replace every occurrence of the word "AI" or "artificial intelligence" with "mathematics", and every occurrence of the word "machine learning" with "statistics". Re-read the text and see whether you would still agree.
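For the mechanically inclined, this test is trivial to automate. The following is a minimal sketch of the substitution; the file name and the word-boundary handling are my own choices, not anything prescribed here:

```python
import re

def battle_test(policy_text: str) -> str:
    """Replace AI/ML terminology with its mathematical equivalent,
    so the proportionality of a policy text can be re-checked."""
    substitutions = [
        (r"artificial intelligence", "mathematics"),
        (r"\bAI\b", "mathematics"),
        (r"machine learning", "statistics"),
    ]
    for pattern, replacement in substitutions:
        # Case-insensitive matching; capitalization of replacements is not preserved,
        # which is good enough for a re-read.
        policy_text = re.sub(pattern, replacement, policy_text, flags=re.IGNORECASE)
    return policy_text

# Example: read a policy draft (hypothetical file name) and print the substituted version.
with open("policy_draft.txt") as f:
    print(battle_test(f.read()))
```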

2) "All science is always dual-use":

Everybody who works at the intersection of science & policy should read Hardy's "A Mathematician's Apology".

I am not sure how many of the contributors have done so, but it is a fascinating read - he contemplates among other things the effect that mathematics had on warfare, and to what extent science can be conducted if one has to assume it will be used for nefarious purposes.

My favorite section is the following:
We have still one more question to consider. We have concluded that the trivial mathematics is, on the whole, useful, and that the real mathematics, on the whole, is not; that the trivial mathematics does, and the real mathematics does not, ‘do good’ in a certain sense; but we have still to ask whether either sort of mathematics does harm. It would be paradoxical to suggest that mathematics of any sort does much harm in time of peace, so that we are driven to the consideration of the effects of mathematics on war. It is very difficult to argue such questions at all dispassionately now, and I should have preferred to avoid them; but some sort of discussion seems inevitable. Fortunately, it need not be a long one.
There is one comforting conclusion which is easy for a real mathematician. Real mathematics has no effects on war.
No one has yet discovered any warlike purpose to be served by the theory of numbers or relativity, and it seems very unlikely that anyone will do so for many years. It is true that there are branches of applied mathematics, such as ballistics and aerodynamics, which have been developed deliberately for war and demand a quite elaborate technique: it is perhaps hard to call them ‘trivial’, but none of them has any claim to rank as ‘real’. They are indeed repulsively ugly and intolerably dull; even Littlewood could not make ballistics respectable, and if he could not who can? So a real mathematician has his conscience clear; there is nothing to be set against any value his work may have; mathematics is, as I said at Oxford, a ‘harmless and innocent’ occupation. The trivial mathematics, on the other hand, has many applications in war.
The gunnery experts and aeroplane designers, for example, could not do their work without it. And the general effect of these applications is plain: mathematics facilitates (if not so obviously as physics or chemistry) modern, scientific, ‘total’ war.

The most fascinating bit about the above is how fantastically, presciently wrong Hardy was about the lack of warlike applications for number theory and relativity - RSA and nuclear weapons, respectively. In a similar vein: I was in a relationship in the past with a woman who was a social anthropologist, and who often mocked my field of expertise for being close to the military funding agencies (this was in the early 2000s). The first thing that SecDef Gates did when he took his position was to hire a bunch of social anthropologists to help the DoD unravel the tribal structure in Iraq.

The point of this digression is: it is impossible for any scientist to imagine the future uses and abuses of their scientific work. You cannot choose to work on "safe" or "unsafe" science - the only choice you have is between relevant and irrelevant, and the militaries of this world *will* take whatever is relevant and use it to maximize their warfare capabilities.

Tuesday, August 22, 2017

A quick post on Wikipedia-scrubbing and a historical document on binary diffing

I am a huge fan of Wikipedia -- I sometimes browse Wikipedia like other people watch TV, skipping from topic to topic and - on average - being impressed by the quality of the articles.

One thing I have noticed in recent years, though, is that the grassroots-democratic principles of Wikipedia open it up to manipulation and whitewashing - Wikipedia's guidelines are strict, and a person can get a lot of negative information removed just by cleverly using the guidelines to challenge entries. This is no fault of Wikipedia -- in fact, I think the guidelines are good and useful -- but it is often instructive to read the history of a particular page.

I recently stumbled over a particularly amusing example of this, and feel compelled to write about it.

More than twelve years ago, when BinDiff was brand-new and wingraph32.exe was still the graph visualization tool of choice, there was a controversy surrounding a product called "CherryOS" - which purported to be an Apple emulator. A student had raised the allegation on his website that "CherryOS" had misappropriated source code from an open-source project called "PearPC", and the founder of the company selling CherryOS (somebody by the name of Arben Kryeziu) had threatened the student legally over this claim.

In order to help a good cause, we did a quick analysis of the code similarities between CherryOS and PearPC, and found that approximately half of the code in CherryOS was verbatim copy & paste from PearPC. We wrote a small report, provided it to the lawyer of the student under allegation, and the entire kerfuffle died down quickly. Wikipedia used to have a page that detailed some of the drama for a few years thereafter.

I recently stumbled over the Wikipedia page of CherryOS, and was impressed: The page had been cleaned of any information that supported the code-theft claims, and offered a narrative where there had never been conclusive consensus that CherryOS was full of misappropriated code. This is not a reflection of what happened back then at all.

Anyhow, in a twist of fate, I also found an old USB stick which still contained a draft of the 2005 note we wrote. For the sake of history, here it is :-)

I had forgotten how painful it was to look at disassembly CFGs in wingraph32. Sometimes, when I am frustrated at how slowly RE tools have improved over my professional life, it is useful to be reminded of what the dark ages looked like.

Saturday, October 01, 2016

"Why do you work in security instead of something more lasting ?"

This post grew out of a friend on Facebook asking (I paraphrase) "why do you spend your time on security instead of using your brainpower for something more lasting?". I tried to answer, and ended up writing a very long reply. Another friend then encouraged me to re-post my reply to a wider audience. The below is a slightly edited and expanded version. It is much less polished than my usual blog posts, more personal, and somewhat stream-of-conscious-y. Apologies for that.

Why do I work in security instead of on something more lasting?

Predictions about what is "lasting" are very difficult to make :-). I think that outside of the exploit-of-the-day, there is lasting work to be done in understanding exploitation (because machines and automata aren't going away, and neither are programming mistakes), and I sincerely hope I'll have the opportunity to do that work.

I tried my hand at cryptography / academia, and found it more prone to political trends/fads and less blindly results-oriented than security - to my great disappointment. When all attacks are of theoretical complexity 2^96, verifying and replicating results becomes difficult, and objective truth suffers (see below).

In the following, I will state a few things that I really like about the computer security community. I did not realize this immediately - instead, I learnt this over many years of engagement in other communities.
  1. Original thinkers. I used to joke that there are fewer than two dozen reasons why security as a field doesn't suck, and I know many of them personally. Now, the two dozen is bullshit, but what is true is that in all the noise & hype, I have met a number of very fun, unconventional, and deeply insightful thinkers of very different backgrounds. They are few and far between, but I wouldn't have met them without security, and I am grateful for having met them. Many exploits require considerable inventiveness, and non-obvious / creative ways of solving problems; they are sometimes like a good joke / magic trick: with an unexpected twist that makes you laugh in disbelief.
  2. Tolerance of non-conformism and diverse educational backgrounds. There are few other industries where people who did not finish high school mix with people with postgraduate degrees, and debate on even terms. With all its problems and biases, the part of the community I grew up with did not care about gender, skin color, or parental income - everybody was green text on a black screen.
  3. Intellectual honesty. When discussing attacks, there is "objective truth" - you can establish whether an attack works or does not work, and checking reproducibility is easy. This is not true in many other disciplines, where "truth" becomes a matter of social consensus - even in pure math, where proof should be absolute. Having objective truth is extremely helpful in preventing a discipline from devolving into scholasticism.
Many other fields which may be more "lasting" do not have the luxury of these three points. Also be aware that my visibility into the security community is very skewed:

My skewed view of the security community

It is common to hear negative things about the community - that it is elitist, full of posturing, or of people that are mean / demeaning to others with less experience. This is not the community I experience - and this discrepancy has been puzzling me for a while.

For one thing, everybody is always nice to me. I am not sure why this is the case, but the only non-niceties I encountered in this industry were in leaked email spools. This makes it difficult for me to notice people being elitist or mean to newcomers - and it saddens me to hear that people are being shit to each other.

People weren't always nice to me - like any group of teenagers, 1990s IRC was very often not a friendly place, and #cracking would kickban you for asking a question. I found a home of sorts in a channel called #cracking4newbies - a very welcoming environment dedicated to joint learning. It was great for me: I could ask questions, and either got answers or links to documentation. A few members of #cracking were no longer active and held status in the channel for historical reasons; #cracking4newbies, on the other hand, was full of eager & active youngsters.

I somehow managed to avoid being around the posturing and status games much, and in some bizarre stroke of luck, have managed to do so up to this day. The people in the security community I spend time with are genuinely interested in the technical challenges, genuinely curious, and usually do not care about the posturing part. The posturing may happen at industry conferences, but I tend to not notice - the technically interesting talks tend to adhere to substance-over-style, and the rest is as relevant to me as big advertisements for broken content inspection appliances.

All I want to say with this section is: I do not know how I managed to avoid experiencing the bad sides of the security community much. Some of it was luck, some of it was instinct. There are plenty of things I find annoying about the security community (but that is for another post :-), but in my day-to-day life, I don't experience much of it. If you are in security, and feel that the community is elitist or demeaning to people learning, I hope you succeed in seeking out the (many) people I encountered that were happy to share, explain, and just jointly nerd out on something. Feel free to reach out any time.

On building vs. breaking

I quite often hear the phrase "I quit security and I am much happier building instead of breaking things". This is a normal sentiment - but for me, security was never about "just" breaking things. Tooling was always inadequate, workflows were horribly labour-intensive, and problems were always tackled on the lowest level of abstraction, missing the forest for the trees.

In my reverse engineering classes, I always encourage people to be tool builders. Most security work today is akin to digging trenches with chopsticks. Invest in designing and building shovels. Perhaps we will even get a bulldozer in my lifetime. Slowly but surely, the industry is changing in that direction: Microsoft is commercializing SAGE, and no human code auditor is more productive (even if more thorough) than a farm of computers running AFL - but the discrepancy between the quality and quantity of tools that developers have available vs. the tools that security review has available is still vast.

I like my work most when I can cycle through building / breaking phases: try to break something, notice how insanely bad the tooling is, cycle through an iteration of tool development, return to breaking, and so on.

I realize this isn't the path for everybody, but I don't think that security is "always just about breaking". Even the most persistent person gets bored of digging trenches with chopsticks. Invest in tooling. Being a better developer makes you a better hacker. And perhaps you like building more than breaking - I can't fault you for that.

My friend Sören happens to be one of the best C++ developers I know. When we first met in undergraduate math class, I described what I do for a living to him (reading code for subtle mistakes), and he said "that sounds like one of the worst imaginable jobs ever". He is a builder, and I have nothing but admiration and respect for him - and from the builder's perspective, his assessment is right.

I still like finding subtle bugs. To paraphrase another person who I respect a lot: "People still search for new stuff in Shakespeare hundreds of years later".

Using security as an excuse for broad learning

I once read that "cryptography gathers many very different areas of mathematics like a focal lens". The same is very true of security and computer science. Security happens at the boundaries between layers, and I have used working in security as an excuse to learn about as many layers as possible: Low-level assembly, high-level stuff on formal verification, and even electrical engineering problems and their implications on security.
People talk about "full stack engineers" a lot; security allows me to roam the full stack of abstractions in computer science without guilt. All layers are relevant for security, all layers are interesting in their own right, and each layer has its own funny quirks.


Given the length of this blog post, it is evident that I have asked myself the question "why do I do this" many times. And I have thought about devoting attention to other things often enough. Who knows, I am 35, so I have about 30 years of professional activity ahead of me - which may be enough to fail in one or two other fields before returning to give grandfather-security keynotes. :-)

But right now, I am actually enjoying having my hands dirty and thinking about heap layout for the first time in years.

Saturday, September 03, 2016

Essays about management in large(r) organisations (1): Process and flexibility

Even though I often profess that my primary interests are technical, by this point in my life I have been exposed to a variety of different organisations and management styles: From the self-organizing chaos of the 1996-2002 cracking/hacking groups, through the small engineering-centric startup zynamics, via the various organisations (both governmental and industry) I consulted for at some point, to the large (but nonetheless engineering-centric) culture at Google.

I enjoy thinking about organisations - their structure, how information flows, their strengths and dysfunctions. Part of it may be the influence of my father (who wrote extensively on matrix organisations, but also on organisations that fail); the other part is certainly the recognition that both company culture and organisational structure matter. In any organisation, setting the culture and organisational structure - and keeping them healthy - is paramount, and probably the key element that will allow long-term success. Ignore culture and organisational structure (both explicit and implicit) at your peril.

I had a lot of time to think in the last year, so in the coming months I will write a few posts / essays about company culture and management.

The first post is about organisational processes - why they are important, but also how they can take on a life of their own and strangle flexibility.

A technical anecdote to start with

In early 2004, the first prototype of BinDiff started to work properly - just when Microsoft released MS04-001: a series of amusing little memory corruptions inside the H.323 parsing component of Microsoft ISA server (a now-discontinued firewall product). Diffing the patch with BinDiff made it evident that the problems were inside the ASN.1 PER parsing routines in a central library - but instead of fixing the library, the patch fixed the issue inside ISA server.
The patch fixed only one exploit path, but the actual vulnerability was still there. This meant that any other program using the same library remained vulnerable, and the patch had now effectively disclosed the security issue. I started searching for other applications that used this library. The first program I found which was also affected by this vulnerability was Netmeeting - Microsoft had inadvertently given a remote code execution bug in Netmeeting to everybody. It wasn't until MS04-011, at some point in April, that this vulnerability got fixed in the correct place -- the library.

The technical details of the bug are not terribly interesting - what is interesting is what the mistake revealed about flaws in Microsoft’s organisational structure, and how they reacted to the bug report.

What could we deduce/learn/extrapolate from this event?

  • Bug reports were likely routed to the product teams - i.e. if a bug is reported in your product, the bug report is routed to you.
  • Responsibility for fixing a bug appears to lie with the product teams (see above), and teams are incentivized (either directly or indirectly through feature deadlines etc.) to get bug reports “off their desk” quickly.
  • Patching shared central code is harder than patching code you own (for various reasons - perhaps compatibility concerns, other priorities from other teams, or perhaps even a heavyweight process to ask for changes in critical code).

What likely happened is that the ISA team decided that dealing with the issue on their side was enough - either because they did not realize that the same issue would affect others, or because dealing with the other team / the library was a pain, or for some other unknown reason. Microsoft’s bug-fixing process incentivized “shallow” fixes, so for attackers, finding the ultimate root cause of a vulnerability could expose other vulnerable programs.

This is a classical example of making a locally convenient decision that adversely affects the larger organisation.

From what I heard, Microsoft learned from this event and made organisational changes to prevent similar mistakes in the future. They introduced a process where all patches are reviewed centrally before they go out to ensure that they don't inadvertently fix a bug in the wrong spot, or disclose a vulnerability elsewhere.

Processes as organisational learning

In what an MBA would call ‘organisational learning’, a process was created out of the experience with a previous failure in order to prevent the mistake from happening again. A process is somewhat similar to organisational scar tissue - the organisation hurt itself, and to prevent such injury in the future, the process is established.

Surprisingly, most organisations establish processes without documenting explicitly what sort of failure and what sort of incident caused the process to be established. This knowledge usually only lives in the heads of individuals who were there, or in the folklore of those who talked to those who were there. After half a decade or so, nobody remembers the original incident - although the process will be alive and kicking.

A process can prevent an organisation from doing something stupid repeatedly - but all too often, the process takes on a life of its own: People start applying the process blindly, and in the hands of an overly-literally-minded person, the process becomes an obstacle to productivity or efficiency. The person in charge of applying and enforcing the process may themselves not know why it is there - just that it is "the process", and that bad things can happen when one doesn't follow it.

My grandfather used to say (I will paraphrase): "a job with responsibility is a job where you don’t simply apply the rules, but need to make judgements about how and where to make exceptions". This quote carries an important truth:

People at all places in an organisation need to be ...
  1. Empowered to make exceptions: After demonstrating sound judgement, people need to feel empowered to make exceptions when the letter of a process gets in the way of the greater good and changing the process would be excessive (for example, in a one-off situation).
  2. Empowered to challenge processes: The reasoning behind a process must be accessible to organisation members, and there needs to be a (relatively pain-free) method to propose changing the process. Since powerlessness is one of the main drivers of occupational burnout, this will help keep individuals and the organisational structure healthy.

Some organisations get the “exception” part right - most big organisations only function because people are regularly willing to bend / twist / ignore processes. Very, very few organisations get the “challenge” part right -- making sure that every employee knows and understands that processes are in the service of the company, and that improvements to processes are welcome.

I think that the failure to get the "challenge" part right frequently arises from a "lack of institutional memory". When organisations fail to keep track of why a process was created, all sorts of harmful side-effects arise:
  1. Nobody can meaningfully judge the spirit of the process - what was it designed to prevent?
  2. Making an exception to the process is riskier - if you do not know what it was designed to prevent, how can you know that in this particular case that risk does not apply?
  3. Amending the process becomes riskier. (Same reason as above.)
  4. Challenging the process cannot happen in a decentralized / bottom-up fashion: It is often the most junior employees who have the freshest eyes for obstructive processes - but since they do not know the history of why the process exists, they often can't effectively propose a change, as they don't know the organisation well enough to rule out unwanted side-effects. This directly sabotages decentralised, bottom-up improvements of workflows.

What is a healthy way to deal with processes?
  1. Realize that they are a form of “organisational memory”: They are often formed as a reaction to some unpleasant event - with the intent of preventing this event from repeating. It is also important to realize that unchecked and unchallenged processes can become organisational “scar tissue” - more hindrance than help.
  2. Keep track of the exact motivation for creating each process -- the “why”. This will involve writing half a page or more, and checking with others involved in the creation of the process that the description is accurate and understandable.
  3. The motivations behind the process should be accessible to everybody affected by it.
  4. Everybody should know that company processes are supposed to support, not hinder, getting work done. Everybody should feel empowered to suggest changes in a process - ideally while addressing why these changes will not lead to a repeat of the problem the process was designed to prevent.
  5. People should be empowered to deviate from the process or ignore it - but frequent or even infrequent-but-recurring exceptions are a red flag that the process needs to be improved. Don't accumulate "legacy process" and "organisational debt" through the mechanism of exception-granting.
  6. Everybody should be aware that keeping processes functional and lean is crucial to keeping the organisation healthy. Even if a process is unreasonable and obstructive, most people instinctively try to accept it - but the first instinct should ideally be to change it for the better. Constructively challenging a broken process is a service to the organisation, not an attack.
  7. It may be sensible to treat processes a bit like code - complete with ownership of the relevant process, version control, and handover of process ownership when people change jobs. Amendments to processes can then be submitted as text, reviewed by the process owner, discussed, and eventually approved - much like a patch or removal of dead code.

Keeping an organisation healthy is hard. The most crucial ingredient to keeping it healthy, though, is that the members of the organisation care to keep it healthy. It is therefore absolutely critical to encourage fixing the organisation when something is broken - and not to discourage people to the point where they just "blindly follow the process".

Wednesday, January 27, 2016

An attempt at fixing Wassenaar

Last year in May, I wrote extensively about the many ways in which the 2013 "intrusion software" amendments to the Wassenaar Arrangement were broken and downright dangerous to all efforts at securing the global IT infrastructure. Since then, the debate has heated up from all sides -- extending to a hearing in front of the US Congress (which was pretty unanimous in condemning these amendments), but also including voices such as James Bamford arguing for these controls in an op-ed. The landscape of the discussion is too complex now to be summarized here (the interested reader can find a partial survey of recent developments here).

Common ground between the different sides of the discussion is not large, but the one thing that almost everybody agrees on is "it is bad when despotic regimes that couldn't otherwise get advanced surveillance software purchase sophisticated surveillance software from abroad". How to prevent this is up for discussion, and it is unclear whether export control (and specifically the Wassenaar Arrangement) is the right tool for the task.

To find out whether the export control language can be made to work, my colleagues Mara Tam and Vincenzo Iozzo and I have worked jointly to come up with an amendment to the language of the Wassenaar Arrangement that would satisfy the following criteria:

  • Make sure that click-and-play surveillance frameworks such as the ones marketed by HackingTeam or Gamma are caught and controlled.
  • Make sure that no technology that is required for defending networks (including bugs, proof-of-concept exploits, network scanners etc.) is caught and controlled.
In order to achieve this, we had to depart from the "traditional" Wassenaar language (which is focused on performance metrics and technical properties) and place much greater emphasis on "intent" and especially "informed consent by the user". We draw the line between good and bad at whether the design intent of the software in question is for it to be used against people who did not consent.

As of today, we are circulating our draft more widely. We are not 100% sure that our language achieves what we want to achieve, and we are not even sure whether what we want to achieve can be achieved within the language of export control -- but we have made a very thorough effort at testing our language against all scenarios we could come up with, and it worked well.

We are hoping that by circulating our proposal we can somewhat de-polarize the discussion and attempt to find a middle ground that everybody can be happy with -- or, failing that, to show that even with a lot of effort the 2013 amendments may end up being unfixable.

Anyhow, if you are interested in our document, you can download it here. As we get more feedback, the document will be updated and replaced with newer versions.

Tuesday, December 22, 2015

Open-Source BinNavi ... and fREedom

One of the cool things that my former zynamics colleagues (now at Google) did was the open-sourcing of BinNavi - a tool that I used to blog about quite frequently in the old days (here for example when it came to debugging old ScreenOS devices, or here for much more - kernel debugging, REIL etc.).

BinNavi is a GUI / IDE for performing multi-user reverse engineering, debugging, and code analysis. BinNavi allows the interactive exploration and annotation of disassemblies, displayed as browsable, clickable, and searchable graphs - based on a disassembly read from a PostgreSQL database, which can, in theory, be written by any other engine.

Writing UIs is hard work, and while there are many very impressive open-source reverse engineering tools around (Radare comes to mind first, but there are many others), the UI is often not very pretty - or convenient. My hope is that BinNavi can become the "default UI" for a plethora of open-source reverse engineering tools, and grow to realize its full potential as "the open-source reverse engineering IDE".

One of the biggest obstacles to BinNavi becoming more widely adopted is the fact that IDA is the only "data source" for BinNavi - i.e. while BinNavi is FOSS, somebody who wishes to start reverse engineering still needs IDA to fill the Postgres database with disassembly.

To remedy this situation, Dave Aitel put up a contest: anybody who either builds a Capstone-to-BinNavi-SQL bridge or adds decompilation as a feature to BinNavi gets free tickets to INFILTRATE 2016.
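To give a rough idea of what the disassembly side of such a bridge involves, here is a minimal sketch, not a working exporter: the Capstone calls are the real API, but the flat bn_instructions table and the connection parameters are placeholders of mine - BinNavi's actual schema stores modules, functions, basic blocks, and edges rather than a flat instruction list.

```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_32
import psycopg2  # assumes a running PostgreSQL instance with a BinNavi-style database

def export_code(code: bytes, base_addr: int, conn):
    """Disassemble a byte blob with Capstone and push rows into Postgres.
    The table layout below is a stand-in for BinNavi's real (more involved) schema."""
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    cur = conn.cursor()
    for insn in md.disasm(code, base_addr):
        cur.execute(
            "INSERT INTO bn_instructions (address, mnemonic, op_str) VALUES (%s, %s, %s)",
            (insn.address, insn.mnemonic, insn.op_str),
        )
    conn.commit()

conn = psycopg2.connect(dbname="binnavi", user="binnavi")
export_code(b"\x55\x89\xe5\x5d\xc3", 0x401000, conn)  # push ebp; mov ebp, esp; pop ebp; ret
```

Most of the real work in such a bridge is, of course, in reconstructing functions and CFGs rather than in decoding individual instructions.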

Last week Chris Eagle published fREedom, a Python-based tool to disassemble x86 and x86_64 programs in the form of PE32, PE32+, and ELF files. This is pretty awesome - because it means that BinNavi moves much closer to being usable without any non-free tools.

In this blog post, I will post some first impressions, observations, and screenshots of fREedom in action.

My first test file is putty.exe (91b21fffe934d856c43e35a388c78fccce7471ea) - a relatively small Win32 PE file, with roughly 1800 functions when disassembled in IDA.

Let's look at the first function:
Left: IDA's disassembly output. Right: fREedom's disassembly output
So disassembly, CFG building etc. have worked nicely. Multi-user commenting works as expected, as does translation to REIL. Callgraph browsing works, too:

The great thing about having fREedom to start from is that further improvements can be incremental and layered - people have something good to work from now :-) So what is missing / needs to come next? 
  1. fREedom: Function entry point recognition is still relatively poor - out of the ~1800 functions that IDA recognizes in putty, only 430 or so are found. This seems like an excellent target for one of those classical "using Python and some machine learning to do XYZ" blog posts (a minimal sketch of the idea follows after this list).
  2. fREedom: The CFG reconstruction and disassembly need to be put through their paces on bigger and harder executables.
  3. BinNavi: Stack frame information should be reconstructed - not by fREedom, but within BinNavi (and via REIL). This will require digging into (and documenting) the powerful-but-obscure type system design.
  4. BinNavi: There has been some bitrot in many areas of BinNavi since 2011 - platforms change, systems change, and quite a few areas are somewhat broken or need updating (for example debugging on x64 etc.). Time to brush off the dust :-)
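For point 1, the blog-post-sized approach I have in mind would treat "is this address a function entry point?" as binary classification over small byte windows. The sketch below is purely illustrative and not part of fREedom; the file names, the feature choice, and the idea of using IDA's output as training labels are all assumptions of mine:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

WINDOW = 8  # bytes of context starting at each candidate address

def featurize(blob: bytes, offsets):
    """One feature vector per candidate offset: the raw byte values of a small window."""
    return np.array(
        [[blob[o + i] if o + i < len(blob) else 0 for i in range(WINDOW)] for o in offsets],
        dtype=np.float32,
    )

# Hypothetical training data: the raw .text section of a binary, candidate offsets,
# and labels derived from an IDA export (1 = IDA marks this offset as a function entry).
blob = open("putty_text_section.bin", "rb").read()
offsets = np.load("candidate_offsets.npy")
labels = np.load("ida_entry_labels.npy")

clf = LogisticRegression(max_iter=1000)
clf.fit(featurize(blob, offsets), labels)

# Score every offset of an unseen section; a disassembler could then seed its
# recursive descent from the offsets whose score exceeds some threshold.
new_blob = open("other_text_section.bin", "rb").read()
new_offsets = list(range(len(new_blob) - WINDOW))
scores = clf.predict_proba(featurize(new_blob, new_offsets))[:, 1]
entry_candidates = [o for o, s in zip(new_offsets, scores) if s > 0.9]
```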
Personally, I am both super happy and pretty psyched about fREedom + BinNavi, and I hope that the two can be fully integrated so that BinNavi always has fREedom as default disassembly backend.

Wednesday, December 16, 2015

A decisionmaker's guide to buying security appliances and gateways

With the prevalence of targeted "APT-style" attacks and the business risks of data breaches reaching the board level, the market for "security appliances" is as hot as it has ever been. Many organisations feel the need to beef up their security - and vendors of security appliances offer a plethora of content-inspection / email-security / anti-APT appliances, along with glossy marketing brochures full of impressive-sounding claims.

Decisionmakers often compare the offerings on criteria such as easy integration with existing systems, manageability, false-positive-rate etc. Unfortunately, they often don't have enough data to answer the question "will installing this appliance make my network more or less secure?".

Most security appliances are Linux-based, and use a rather large number of open-source libraries to parse the untrusted data stream which they are inspecting. These libraries, along with the proprietary code by the vendor, form the "attack surface" of the appliance, i.e. the code that is exposed to an outside attacker looking to attack the appliance. All security appliances require a privileged position on the network - a position where all or most incoming and outgoing traffic can be seen. This means that vulnerabilities within security appliances give an attacker a particularly privileged position - and implies that the security of the appliance itself is rather important.

Installing an insecure appliance will make your network less secure instead of safer. If best engineering practices are not followed by the vendor, a mistake in any of the libraries parsing the incoming data will compromise the entire appliance.
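To make "best engineering practices" slightly more concrete: one common mitigation is to never run the parsers inside the main, privileged process, but in a short-lived, resource-limited child that only receives the untrusted bytes and returns a result. The following is a minimal sketch of that idea, assuming a hypothetical ./parse_attachment helper; a real appliance would add seccomp filters, namespaces, or a full sandbox on top of the rlimits shown here:

```python
import resource
import subprocess

def run_parser_sandboxed(untrusted_data: bytes, timeout_s: int = 5):
    """Run the (hypothetical) ./parse_attachment binary on untrusted input in a
    resource-limited child process, so that a memory-corruption bug in a parsing
    library cannot take over the whole appliance."""
    def apply_limits():
        # Executed in the child between fork() and exec().
        resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 * 1024,) * 2)  # 256 MB address space
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s,) * 2)         # cap CPU seconds
        resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))              # few file descriptors

    return subprocess.run(
        ["./parse_attachment"],
        input=untrusted_data,
        capture_output=True,
        timeout=timeout_s,
        preexec_fn=apply_limits,
    )
```

The point of the questions below is to find out whether the vendor does something along these lines for every piece of code that touches incoming data.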

How can you decide whether an appliance is secure or not? Performing an in-depth third-party security assessment of the appliance may be impractical for financial, legal, and organisational reasons.

Five questions to ask the vendor of a security appliance

In the absence of such an assessment, there are a few questions you should ask the vendor prior to making a purchasing decision:

  1. What third-party libraries interact directly with the incoming data, and what are the processes to react to security issues published in these libraries?
  2. Are all these third-party libraries sandboxed in a sandbox that is recognized as industry-standard? The sandbox Google uses in Chrome and Adobe uses in Acrobat Reader is open-source and has undergone a lot of scrutiny, so have the isolation features of KVM and qemu. Are any third-party libraries running outside of a sandbox or an internal virtualization environment? If so, why, and what is the timeline to address this?
  3. How much of the proprietary code which directly interacts with the incoming data runs outside of a sandbox? To what extent has this code been security-reviewed?
  4. Is the vendor willing to provide a hard disk image for a basic assessment by a third-party security consultancy? Misconfigured permissions that allow privilege escalation happen all too often, so basic permissions lockdown should have happened on the appliance.
  5. In the case of a breach in your company, what is the process through which your forensics team can acquire memory images and hard disk images from the appliance?
A vendor that takes their product quality (and hence your data security) seriously will be able to answer these questions, and will be able to confidently state that all third-party parsers and a large fraction of their proprietary code runs sandboxed or virtualized, and that the configuration of the machine has been reasonably locked down - and will be willing to provide evidence for this (for example a disk image or virtual appliance along with permission to inspect).

Why am I qualified to write this?

From 2004 to 2011 I was CEO of a security company called zynamics that was acquired by Google in 2011. Among other things, we used to sell a security appliance that inspected untrusted malware. I know the technical side involved with building such an appliance, and I understand the business needs of both customers and vendors. I also know quite a bit about the process of finding and exploiting vulnerabilities, having worked in that area since 2000.

Our appliance at the time was Debian-based - and the complex processing of incoming malware happened inside either memory-safe languages or inside a locked-down virtualized environment (emulator), inside a reasonably locked-down Linux machine. This does not mean that we never had security issues (we had XSS problems at one point where strings extracted from the malware could be used to inject into the Web UI etc.) - but we made a reasonable effort to adhere to best engineering practices available to keep the box secure. Security problems happen, but mitigating their impact is not rocket science - good, robust, and free software exists that can sandbox code, and the engineering effort to implement such mitigations is not excessive.

Bonus questions for particularly good vendors

If your vendor can answer the 5 questions above in a satisfactory way, they are already head and shoulders above the industry average. If you wish to further encourage the vendor to be proactive about your data security, you can ask the following "bonus questions":
  1. Has the vendor considered moving the Linux on their appliance to GRSec in order to make privilege escalations harder?
  2. Does the vendor publish hashes of the packages they install on the appliance, so that in the case of a forensic investigation it is easy to verify that an attacker has not replaced any of them?
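As an illustration of what that second point buys you in practice: with a published manifest of file hashes, a forensic team can verify the appliance's files with a few lines of scripting. The sketch below assumes a simple "sha256  path" manifest format of my own invention, not any particular vendor's:

```python
import hashlib
import sys

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest_path: str) -> int:
    """Compare every file listed in the vendor manifest ("<sha256>  <path>" per line)
    against what is actually on disk; report anything missing or modified."""
    problems = 0
    with open(manifest_path) as manifest:
        for line in manifest:
            if not line.strip():
                continue
            expected, path = line.split(maxsplit=1)
            path = path.strip()
            try:
                actual = sha256_of(path)
            except OSError:
                print(f"MISSING  {path}")
                problems += 1
                continue
            if actual != expected:
                print(f"MODIFIED {path}")
                problems += 1
    return problems

if __name__ == "__main__":
    sys.exit(1 if verify(sys.argv[1]) else 0)
```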