For one, no one is seriously contemplating a LIDAR-only system; the question is between camera+LIDAR and camera-only.
> Lidar just fundamentally can’t read signs, traffic lights or road markings in a reliable way.
Actually, basically every meaningful LIDAR on the market gives an "intensity" value for each return, so in surprisingly many cases you could get this kind of imaging behavior from LIDAR, so long as the point density is sufficient for the features you wish to capture (and point density, particularly in terms of points/sec/$, continues to improve at a pretty good rate). A lot of the features that make road signage visible to drivers (e.g. reflective lettering on signs, cat's eye reflectors, etc.) also produce good contrast in LIDAR intensity values.
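As a rough sketch of the idea, here is a minimal way you might pull retroreflective features out of the intensity channel; the array layout, normalization, and threshold are assumptions for the example, not any particular sensor's format:

    import numpy as np

    def retroreflective_points(points: np.ndarray, intensity_threshold: float = 0.8) -> np.ndarray:
        # points is assumed to be an (N, 4) array of x, y, z, intensity,
        # with intensity already normalized to [0, 1]; real sensors differ
        # in how they scale and calibrate this channel.
        return points[points[:, 3] >= intensity_threshold]

    # Toy cloud: two dim asphalt returns and one bright return from a sign face.
    cloud = np.array([
        [10.0,  0.5, 0.1, 0.12],
        [12.0, -0.3, 0.1, 0.15],
        [25.0,  2.0, 2.5, 0.95],
    ])
    print(retroreflective_points(cloud))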
It's like having 2 pilots instead of 1. If one pilot unexpectedly fails (has a heart attack mid-flight), you still have the other pilot. Some errors between the 2 pilots are correlated, of course, but many of them aren't. So the chance of an at-fault crash goes from p and approaches p^2 in the best case. That's an unintuitively large improvement. Many laypeople's gut instinct would be more like a p -> p/2 improvement from having 2 pilots (or 2 data streams in the case of camera+LIDAR).
In the camera+LIDAR case, you conceptually require AND(x.ok for all x) before you accelerate. If only one of those systems says there's a white truck in front of you, then you hit the brakes, instead of requiring both of them to flag it. False negatives are what you're trying to avoid, because the confusion matrix shouldn't be weighted equally given the catastrophic downside of a crash. That's where two somewhat independent data streams become so powerful at reducing crashes: you really benefit from those ~uncorrelated errors.
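As a toy sketch of that asymmetric gating (the names and structure here are invented for illustration, not any production stack):

    from dataclasses import dataclass

    @dataclass
    class SensorReport:
        name: str                 # e.g. "camera" or "lidar"
        obstacle_detected: bool   # does this stream think the path is blocked?

    def decide(reports: list[SensorReport]) -> str:
        # Conservative fusion: brake if ANY stream flags an obstacle,
        # accelerate only if ALL streams report clear (AND(x.ok for all x)).
        # A false negative from one stream is caught by the other as long
        # as their errors are (mostly) uncorrelated.
        if any(r.obstacle_detected for r in reports):
            return "brake"
        return "accelerate"

    print(decide([SensorReport("camera", False), SensorReport("lidar", True)]))  # -> brake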
"In the camera+LIDAR case, you conceptually require AND(x.ok for all x) before you accelerate."
This can be learnt by the model. If vision were 100% correct, the model would learn to ignore LIDAR, so the worst-case scenario is that LIDAR is extra cost for zero benefit.
This is not going to be true for a very long time, at least so long as one's definition of "vision" is something like "low-cost passive planar high-resolution imaging sensors sensitive to the visual and IR spectrum" (I include "low-cost" on the basis that while SWIR, MWIR, and LWIR sensors do provide useful capabilities for self-driving applications, they are often at least as expensive as LIDARs, if not much more so). Camera sensors have gotten quite good, but they are still fundamentally much less capable than human eyes plus the visual cortex in terms of useful dynamic range, motion sensitivity, and depth cues - and human eyes regularly encounter driving conditions which interfere with or prohibit safe driving (e.g. mist/fog, heavy rain/snow, blowing sand/dust, low-angle sunlight at sunrise, sunset, or in winter). One of the best features of LIDAR is that it is either immune or much less sensitive to these phenomena at the ranges we care about for driving.
Of course, LIDAR is not without its own failings, and the ideal system really is one that combines cameras, LIDARs, and RADARs. The catch is that building automotive RADAR with sufficient spatial resolution to reliably discriminate between stationary obstacles (e.g. a car stalled ahead) and nearby clutter (e.g. a bridge above the road) remains something of an unsolved problem.
The worst case scenario is that LIDAR is a rapidly falling extra cost for zero benefit? Sounds like it's a good idea to invest in cheap LIDAR just in case the worst case doesn't happen. Even better, you can get a head start by investing in the solution early and abandoning it when it becomes obsolete.
By the way, Tesla engineers secretly trained their vision systems using LIDAR data because that's how you get training data. When Elon Musk found out, he fired them.
Finally, your premise is nonsensical. Using end-to-end learning for self-driving sounds batshit crazy to me. Traffic rules are very rigid and differ depending on the location. Tesla's self-driving solution gets you ticketed for traffic violations in China. Machine learning is generally used to "parse" the sensor output into a machine representation, and then classical algorithms do most of the work.
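A minimal sketch of that split, with learned perception output feeding hand-written, region-specific rules; the types, rules, and numbers are made up for illustration:

    from dataclasses import dataclass

    @dataclass
    class Detection:
        kind: str    # e.g. "traffic_light", "speed_limit_sign"
        value: str   # e.g. "red", "40"

    # Made-up per-region defaults, purely to show where such rules would live.
    DEFAULT_LIMIT_KPH = {"CN": 50, "US": 55}

    def apply_traffic_rules(detections: list[Detection], region: str) -> dict:
        # Classical, auditable logic layered on top of the learned perception
        # output, so jurisdiction-specific rules aren't baked into a model.
        commands = {"stop": False, "max_speed_kph": DEFAULT_LIMIT_KPH.get(region, 50)}
        for d in detections:
            if d.kind == "traffic_light" and d.value == "red":
                commands["stop"] = True
            elif d.kind == "speed_limit_sign":
                commands["max_speed_kph"] = int(d.value)
        return commands

    # In practice a learned model would produce this list from camera/LIDAR frames.
    scene = [Detection("traffic_light", "red"), Detection("speed_limit_sign", "40")]
    print(apply_traffic_rules(scene, region="CN"))  # {'stop': True, 'max_speed_kph': 40}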
The rationale for being against LIDAR seems to be "Elon Musk said LIDAR is bad" and is not based on any deficiency in LIDAR technology.
If you're on a desert island and you have 2 watches instead of 1, the probability of failure (defined as "don't know the time") within T years goes from p to p^2 + epsilon (where epsilon encapsulates things like correlated manufacturing defects).
So in a way, yes.
The main difference is that "don't know the time" is a trivial consequence, but "crash into a white truck at 70mph" is non-trivial.
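To put rough numbers on the p^2 + epsilon point (both figures here are invented for the example):

    # Suppose each watch independently fails with p = 0.05 over the period T,
    # and correlated defects contribute an assumed epsilon = 0.001.
    p = 0.05
    epsilon = 0.001
    print(p)               # one watch:   0.05
    print(p**2 + epsilon)  # two watches: 0.0035, far better than the p/2 = 0.025 intuition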
It's different because the challenge with self-driving is not to know the exact time. You win simply by noticing the discrepancy and stopping.
Imagine if the watch simply told you whether it is safe to jump into the pool (depending on the time, it may or may not have water). If the watches conflict, you still win by not jumping.
I was responding to the parent who said if you had to make a choice between lidar and vision, you'd pick lidar.
I know there are theoretical and semi-practical ways of reading those indicators with features that are correlated with the visual data; for example, thermoplastic line markings create a small bump that a sufficiently advanced lidar can detect. However, while I'm not a lidar expert, I don't believe using a completely different physical mechanism to read that data will be reliable. It will inevitably lead to situations where a human detects something that a lidar doesn't, and vice versa, simply due to fundamental differences in how the two mechanisms work.
For example, you could imagine a situation where the white thermoplastic lane-divider markings on a road have been masked over with black paint and new lane markings have been painted on - but lidar will still detect the bump as a stronger signal than the new paint markings.
Ideally, while humans and self-driving cars coexist on the same roads, we need to do our best to keep the sensors' interpretation of the conditions as close as possible to how a human would interpret them. Where human driving is no longer a concern, lidar could potentially be a better option for the primary sensor.
> For example, you could imagine a situation where the white thermoplastic lane-divider markings on a road have been masked over with black paint and new lane markings have been painted on - but lidar will still detect the bump as a stronger signal than the new paint markings.
Conflicting lane markings due to road work/changes are already a major problem for visual sensors and human drivers, and something that fairly regularly confuses ADAS implementations. Any useful self-driving system will already have to consider the totality of the situation (apparent lane markings, road geometry, other cars, etc.) to decide what "lane" to follow. Arguably a "geometry-first", LIDAR-only approach would be more robust to this sort of visual confusion.
Everyone is missing the point, including Karpathy, which is the most surprising because he is supposed to be one of the smart ones.
The focus shouldn't be on which sensor to use. If you are going to use humans as examples, just take the time to think about how a human drives. We can drive with one eye. We can drive with a screen instead of a windshield. We can drive with a wireframe representation of the world. We also use audio signals quite a bit when driving.
The way to build a self-driving suite is to start with the software that builds your representation of the world first. Then any sensor you add in is a fairly trivial problem of sensor fusion + Kalman filtering. That way, as certain tech gets cheaper and better (or more expensive and worse), you can easily swap in whatever you need to achieve a given degree of accuracy.
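A minimal sketch of what that looks like, here as a toy 1-D Kalman filter tracking distance to the object ahead; any sensor that can hand it a (measurement, variance) pair plugs into the same update step, and all the noise figures below are invented for illustration:

    class DistanceKF:
        # Toy scalar Kalman filter over "distance to the object ahead" (metres).
        def __init__(self, initial_estimate: float, initial_variance: float):
            self.x = initial_estimate   # state estimate
            self.P = initial_variance   # state uncertainty

        def predict(self, ego_displacement: float, process_variance: float = 0.5):
            # Constant-position model for the lead object, corrected by how far we moved.
            self.x -= ego_displacement
            self.P += process_variance

        def update(self, measurement: float, measurement_variance: float):
            # Standard scalar Kalman update: weight each source by its relative uncertainty.
            K = self.P / (self.P + measurement_variance)
            self.x += K * (measurement - self.x)
            self.P *= (1.0 - K)

    kf = DistanceKF(initial_estimate=50.0, initial_variance=25.0)
    kf.predict(ego_displacement=1.0)
    kf.update(measurement=48.7, measurement_variance=0.05)  # a LIDAR-like, low-noise range
    kf.update(measurement=47.9, measurement_variance=4.0)   # a camera-derived, noisier range
    print(round(kf.x, 2), round(kf.P, 3))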
> ...just take the time to think how a human drives...
We truly have no understanding of how the human brain really models the world around us and reasons over motion, and frankly anyone claiming to is lying and trying to sell something. "But humans can do X with just Y and Z..." is a very seductive idea, but the reality is "humans can do X with just Y, Z, and an extremely complex and almost entirely unknown brain" and thus trying to do X with just Y and Z is basically a fool's errand.
> ...builds your representation of the world first...
So far, I would say that one of the very few representations that can be meaningfully decoupled from the sensors in use is world geometry, and even that is a very weak decoupling because the ways you performantly represent geometry are deeply coupled with the capabilities of your sensors (e.g. LIDAR gives you relatively sparse points with limited spatial consistency, cameras give you dense points with higher spatial consistency, RADAR gives you very sparse targets with velocity). Beyond that, the capabilities of your sensors really define how you represent the world.
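A rough sketch of why that coupling is hard to escape; these record types are invented to make the point, not anyone's actual interface:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LidarReturn:
        x: float
        y: float
        z: float
        intensity: float            # sparse points, but measured range and reflectivity

    @dataclass
    class CameraPixelRay:
        u: int
        v: int
        depth: Optional[float]      # dense and spatially consistent, but depth is inferred

    @dataclass
    class RadarTarget:
        range_m: float
        azimuth_rad: float
        radial_velocity_mps: float  # very sparse, but velocity comes for free

Any "sensor-agnostic" geometry type has to either union all of these fields or throw away exactly the properties each sensor is best at.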
The alternative is that you do not "represent" the world but instead have that representation emerge implicitly inside some huge neural net model. But those models and their training end up even more tightly coupled to the type of data and capabilities of your sensors and are basically impossible to move to new sensor types without significant retraining.
> Then any sensor you add in is a fairly trivial problem of sensor fusion + Kalman filtering
"Sensor fusion" means everything and nothing; there are subjects where "sensor fusion" is practically solved (e.g. IMU/AHRS/INS accelerometer+gyro+magnetometer fusion is basically accepted as solved with EKF) and there are other areas where every "fusion" of multiple sensors is entirely bespoke.