Learning to See in the Dark (2018) (github.com/cchen156)
645 points by ksaxena on April 17, 2020 | 170 comments


As a photographer, the comparison to "raw" results without color balance or noise removal seems somewhat deceptive. The effects visible in the video seem easy to quickly replicate with existing techniques, such as the "surface blur" filter that averages out pixel values in areas with similar color.

This happens at the expense of detail in low-contrast areas, producing a plastic-like appearance of human skin and hair, and making low-contrast text unintelligible, which is why it's generally not done by default.


The comparison is fair because it tries to automate expertise.

I'm sure you know exactly how much of which filter to apply for similar results. Laymen like ourselves will need a lot more trial and error. Their contribution here is to provide a push-button, automated mechanism.

I would have probably also tried something simple and given up due to the noise. So this is definitely interesting.


This is completely untrue.

What you are describing is usually called automatic tone mapping. This is basically noise reduction, and possibly color normalization, applied after brightening a dark image. Showing their black image as the starting point is silly, because JPEG compression will make a mess of the remaining information. What they should show is the raw image brightened by a straight multiplier, to show the noisy version you would get from trying to increase brightness in a trivial way.


What JPG? They are using raw data.


The image on GitHub is a JPEG made from RAW. Since a RAW file has more dynamic range and contains a lot more information than a JPEG, you can take that photo into an editor and crank up the brightness. You will get a noisy image, but it will be a lot brighter and will probably resemble the high-ISO image in the middle. Then, in an editor, you can apply some de-noiser to get results similar to the last one.
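Roughly, that "crank up the brightness" baseline looks like this (untested sketch; assumes rawpy and imageio are installed, a Sony .ARW file, and an arbitrary 300x gain):

    import numpy as np
    import rawpy
    import imageio

    # Demosaic without auto-brightening so the original raw exposure is kept.
    with rawpy.imread("dark_frame.ARW") as raw:
        rgb = raw.postprocess(gamma=(1, 1), no_auto_bright=True, output_bps=16)

    rgb = rgb.astype(np.float32) / 65535.0
    # Straight multiplier: bright but noisy, roughly like the high-ISO image in the middle.
    brightened = np.clip(rgb * 300.0, 0.0, 1.0)
    imageio.imwrite("brightened.png", (brightened * 255).astype(np.uint8))

A de-noiser of your choice would then run over brightened.png.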

So presumably this neural net more or less does it for you.


The PNG is there just to show the results produced by the CNN; if you watch the linked video, they do exactly what you are suggesting and then compare both results.


Their example on their github page uses a jpg that makes it look like they are creating something from nothing.


To me the results seem vastly superior to those sort of simple DSP algorithms. The video shows a comparison with some denoising: https://youtu.be/qWKUFK7MWvg?t=102


Your example strikes me as the kind of thing neural networks are much better at than a fixed filter. You or I could easily identify regions of an image where it's safe vs unsafe to do the surface averaging, and boundaries where we wouldn't want to mix up the averages. (For example, averaging text should be fine, so long as you don't cross the text boundaries.) A CNN should also be able to learn to do this pretty easily.


What you are describing is a class of filters known as edge preserving filters. You can look at bilateral filters and guided filters for examples that have been around for decades at this point.
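Both are a couple of lines to try with OpenCV, for anyone curious (illustrative sketch; the file name and parameter values are placeholders, and the guided filter needs the opencv-contrib build):

    import cv2

    img = cv2.imread("noisy.png")

    # Bilateral filter: weights neighbours by spatial distance AND colour
    # similarity, so smoothing stops at strong edges.
    bilateral = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)

    # Guided filter (cv2.ximgproc): uses a guidance image -- here the input
    # itself -- to decide where smoothing is safe.
    guided = cv2.ximgproc.guidedFilter(guide=img, src=img, radius=8,
                                       eps=0.02 * 255 * 255)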


So we can do a decent job with hand designed filters... Why aren't they in use in the problem the parent describes? Are they not good enough to deal with small text boundaries?

A lot of hand built filters (I see a lot of these in the audio space) have many hand tuned parameters, which work well in certain circumstances, and less well in other circumstances. One of the big advantages of NN systems is the ability to adapt to context more dynamically. The NN filters can generally emulate the hand designed system, and pick out weightings appropriate to the example.


This is effectively noise reduction, which bilateral and guided filters are actually used for. They take the weights of their kernels based on local pixels and statistics. You can also look up other edge preserving filters like BM3D and non-local means.

I don't know what you mean by hand made filters and I don't know why that's a conclusion you jumped to.


>As a photographer, the comparison to "raw" results without color balance or noise removal seems somewhat deceptive.

Huh? At 1:40 in the video that's exactly what they do.


Interestingly, this effect is notably visible in their example image [0]. Notice the distinctly "plasticized" appearance of the book cover, and how the text is not intelligible in the low-contrast areas of the reflection.

[0]: https://raw.githubusercontent.com/cchen156/Learning-to-See-i...


Note (a) and (b) are separate photographs (different angles and everything), and that (c) is based on (a), not (b); comparing the glare between (b) and (c) isn't quite an even comparison.


Oh gosh. Taking one second to think about it, _of course_ (a) and (b) are separate photographs -- that is the entire point of that diagram. Somehow my brain farted right over that when making my previous comment.

Thank you, not only for setting me straight, but also for doing so as kindly as you did.


It would indeed be interesting to see a comparison with, for instance, non-local means on the scaled raw image. The speed is superior in any case, I suspect.
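Something along these lines would be the baseline, I imagine (rough sketch with scikit-image; the file name and parameters are placeholders, and it assumes the scaled raw image has already been saved as an 8-bit PNG):

    import imageio.v3 as iio
    import numpy as np
    from skimage.restoration import denoise_nl_means, estimate_sigma

    scaled = iio.imread("brightened.png").astype(np.float32) / 255.0
    sigma = estimate_sigma(scaled, channel_axis=-1, average_sigmas=True)
    # Filtering strength tied to the estimated noise level.
    denoised = denoise_nl_means(scaled, h=1.15 * sigma, patch_size=5,
                                patch_distance=6, channel_axis=-1,
                                fast_mode=True)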


I always have this complaint too. It's fundamentally a lossy process, in the hand-wavy sense. It's more "impressive" looking, but actually conveys less real detail.


Hi, I'm a developer at NexOptic[0] and we are a company that was deeply inspired by this paper when it was first published. We had a lot of early success when attempting to replicate the results on our own and ended up running with it, and extending it into our own product line under our ALIIS brand of AI powered solutions.

For those curious, our current approach differs in some very significant ways from the authors' implementation, such as performing our denoising and enhancement on a raw Bayer -> raw Bayer basis, with a separate pipeline for tone mapping, white balance, and HDR enhancement. We also explored a fair number of different architectures for the CNN and came to the conclusion that a heavily mixed multi-resolution layering solution produces superior results.

As other commenters have pointed out, the most interesting part of it is really coming to terms with the fact that, as war1025 put it, "The message has an entropy limit, but the message isn't the whole dataset." It is incredible what can be accomplished with even extraordinarily noisy information, as long as one has an extremely "knowledge packed" prior.

If anyone has any questions about our research in this space, please feel free to ask.

[0] https://nexoptic.com/artificialintelligence/


It would be really cool if you could feed the network a photo with flash that it could use for gathering more information, but then recreated a photo without flash from the non-flash raw.

Often flash is not the look people are going for, but they would be okay with the flash firing in order to improve the non-flash photo.


Absolutely! We recently rebranded our AI solutions from ALLIS (Advanced Low Light Imaging Solution) to ALIIS (All Light Intelligent Imaging Solution) specifically because we are beginning to branch out to handle use cases such as this!

As a proof of concept that this task can be tackled directly, a quick search brought up "DeepFlash: Turning a Flash Selfie into a Studio Portrait"[0]

Beyond denoising, we are already running experiments with very promising results on haze, lens flare, and reflection removal; super resolution; region adaptive white balancing; single exposure HDR; and a fair bit more.

One of the other cooler things we are doing is putting together a unified SDK where our algorithms and neural nets will be able to run pretty much anywhere, on any hardware, using transparent backend switching. (e.g. CPU, GPU, TPU, NPU, DSP, other accelerator ASICs, etc..)

[0] https://arxiv.org/abs/1901.04252


Before reading your reply to OP's comment I got to thinking about how the super-resolution process and flash photography might interact (https://news.ycombinator.com/item?id=22905317). I get the impression you left the point I got to a long time ago :)


DeepFlash: Turning a Flash Selfie into a Studio Portrait

https://www.youtube.com/watch?v=enLmReROhc8


The way I mistakenly initially parsed this comment gave rise to a potentially-dumb idea/question:

What would happen if you

- begin capturing video (unsure of fps) on a phone-quality sensor in a near-dark environment

- pulse the phone's flash LED(s) like you're taking a photo

- do super-resolution on the resulting video to extract a photo...

- ...while factoring in the decay in brightness/saturation in consecutive video frames produced by the flash pulse?

I vaguely recall reading somewhere that oversaturated photos have more signal in them and are easier to fix than undersaturated. Hmm.

IIRC super-resolution worked with 30fps source video for better quality; I wonder if 60fps or 120fps source video would produce better brightness decay data, or whether super-resolution could actually help extract more signal out of the decay sequence too.

On the other hand, I'm not sure if super-resolution fundamentally requires largely consistent brightness in order to work as well as it does. :/

Perhaps individual networks could be trained/tuned to specific slices/windows of the brightness gradient. I also wonder if it would be useful to factor the superresolution process into each of the brightness-specific stages or just to do it at the end.


For the most part, our effort has been focused on single exposure image enhancement, however we are beginning to use recurrent models to improve quality when video information is available.

Nonetheless, it's kinda a neat idea, so I tried testing the feasibility of it. I set up a recent flagship phone that claims to have 960fps super-slow-motion video capture next to another phone with a strobe app at 12Hz with a short delay in between pulses.

https://www.dropbox.com/s/ha51ntucl3klkcb/cell_flash_960fps....

There are definitely a few frames where the LED is at an intermediate brightness; however, correctly synchronizing the timing between the flash and the camera may prove difficult.

As for over-saturated images having more signal... although the PSNR calculation may give you a better number, in practice, a region that is over-saturated is just a blob of 1s on the image (assuming float64 pixel values of 0-1) and there is no information there to extract. With a black level near but not at 0, we've found there is often more information hidden in the 'dark noise' than can be discerned by the human eye alone.


Wow, cool, you actually tested it! And an effective test too.

Stepping back and forth throughout the frames (using mpv), the flash clearly enhances several spots of localized brightness where contrast pops out into clear relief.

The effect is clearest at the very bottom of the image which goes from "shadow blob" to "adequately discernible", but I think the area just above that (the 3rd vertical quarter of the image) is most interesting; the detail visible in frames 24-29 (immediately before 00:00:01 / 30.030fps) is excellent, and that's with the flash LED at peak brightness.

Flash synchronization would be effectively impossible to achieve (the camera would need to stream LED status information inside each frame), and even with "LED is on" information available it might provide no net gain: the exact point at which the hardware says "LED is off" will not necessarily correspond to the moment the light actually decays to zero (at 1/960 = 1.0416 milliseconds per frame, the video suggests it takes about 2 frames, or ~2.08 milliseconds, for the light to decay), and that decay will never be the same because the flash sends light outwards into arbitrarily different environments. I can't help but wonder if calibration references for everything from Vantablack to mirrors would be needed... for each camera sensor... and that there would then be the problem of figuring out which reference(s?) to select.

Staring at the video frames some more, two ideas come to mind: 1), analyzing all the frames to identify areas of significant difference in brightness, then 2), for each (perhaps nonrectangular) region of difference, figuring out the "best" source reference for that specific region. As an example reference, I'd generally use frame 13 for most of the image, and frame 44 or so (out of many, many possible candidates) for the bits that, as you say, become float64 1.00 :). Obviously a nontrivial amount of normalization would then be needed.

I'm not aware of how you'd do either of these neurally :) but the idea for (1) came from https://en.wikipedia.org/wiki/Seam_carving (although just basic edge detection may be more correct for this scenario), while the idea for (2) came from https://github.com/google/butteraugli which "estimates the psychovisual similarity of two images"; perhaps there's something out there that can identify "best contrast"? I'm not sure.

Trivial aside: I wondered why mpv kept saying "Inserting rotation filter." and also why the frame numbers appeared sideways. Then I realized the video has rotation metadata in it, presumably so the device doesn't need to do landscape-to-portrait frame buffering at 960fps (heh). I then realized the left-to-right rolling shutter effect I was seeing was actually a bottom-to-top rolling shutter. I... think that's unusual? I'm curious - after Googling then reading (or, more accurately, digging signal out of) https://www.androidauthority.com/real-960fps-super-slow-moti... - was the device an Xperia 1?

(And just to write it down for future reference: --vf 'drawtext=fontcolor=white:fontsize=100:text="%{n}"' adds frame numbers to mpv. Yay.)


Sounds like you have taken this pretty far; do you have any example outputs? The only one I found via your website was a PDF with a low-res image and no context.


Sure, we have a short deck[0] that gives an intro to our noise reduction, and also here is a folder[1] that shows off a calibration target we captured with an actual camera (20ms, f22) in low-light conditions: (original, 100x gain, 100x gain + ALIIS)

We also have some more raw data[2] where there is the original bayer data available as .npy files with 40db analog gain applied, however I think the calibration targets show off what we are able to accomplish more dramatically. Finally, we have a short youtube video[3] that shows off how it works when applied to video.

[0] https://www.dropbox.com/s/0bm4dpxhn35vkhe/ALLIS_Investor_Int...

[1] https://www.dropbox.com/sh/k861saentyq1cs6/AADmO7X_L49nUkEI_...

[2] https://www.dropbox.com/sh/fv8omdf4fbx59m9/AABDnf6sdvv7rtIml...

[3] https://youtu.be/99Cq1bWCmMM


It's surprising how little code [1] is needed to do this. On the other hand I feel this is quite dependent on the specific camera models and might not work on the RAW data downloaded from my phone. Happy to be corrected.

[1] - https://github.com/cchen156/Learning-to-See-in-the-Dark/blob...


It's a huge amount of code, hidden behind the tensorflow import statements. It's common to credit GPUs for the rapid spread of deep learning, but good GPUs were available for quite a few years before deep learning really took off. As someone who wrote *a lot* of OpenCL code, including my own python wrappers, I'm fairly certain this code would be thousands of lines without a computation graph framework library. These frameworks are really amazing pieces of software engineering and deserve some non-trivial fraction of the credit for the rise of deep learning.

If you want to know what the next hot thing in software engineering will be, just pay attention to whatever Jeff Dean is doing.


I don't know that I agree with this first statement, but even if I do, everything is abstracted by import statements even outside ML. You say this is a huge amount of code abstracted, but it wouldn't be difficult to reimplement this in numpy and pandas directly without using tensorflow at all. The code would expand a bit, and you'd have to deal directly with backprop and calculating derivatives but it wouldn't expand things too much. But then you could make the same claim about numpy abstracting the linear algebra, and I could show you that I could extract that and do it without numpy but then it would be the python math library. It's turtles all the way down. My point is, your comment applies to everything.


Yup, I absolutely agree. Almost all big leaps in software engineering and applied computer science come from building a powerful and simple abstraction. Powerful and simple abstractions are surprisingly difficult to get right.


When people say "little code", what they mean, and should say, is "little customization of the tool".

There's also the issue of how hard it is to select the line of code (API call) that does the job (because the API surface is huge), which is a nearly invisible attribute.

Mathematica is famous for being incredibly powerful with low custom coding, but also very hard to find the API call that does the precise thing you need.


We didn't have GPUs that fast until the GTX 580 (or, to some extent, the first Titan). Previous generations were only about 2 to 5x faster, depending on what type of CPU you compare against.

IMHO, credit should always go to Alex Krizhevsky for the rapid spread of deep learning. He showed us it was possible. Even without TensorFlow and PyTorch, we would be fine with Caffe, Torch, MXNet or Julia.


[flagged]


I implemented neural networks before the advent of the good python frameworks. It sucked. And CNNs existed for decades before AlexNet. Honestly, the software and hardware engineers are the real heroes of the deep learning revolution.

By way of analogy, David Heinemeier Hansson didn't invent the webapp or even the MVC design pattern. But Ruby on Rails did change the way webapps were built, and enabled a bunch of stuff that wouldn't have been possible otherwise (or, at least, would've taken longer and been more expensive). Lots of websites were built because of Ruby on Rails that probably wouldn't have been built otherwise, even if people would've kept on doing the web thing regardless. We can say the same thing about lisps, about linux, and about a lot of other software infrastructure.

Any high schooler who's capable of learning python and can afford a gaming desktop can build and train a neural network. That's pretty amazing, and definitely wouldn't be the case without computation graph frameworks.

I've never worked for Google or with Jeff, and I'm not a huge fan of the ad tech industry, although I don't understand why either of those things should matter.


"How can I train the model using my own raw data?

Generally, you just need to subtract the right black level and pack the data in the same way of Sony/Fuji data. If using rawpy, you need to read the black level instead of using 512 in the provided code. The data range may also differ if it is not 14 bits. You need to normalize it to [0,1] for the network input."
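In other words, the repo's pack_raw() becomes something roughly like this for an arbitrary camera (untested sketch; the attribute names are rawpy's, and the 2x2 packing mirrors the repo's Sony path, so it assumes a Bayer sensor):

    import numpy as np
    import rawpy

    def pack_raw_generic(path):
        with rawpy.imread(path) as raw:
            im = raw.raw_image_visible.astype(np.float32)
            black = np.mean(raw.black_level_per_channel)  # instead of the hard-coded 512
            white = raw.white_level                       # instead of assuming 14 bits (16383)
        im = np.maximum(im - black, 0) / (white - black)  # normalize to [0, 1]
        im = np.expand_dims(im, axis=2)
        H, W = im.shape[:2]
        # Pack the 2x2 Bayer pattern into 4 channels, as the Sony pipeline does.
        return np.concatenate((im[0:H:2, 0:W:2, :],
                               im[0:H:2, 1:W:2, :],
                               im[1:H:2, 0:W:2, :],
                               im[1:H:2, 1:W:2, :]), axis=2)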

The Sony and Fuji training code looks mostly the same - they haven't bothered to pull out common code and re-use.


This could actually be shortened, and maybe simplified, significantly. For example, there is a lot of redundancy in the layers, and that could be pulled out into a function per block. This is what I often do with deep networks, as it helps avoid errors in the code and shortens everything significantly, at the potential expense of how quickly it can be grokked initially.
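For example (not the repo's code, just a sketch of that refactor in Keras-style TensorFlow; layer sizes are illustrative):

    import tensorflow as tf

    def conv_block(x, filters):
        # Two 3x3 convolutions with leaky ReLU: the unit the U-Net repeats.
        x = tf.keras.layers.Conv2D(filters, 3, padding="same",
                                   activation=tf.nn.leaky_relu)(x)
        x = tf.keras.layers.Conv2D(filters, 3, padding="same",
                                   activation=tf.nn.leaky_relu)(x)
        return x

    inputs = tf.keras.Input(shape=(None, None, 4))  # packed 4-channel Bayer input
    x = conv_block(inputs, 32)
    x = tf.keras.layers.MaxPool2D()(x)
    x = conv_block(x, 64)
    # ... further down/up blocks follow the same pattern ...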

But, many DNN concepts (and ML concepts themselves) can be described with a few lines of pseudocode. CNNs, RNNs, etc. can all be described in a few lines.

It's really quite amazing: most of the work goes into first creating the network from theory, then training and tuning it until you get good results.


Do you have an iPhone 6s? https://youtu.be/qWKUFK7MWvg?t=85


A "Two Minute Papers" on this project from 2018: https://www.youtube.com/watch?v=bcZFQ3f26pA


The problem with techniques like this is that they fundamentally amount to ‘making a plausible guess as to what the image would look like’, since essentially they can’t extract information that is simply not there. There is a Shannon entropy limit here.

Machine learning is really machine-enhanced educated-guesswork, which has its place but also has its limits.


It's more than 'good enough' for most purposes. Matching other shots for Hollywood quality, probably not. For surveillance or the like it's fine. The things it's guessing poorly about are textures or colors.

Being able to read the title on the books in the example photo is great; you could rely on the title for evidentiary purposes, the smaller text probably not so much. So for a security camera it would do poorly at identifying the color of a car, but might well be sufficient to read the license plate.


Reading a license plate seems like precisely the kind of circumstances where spurious 'plausible interpretation' of limited data can cause trouble.

You show in a courtroom a CNN-enhanced low light image of a car and it's there, literally 'clear as day' - the jury will find it pretty compelling. But maybe the data really wasn't there in the original image, and the CNN just filled in some blanks based on previous images of license plates, letters, and just random noise it had seen in the past.

The worry is when these kinds of algorithms get built in to basic image capture processes, so you never even see the raw data, only data that has already been filtered through the inbuilt prejudices of the CNN enhancement suite.

The camera never lies, but now it doesn't have to, because it can convince itself it saw something that wasn't really there...


Of course, but what are the odds that the algorithm just lucked into the correct book title and other cover text? It doesn't have a dictionary or semantic network.

You are right that the raw sensor data should always be preserved. But sticking with the license plate example, you could challenge a picture of a single car with a visible license plate far away in a wooded area, but it would be hard to refute a picture of the same car in a parking lot surrounded by other (non-suspect) vehicles whose presence there at the same time could be independently verified. In other words, if I can show that it accurately read the license plates of 9 other cars, the chances that it got yours wrong go way down.

That's assuming a single photo taken in the dark by an investigator. With a fixed security camera you would have an even larger basis of comparison, with a population of hundreds or thousands of license plates against which to rate it. I predict that before long we'll see preemptive certification for devices warranting the reliability of their image pipeline out to a certain distance at either the manufacturing or installation stage.


Xerox used to replace numbers in documents while copying:

https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...

License plates are an ideal breeding ground for false enhancement owing to their standardised appearance; an ML algo trained on lots of examples might, without due care, learn to treat them as a well-known texture and replace the details.


The pre-emptive certification I mention would be a validation of due care. It doesn't matter how many theoretical arguments you want to throw up against this, once there's sufficient empirical evidence for its reliability (and there will be) it will be accepted as evidence.

Also, y'all need to think more like prosecutors. Say you are dragged into court on the basis of photos showing your car in the dark, and you object that the photo is from the ML 9000 security camera and it might just be imagining your license plate. The police/prosecutors will simply 'borrow' your car, leave it there for a night, and leave the rest up to the jury.

Forensic evidence can be and is regularly abused, but it can also be quite easily validated and it's massively persuasive to juries.


I don’t understand the scenario you’re picturing.

Let’s assume I’m innocent but some neural net has placed my car at the location of a crime.

You’re saying that if I challenge the evidence, the prosecutors will counter that by showing that if my car were there, the neural net would have produced a picture of my car? They don’t need to do that and it adds nothing to their argument. I’m not challenging that the neural net is capable of producing an image of my car.

No, the point is I am placed in the position of having to demonstrate that there exists some other car which, under those lighting conditions, the neural net would mistake for mine. That's a far harder burden of proof for me to meet.

Honestly this is similar to the way fingerprint, DNA and hair sample matches are presented to courtrooms all the time so it isn’t a new problem. As you say, forensics are persuasive.


I personally think juries are a great way to convict innocent people, and that adversarial court systems privilege people who can afford to pay for the best storyteller, so arguments from that direction start out hobbled.


Using these processed images for evidence purposes sounds like precisely what the parent comment is concerned about.


Agreed. It gets to an important point of the purpose of the photo. Photo as a record versus photo as an aesthetic piece. This hurts the photo as a record but improves the photo as an aesthetic piece. This would be a bad addition to a security camera, but perhaps a good addition to an instagram pipeline. There are plenty of other issues there, like is it good/healthy for stuff like instagram to be diverging away from records, creating unrealistic (or perhaps literally unreal) expectations. We're taking what are slowly inching closer to imaginative art pieces and presenting them as records.


Photographs have never been faithful records. The map is never the territory. There are always judgement calls. The whole concept of JPEG is to throw away information.


Nothing's a perfect record, but a thing's purpose can be to serve as a record. Like if you stick seismometer readings into an academic paper, everyone will know there's noise, but that's okay. You did your best. But there are some changes we can agree are bad there. Now, if you stuck your seismometer into an autotuner and sync'd it with led zeppelin, that's okay. Just not if you're putting that into your paper.

For a long time, photographs were typically used as records. Even when they were art, they were typically records of something. Soon, we'll be typically doing so much more with them and will have to accept that photos can, but frequently won't serve as records.


Counterpoint: The human brain converting a 2D image to a 3D model is educated-guesswork too :)


This hits on an interesting point.

There is an entropy limit to the message, but the message isn't actually the only data.

One thing humans are great at is integrating existing knowledge into a messy situation and intuiting more than is available just from the raw message.

I.e. The message has an entropy limit, but the message isn't the whole dataset.


Yes, and that's what this "lossy" conversion to daytime does as well: it incorporates prior knowledge, but that prior knowledge is about how images of real-world things look at night versus during the day.


This doesn't quite match Information Theory.

First of all, if you were to use the image as a communication channel, how much you could theoretically communicate is exactly the entropy of the messages (by definition), and optimal communication means maximum entropy. Information theory already assumes shared existing knowledge in the form of codes; the codes essentially encode all this knowledge -- the analogue in images would be digits, shapes, etc. What makes them decodable is statistical redundancy: shapes do not occur arbitrarily (i.e. not every possible shape occurs, at least not with equal probability) and exhibit dependence between different pixels of a shape, and even between shapes elsewhere in the image. Again, the dependence is (generally) based on the statistics of the distribution of all possible images -- it essentially encodes all prior knowledge.

This redundancy allows reconstruction of losses in one part of the image from data elsewhere. It's the same principle used in error-correcting codes, except codes are designed, while shapes are mostly natural (except things like alphabets, which are designed and indeed follow some principles of codes). But because they're not designed, there's no guarantee of a unique/reliable decoding (i.e. you can get a distribution).

I think that's a significant issue, because if your estimate doesn't match reality it could have important consequences for the use of the image: perhaps text goes from 'X is good' to 'X sucks'.

In this case a few things could be done:

1) Have some kind of watermark indicating the image was enhanced by a neural network, and possibly contains false information;

2) Have some kind of indication of reliability of the image: it should encode the multimodality/confidence of the decoding distribution -- how many different solutions does this have. If it is more or less unique, it would show as high confidence; otherwise it would show a low confidence indicator;

3) Instead of trying to convey uncertainty, the system could simply give up in cases where there is too much uncertainty, i.e. leaving the image dark. This could be done locally or globally, although locally it introduces a lighting consistency problem.

---

There's another important observation w.r.t. Information Theory/Statistics: it essentially assumes unbounded computational power (since estimating this distribution could require analyzing arbitrarily large datasets). Of course this isn't true in reality. For example, the entropy of an encrypted version of a redundant text is exactly the entropy of the plaintext string plus the entropy of the key (given an encryption ensemble or encryption prior) -- the process of finding the key (e.g. through brute force) doesn't concern statistics. However, with reasonable computational power, a (properly) encrypted stream is indistinguishable from a random string, hence it would appear to have maximum entropy. So there are further computational limits besides the statistical limits. In the case of encryption the function is designed to be intractable, while in natural images the correlations are of a simpler and hopefully more tractable nature (although I'm sure that's not always the case).


The courts & hopefully the jurors should be aware that humans are fallible & capable of lying, but may have a hard time believing that cameras can lie as well.


The courts are terrible at that.

Eyewitness testimony is awful but given gold status.


Two 2D Images. The brain doesn't have to guess much when it can use the parallax effect created by both eyes. That's why quite a few animals have two eyes. And I believe that's why we have two ears, too.


Yes, therefore it is better to turn on the lights!


And you can see some of its biases in the results, yeah. Look at the thick book's spine on the right, and compare the box around the title - "our result" has pretty significant staircasing instead of being a slightly-off-vertical line.


I'm also not a fan of how the only part actually readable in the (a) original, which is part of the title in the front book, becomes completely whitewashed in (c). Where the model actually had the most information, it completely removed it in the result...


But wouldn't that level of glare be what would actually happen if you took the original image, and shot it in the amount of light required to make it look like the output image?

It's not trying to make things readable; it's trying to make things look like there was more light in the room when they were shot. In rooms with high lighting, some objects have glare. That's "correct"—it's what would appear in the training data.


> the model actually had [...] information

Wait, did it? Isn't the middle photo being shown for comparison only, rather than as an input?


I guess I might have misinterpreted the goal. If the goal was to make the image look like it was daylight, then maybe whitewashing that light reflection was the correct choice. If the goal was to "see in the dark", then it seems like a very bad choice.

EDIT: Finally got the paper to load via the helpful wayback machine link provided in another thread. It looks like the goal was to simulate a long exposure with a short exposure. So whitewashing of "bright" areas in the original might be expected.


Regardless of the goal, there's no way to get a more readable result if the data just isn't there. Whitewashing might simply be a result of that absence.


My point was that it whitewashed exactly those areas which had the most information. However, this is in line with their stated goal of mimicking a long exposure. It's not in line with "seeing in the dark", but that's not their goal.


If you insist on the same point, you didn't understand my initial reply. Again, I'm assuming the input is (a), not (b). But maybe you mean the lighter areas have the most information. If so, why? Just because there's more light? More light != more information. It could just be a bunch of noise.


> More light != more information. It could just be a bunch of noise.

You do realize that, in a camera sensor, light is the signal, right? So the more light, the higher the signal-to-noise ratio, which means that yes it does have more information available to extract.

And yes, I quite realize that the input is (a). I'm guessing that in your display you are not seeing that there is a brighter spot in the middle of (a) corresponding to the whitewashed area in (c). Try maxing out your brightness if you're on a phone or laptop and you should see it. I can even make out letter shapes in (a) within this bright spot.


>‘making a plausible guess as to what the image would look like’

People bring this up all the time as a hot take in these areas. It's conditional inference; it's no more disingenuous than linear regression.


It's a general principle of information theory that, "to make inferences, you have to make assumptions".


It's a great result, but it's not perfect. No need for the "perfect" hyperbole in the title.




if it's anything like my old college webserver.. too much traffic/bandwidth consumed



It is back now.


The State of Illinois is out of money again.


You shouldn't be downvoted - with a big recession/depression looming, link rot and many sorts of repositories shutting down are a big issue that will slow down the pace of research.



I was just wondering a couple days ago why the image from my phone is so grainy, while my eyes+brain can see everything clear in the dark (it wasn't completely dark, of course).

This seems to replicate the post-processing we do in our brain (which is also a giant neural network). I wonder if the process is similar?


That's not really a good analogy. You have a totally different sensor chemistry in your eyes, as well as different processing.


And while brains are the original neural networks, they don't resemble what's going on with ML DNNs at all.


Low-level visual processing in the cortex shares striking similarities with CNNs, actually. Gabor filters, etc.


Very small numbers of photons (1) are required to trigger the rhodopsin cycle. So the primary receptor itself is very, very, VERY sensitive.


To be clear, the parent did not fail to include a citation. The parenthetical note is that rod cells are so sensitive that they react to being struck by a single photon.



Your brain doesn’t make a 2 dimensional image based entirely on photons entering your eye. You generate a complex physical model of your surroundings based only partially on visual input and rely substantially on memory.


Also other senses, including proprioception. In a completely dark environment, you could swear that you see your hand waving in front of your face. That's because your brain actually does know it's there, and it's trying to create a unified model.


Kind of like this well trained CNN is no longer relying entirely on the raw pixel values, but is statistically inferring a brighter image from the baseline.


There's a difference between applying known priors, and making things up based on statistics. Conflating the two isn't helping anyone.


Not to harp on this, but the point is that, as I understand it, both “systems” are using exogenous information to extrapolate more data than is actually present in the source image.

That’s not to say that the same “thing” is happening at the granular level at all.

But this is distinctly different from standard filtering functions, which can only work with entropy already present in the source image. So there’s a neat distinction.

The output from the CNN is essentially an “artist interpretation” of the source image. As such there could be “clarifying details” in the output that were in fact totally invented and not actually present in the source.


The dynamic range of the human eye is better than that of the lens in your phone. https://en.wikipedia.org/wiki/Human_eye#Dynamic_range:

”The human eye can detect a luminance range of 10¹⁴, or one hundred trillion (100,000,000,000,000) (about 46.5 f-stops), from 10⁻⁶ cd/m², or one millionth (0.000001) of a candela per square meter, to 10⁸ cd/m², or one hundred million (100,000,000) candelas per square meter. This range does not include looking at the midday sun (10⁹ cd/m²)[21] or lightning discharge.”


Also, our eyes are better. So far.


I wonder what things would be like when phone cameras are as good as human eyes, the climax of consumer photography?


Pretty cool but seems like there’s a big limitation on this for now

“ The pretrained model probably not work for data from another camera sensor. We do not have support for other camera data. It also does not work for images after camera ISP, i.e., the JPG or PNG data.”

Would be cool to see how they come up with better models that would allow them to overcome the above limitations


I doubt this image is showing the true raw data (a):

https://github.com/cchen156/Learning-to-See-in-the-Dark/blob...

If you take the dark image (a) from that and balance its color, the information that is present in it simply cannot contain the text from the book covers and so on. In fact, it's full of JPEG artifacts despite the image being a PNG. It would be useful if they presented a histogram equalized image of (a).
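For what it's worth, that histogram-equalized view is a couple of lines with scikit-image (sketch; assumes the figure has been saved locally as figure1_a.png):

    import imageio.v3 as iio
    import numpy as np
    from skimage import exposure

    a = iio.imread("figure1_a.png").astype(np.float32) / 255.0
    # Spreads whatever signal is present in (a) across the full [0, 1] range.
    equalized = exposure.equalize_hist(a)
    iio.imwrite("figure1_a_equalized.png",
                (np.clip(equalized, 0, 1) * 255).astype(np.uint8))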



I always wondered if you can "trust" an image that has been basically recreated. Could that kind of image be used as an evidence in court?


Xerox copiers had a bug caused by a failed image (re)construction, which caused it to replace similar (but not identical) parts of an image with other pieces of the image.

http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...?


If you could show a basic level of consistency. Take the correctly rendered title on the book vs the incorrect colors; the odds that it got lucky with the text instead of a different title or a book with random letters on the cover are negligible. But if your evidence revolved around the color of the book the villain stole from the library, not so much.

So if you're planning to do crime, make choices where the evidence relies on spectra rather than geometry. Steal Rothkos rather than Mondrians; baggy coveralls are in, form-fitting ninja wear is out.


It would be good to get a comparison of a brightened version of the sample image compared with the CNN version. Right now the sample image just looks black, but if you scale up the brightness you get an image that looks more like the higher ISO image. That would be a better comparison since it shows what improvements the CNN gives over naive techniques like just bumping up the pixel values.


I wonder if photographic evidence "enhanced" by such a method would be admissible in court?


Perhaps it depends on "which court". The legal systems in some parts of the world would allow it.


Some questions:

- Did they create a special network topology for this problem?

- Does the network need to see the entire image, or only an NxN subblock at a time?

- How did they obtain the training data? Is it possible to take daylight images and automatically turn them into nighttime images somehow?


Take pictures at night with a tripod mounted camera with different exposure brackets?


The problem is the amount of pictures you'd need. It's much easier to use available datasets if you know how to preprocess the data.


Could something like this be done for night vision goggles or is there significant latency?


An interpolation that looks for movement of a few anchor points? I imagine that would entail much less computation and so deliver apparent real-time night vision. Though sudden big movements in the scene would cause blackout regions of about a second?


Interestingly, this is effectively an extreme version of the black & white photo colorization problem. I wonder what the results would be if you just threw some black and white photos into the model.


(2018) and a better title that doesn't say CNN, c'mon


We've added the year above. Submitted title was "CNN converts night images to perfect daylight in – 1s".


Isn't this just some tweaked raw-processing algorithm?


I think I'll faint seeing any more AI progress.

I didn't even think this was possible. Have people ever done this manually before? Like without AI?


Cool! Integrating this into AR glasses would make them almost a must-buy. Turn night into day, see in the dark, etc.!


What would happen if this were paired with license plate recognition? And would it be admissible as evidence?


I believe this should have a (2018) tag.


Yes, the result image and videos are from 2018 https://github.com/cchen156/Learning-to-See-in-the-Dark/blob...


Just wondering why for something brand new they'd use python 2.7??


Probably because it was already installed on their machines. Also what benefit would this project get by using a newer version?


Longevity.


Funny. I'm walking down the corridor in almost total darkness trying to get my son to sleep. I get bored, reach for my phone with my free hand, open HN, and stumble upon this title. Totally unrelated to its content, but I had a (quiet) laugh :)


I wonder, how much can it improve over this: https://youtu.be/c_0s06ORTkY

The X27 is also using some kind of neural algorithm to denoise and get the maximum out of the CIS.


What camera are they shooting at 409,600 ISO at?


In the video they reference the Sony A7S II, on Sony's website[1] they claim:

>Still images: ISO 100-102400 (expandable to ISO 50-409600),

[1]:https://www.sony.co.uk/electronics/interchangeable-lens-came...


Which is extremely lossy, because any ISO other than the sensor's native level is the result of in-camera processing. Unlike film, adjusting the "ISO" in a digital camera doesn't increase sensitivity; that's physically impossible. Instead, very strong overgain processing is applied.

So in this instance they're processing lossily on top of an image already processed lossily in-camera.


The A7S image is used for comparison rather than as input data, as that camera is widely regarded as the state of the art for low-light photography on a non-scientific/military budget.


The Sony a7s and a7sii have a native iso range of 100 to 102400, which can be boosted to 409600.


Why use ISO 8,000 as input and not the camera's native ISO?


Now all they have to do is make sure the correct image is displayed for the story.


Impressive of the American news channel, CNN, to convert images in minus one second.


Please don't do this. What would be a fun crack in person becomes a plastic garbage patch here.


Apologies dang, I’ll keep this in mind in future :)


Appreciated!


CNN here is a "Convolutional Neural Network"


The title needs to be changed so brains recognize it as such. It either needs a preceding adjective or letter indicating what type of convolutional network it is.

The other option is spelling it out.

Most people will read CNN as the news channel. Even those familiar with neural networks.


Yeah, I read it exactly as CNN == Cable News Network and was confused for a while...


I don't think it needs to be changed. I also read it as the news channel, wondered how that would make sense, and guessed it was the other sort of CNN. It's confusing but not misleading.


Not everyone is from the US...


I'm not from Europe, but if I saw "BBC converts night images to perfect daylight in ~1 second", I would assume it meant the British Broadcasting Corporation. CNN is just as big of a name. His point is absolutely relevant.

Said differently: The percentage of people who are not from the US - but are aware of CNN as the Cable News Network, is higher than the percentage of people who are not machine learning experts - but are aware of CNN as a Convolutional Neural Network


Not even close to being American and still see CNN as the News Network.

Agree that title should be changed.


We get CNN.com outside the US as well.


1. Not everyone works with ML or is familiar with different types of neural nets

2. CNN exists outside of the US


A significant part is, though, and it would be helpful for that segment.


The new title is even worse, though. "Learning to See in the Dark" gives absolutely no clue as to what the article is about.


People will have the intellectual curiosity to click in to see what the title means anyway; I don't see how this is a problem in need of a fix.


I think OP was being sarcastic


Me too, but lots of people may not know so I answered it as if it were sincere.


*raises hand* I interpreted it as the news company, looked at the GitHub and came back going "what does CNN have to do with this, it looks like they're uni students." Thank you for the actual acronym expansion; it's obtuse if you don't know anything about this stuff. Terrible title.


Still -1 second is impressive!


I, too, was initially very confused by the headline.


Well, Turner used to colorize b/w movies, I guess progress marches on.


Well I got the right CNN by the time I got to the end of the title, but, "Turner"? The British pastoral landscape painter? You lost me.



CNN is owned by Turner Broadcasting


Ted Turner founded CNN.


Would add a new dimension to "fake news" if it were...


Speaking of which, this would help a lot of awful dark photos sent in to news sources.


They should make a CLI tool; it stores your processed images one second before it is invoked!

See also: https://en.m.wikipedia.org/wiki/Thiotimoline

The major peculiarity of the chemical is its "endochronicity": it starts dissolving before it makes contact with water.


See also Ted Chiang's What’s Expected of Us, where there's a device with a button and a LED, and the LED always lights up one second before you press the button.


https://www.nature.com/articles/436150a

The heart of each Predictor is a circuit with a negative time delay — it sends a signal back in time.


Seriously, what/who's CNN here?


Convolutional Neural Network (for example, some more info here: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Unders...)


Convolutional neural network


I understand that the title may have brought confusion, and I think your comment calls attention to this whilst also being somewhat funny.

But still, could we make an effort not to devolve into what has happened on Reddit, i.e. comment sections which mainly consist of puns and other low effort jokes?


Meta: this is when I like the Slashdot method, you allow the jokes, tag them as jokes, and let people use their own settings to show/hide the jokes. Basically be permissive on content, but demand proper tagging, then allow people to filter out what they don't want.


Title is confusing (intentionally?).

It would make sense to add TensorFlow to make it more specific.


I don't understand what you are saying, it is very "HN" to comment on the title and not on the article.

And it is even more "HN" to comment on details of the title or the article because you don't really know what to say about the article.

Look at my comment.


Sure, and ideally those are downvoted. But low-grade jokes fit in here even less and are usually downvoted. At least that's how it is before HN finishes its transition into Reddit, at which point they won't be downvoted and we'll be back to scrolling through bad jokes like "lol CNN in the title looks like CNN the news network lmao, anyone else notice?" to find serious comments.

It's a fight worth fighting. "C'mon it's just a joke lol" or "it's only a comment about the title" only assist in that transformation.


Yeah, my mind jumped to CNN's coverage of the first Gulf War and having some color night vision for combat journalism:

https://www.thedrive.com/the-war-zone/25803/this-is-what-col...


Ah, after decades of effort, we have finally replicated the effect of a candle.



