Within 15 minutes of setting up an HTTPS CI environment, complete with a robots.txt, Googlebot was hitting the DNS name, even though the name wasn't public, had never been used before, and wasn't easily guessed.
Google gets a lot of leeway from people. If you have done SEO, you learn that Googlebot doesn't always respect robots.txt. Requesting to de-index a page can take weeks or even months; the quickest way is to file a DMCA complaint against the link to your own site.
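Worth remembering that robots.txt is purely advisory: a crawler decides whether to honor it, and nothing enforces it server-side. A minimal Python sketch of what a well-behaved client does before fetching (the hostname and path here are placeholders):

    import urllib.robotparser

    # Fetch and parse the site's robots.txt (example.com is hypothetical).
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A compliant crawler checks this before every request;
    # an impolite one simply skips the check.
    print(rp.can_fetch("Googlebot", "https://example.com/private/build.log"))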
Recently they started tracking all downloads made in Chrome (for malware detection), including the filename, the URL, the IP, and the timestamp. Sucks hard, since I love Chrome, and the only way to disable it is to turn off the malware checker entirely (which only uses partial hashes anyway).
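The "partial hashes" bit refers to how Safe Browsing lookups work: the client keeps a local list of short SHA-256 prefixes and only contacts Google when a prefix matches, so Google sees a candidate set rather than the exact URL. A rough sketch of that prefix-matching step, under the assumption of 4-byte prefixes and skipping URL canonicalization (the local prefix set here is made up):

    import hashlib

    # Hypothetical local database of 4-byte SHA-256 prefixes (in reality
    # synced from the Safe Browsing API, never hardcoded like this).
    local_prefixes = {bytes.fromhex("5c8f2a11"), bytes.fromhex("a3e91b07")}

    def needs_full_hash_check(url_expression: str) -> bool:
        """Return True if the URL's hash prefix is in the local list,
        meaning the client would send that prefix (not the URL) upstream."""
        digest = hashlib.sha256(url_expression.encode()).digest()
        return digest[:4] in local_prefixes

    # No prefix match: the download proceeds with no network request at all.
    print(needs_full_hash_check("evil.example/payload.exe"))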
Another possibility is that the hostnames leaked via the SSL certificate itself. I've seen evidence of spiders, including Google's, using certificates for hostname discovery. Your best protection in that case is a wildcard certificate, if you still want it to validate.
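One concrete way certificates leak hostnames today is Certificate Transparency: every issued certificate gets published to public logs, so any name in the SAN list is discoverable by anyone. A quick sketch of searching those logs via crt.sh's JSON endpoint (the domain is a placeholder, and crt.sh's output format could change):

    import json
    import urllib.request

    # Query crt.sh's certificate-transparency search for a domain.
    # '%25' is a URL-encoded '%', the SQL wildcard, so this matches
    # every subdomain of example.com.
    url = "https://crt.sh/?q=%25.example.com&output=json"
    with urllib.request.urlopen(url) as resp:
        entries = json.load(resp)

    # Each entry's name_value may hold several SANs separated by newlines.
    names = set()
    for e in entries:
        names.update(e["name_value"].splitlines())

    # Every hostname ever put in a certificate shows up here, which is
    # why a wildcard cert (a single '*.example.com' entry) leaks less.
    for name in sorted(names):
        print(name)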