I've never really been able to figure out what a good strategy is for object storage organization. Do you create a bucket per application instance? user? organization?
Right now I'm playing with a new service and came up with this, which is probably over-engineered:
Here is an actual object key including the bucket:
7dcdb229600e4467a2714866e0d406f6/85/26c/c0271374067b5db832adb7909a7/bbda55db15266f7ce2284d8f5f66fc85e495e2b12265ef87537237ad5e2658b24c081970332417f60e5fc352ae9b8c1031398c02ecde03eb29af2d3c8eda8a4b/y18.gif
Given the file's uuid is aabbbcccccccccccccccccccccccccccccc
for original images:
{{organizations_uuid as bucket}}/aa/bbb/cccccccccccccccccccccccccccccc/{{sha512sum}}/{{originalfilename}}
And for all derivatives of it:
{{organizations_uuid as bucket}}/aa/bbb/cccccccccccccccccccccccccccccc/derived/{{this file's uuid}}_{(unknown)}
My thinking was that:
- Using the organization's uuid (an organization can have multiple users) as the bucket makes backing up per organization, and running on-prem deployments, easier.
- Encoding the file's uuid in the object key makes the object easy to identify, and splitting that uuid into 2/3/rest pieces helps spread objects across key prefixes.
- Encoding the file's sha512sum in the key enables verifying the file's integrity even without a database.
- Putting all derived files under derived/, prefixed with the original file's uuid, makes the link between them clear.
I know this results in long object keys, as the actual example above shows, but they do encode quite a bit of information. Which parts of this are considered bad practice? Do you have any real-world examples of other strategies? They seem hard to come by.
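Putting the scheme above together, here is a minimal Python sketch of how such a key could be built and later verified. The helper names (`object_key`, `verify`) are mine, not part of the scheme; the bucket (the organization's uuid) is assumed to be handled separately, so it doesn't appear in the key.

```python
import hashlib
import uuid

def object_key(file_uuid: str, data: bytes, original_filename: str) -> str:
    """Build a key per the scheme above: aa/bbb/rest-of-uuid/sha512/originalfilename.
    The organization's uuid is the bucket, so it is not part of the key itself."""
    hexid = uuid.UUID(file_uuid).hex           # 32 hex chars, dashes stripped
    digest = hashlib.sha512(data).hexdigest()  # 128 hex chars
    return f"{hexid[:2]}/{hexid[2:5]}/{hexid[5:]}/{digest}/{original_filename}"

def verify(data: bytes, key: str) -> bool:
    """Integrity check without a database: recompute the sha512
    and compare it to the digest embedded in the key."""
    return hashlib.sha512(data).hexdigest() == key.split("/")[3]
```

For the example key in the comment, `object_key("8526cc0271374067b5db832adb7909a7", gif_bytes, "y18.gif")` would produce the `85/26c/c0271374067b5db832adb7909a7/…/y18.gif` shape shown above.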
Perhaps I'm missing something about your use case, but I only create buckets per application, or sometimes per file category (videos, profile images, whatever).
I don't have any other real use case for bucket-per-org other than easy bucket mirroring, backup, and maybe migration from shared hosting to on-premises.
I didn't think of using different prefixes for different media uses. We, for example, would then use thumbnail/originating_file_uuid.png and poster/originating_file_uuid.png.
Correct, I have no need for the original filename in most cases. If I did want this info, for example if I was building a file-browser type thing (à la Dropbox), then sure, I'd keep that in the db.
Personally I'm uploading directly from the browser to S3 using presigned URLs. All files get uploaded to a /tmp directory in my bucket. This bucket is configured so that all files in /tmp are deleted after 1 day (to remove any unsaved uploads). When a form is submitted, I pass the key of the temporary file in the form (via e.g. <input type="hidden" name="s3_key">) and create the associated database record. I then move the file from its temporary location to its permanent one upon saving said record.
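That workflow could be sketched roughly as follows, assuming boto3. The bucket name and helper names are placeholders of mine, and the one-day expiry on the tmp/ prefix is assumed to be configured as an S3 lifecycle rule; S3 has no atomic move, so "moving" is a copy followed by a delete.

```python
import uuid

BUCKET = "my-app-bucket"  # hypothetical bucket name

def tmp_key(filename: str) -> str:
    """Key under the tmp/ prefix, where a lifecycle rule expires
    unsaved uploads after one day."""
    return f"tmp/{uuid.uuid4().hex}/{filename}"

def presign_tmp_upload(s3, filename: str) -> dict:
    """Presigned POST confining the browser upload to the tmp/ prefix.
    `s3` is a boto3 S3 client, e.g. boto3.client("s3")."""
    return s3.generate_presigned_post(BUCKET, tmp_key(filename), ExpiresIn=3600)

def promote(s3, src_key: str, permanent_key: str) -> None:
    """On form save: copy the temporary object to its permanent key,
    then delete the tmp copy (S3 has no atomic move)."""
    s3.copy_object(Bucket=BUCKET, Key=permanent_key,
                   CopySource={"Bucket": BUCKET, "Key": src_key})
    s3.delete_object(Bucket=BUCKET, Key=src_key)
```

The browser posts the file to the presigned URL, the form carries the resulting key in the hidden `s3_key` field, and `promote` runs when the record is saved.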
Feel free to email me to continue this discussion - email address is on my profile.
> The problem with that is that the originally uploaded filename is lost. At least without storing it in a separate database.
Sure, but that's a tradeoff nearly every website accepts because they just need the image itself. If you do want to preserve the original filename, is there a reason for not just keeping it in a database?
I'd like to have these systems as decoupled as possible, or at least have some meaningful information without a dependency on an external datastore. This might just be me being paranoid and overthinking it, but after dealing with a nasty monolith of an application for the last couple of years, and finally convincing the rest of the team that we need to change if we want to be able to expand, I want to do it right.
What’s the point of an answer like that? Does Amazon have a history of sabotaging network links between, say, Google’s or MSFT’s networks and their own? Or is this just an attempt at being funny?
I don't think that would happen. S3 already has a competitive advantage in the same region, given the absence of the data transfer costs you'd typically pay.
I'd be more interested if Backblaze purchased several 10-gigabit links per region to AWS under an arrangement like Direct Connect, or had alternative direct peering with AWS, removing transit and peering-exchange risks. I don't think the current Direct Connect is compatible with this, though, as it seems Backblaze would have to swallow all the traffic costs for every customer. I could be wrong though...
With multi-region redundancy only (as reduced redundancy is actually more expensive) and amazing integration and workflow options, B2 and S3 are not comparable products.