Related:
Major cyber attack could cost the world $3.5 trillion - Power Grid, Internet Outage
The one database/file/zip to save humanity, what is it?
Show Lemmy the downloadable URL of a database or AI you know of, so we can keep local backup copies that improve the resilience and availability of human knowledge.
Given how corporatized AI is becoming, I think we could definitely use links to whatever comes closest to a fully usable, open-source, fully self-contained downloadable AI.
Starter Pack:
- Wikipedia Single 100GB File
- http://sci-hub.wf/
- Arxiv Download Script (rough sketch below)
- https://wholeearth.info/
- https://the-eye.eu/
- Endless OS “Offline Library”
- scikit-learn AI with External Databases
- ScienceFair
★ Lemmy List
Databases
AI
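To make the "Arxiv Download Script" item concrete, here's a rough sketch against the public export.arxiv.org API. The search query, result count, and 3-second delay are just my assumptions; for real bulk mirroring, arXiv's own bulk data access (S3/Kaggle dumps) is the better route than hitting this API repeatedly.

```python
# Sketch: fetch a few arXiv records via the public export.arxiv.org API
# and download their PDFs. Query and max_results are placeholders.
import time
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
query = ("http://export.arxiv.org/api/query"
         "?search_query=all:electron&start=0&max_results=5")

with urllib.request.urlopen(query) as resp:
    feed = ET.fromstring(resp.read())

for entry in feed.findall(f"{ATOM}entry"):
    # Entry IDs look like http://arxiv.org/abs/2301.00001v1 (or old-style
    # cond-mat/0102536v1); keep everything after /abs/ as the arXiv ID.
    arxiv_id = entry.find(f"{ATOM}id").text.split("/abs/")[-1]
    title = entry.find(f"{ATOM}title").text.strip()
    pdf_url = f"https://arxiv.org/pdf/{arxiv_id}"
    fname = arxiv_id.replace("/", "_") + ".pdf"
    print(f"Downloading {arxiv_id}: {title}")
    urllib.request.urlretrieve(pdf_url, fname)
    time.sleep(3)  # be polite: arXiv asks clients to keep request rates low
```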
This is too much catastrophism for my taste, but if I wanted to start archiving, I'd start by downloading Wikipedia, Library Genesis, and Project Gutenberg.
Videos are too heavy to archive with ease, and they probably carry much less value in terms of actual knowledge.
Haven’t heard about the Gutenberg project before, seems pretty neat!
I’d probably add repair.wiki to a list of things I’d archive, although some of that content is picture-heavy, so it's not as easily compressible as Wikipedia.
There's a project that lets you download Wikipedia and some other online resources into an easy-to-search-and-navigate UI; I think it was called Kiwi-something but can't remember. It was targeted at regions with poor internet coverage.
Yup, Kiwix: an app available for Android, iOS, Linux, and possibly other OSes too.
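If anyone wants to poke at those Kiwix ZIM files programmatically instead of through the app, here's a rough sketch using the python-libzim bindings, going by the examples in its README. The filename and search term are just placeholders; ZIM files themselves come from https://download.kiwix.org/zim/.

```python
# Sketch: open a ZIM archive and run a full-text search over its index.
from libzim.reader import Archive
from libzim.search import Query, Searcher

zim = Archive("wikipedia_en_all_nopic.zim")  # example filename
print(f"Main entry: {zim.main_entry.get_item().path}")

searcher = Searcher(zim)
query = Query().set_query("printing press")
search = searcher.search(query)
print(f"~{search.getEstimatedMatches()} matches")

# Fetch the first few results and read their HTML content.
for path in search.getResults(0, 5):
    entry = zim.get_entry_by_path(path)
    html = bytes(entry.get_item().content).decode("UTF-8")
    print(entry.title, "-", len(html), "bytes of HTML")
```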
Project Gutenberg has been a thing for decades. I think they are also starting to create free audiobooks from books in their collection. There is a TTS AI service I checked out a week ago (play.ht) that does very realistic voicing from the text I gave it, and I might spring for $40 for a month of that service and build some audiobooks. The paid version gives access to more voices and will do 1 million characters of text a year.
Or if anyone knows a good open-source alternative, I'm all ears. I'd prefer to go that route but didn't find anything that was a very good solution.
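Not a full answer, but one fully offline, open-source option is pyttsx3, which wraps the OS speech engine (eSpeak on Linux, SAPI5 on Windows, NSSpeechSynthesizer on macOS). Voice quality is nowhere near play.ht, but it's free and local. A minimal sketch; the Project Gutenberg URL is just an example book I picked arbitrarily:

```python
# Sketch: render a slice of a Project Gutenberg plain-text book to audio
# with pyttsx3 (offline, open source).
import urllib.request
import pyttsx3

url = "https://www.gutenberg.org/cache/epub/84/pg84.txt"  # example: Frankenstein
text = urllib.request.urlopen(url).read().decode("utf-8-sig")
sample = text[:5000]  # small slice for a quick test; drop this for the full book

engine = pyttsx3.init()
engine.setProperty("rate", 160)                 # roughly words per minute
engine.save_to_file(sample, "frankenstein.wav")  # queue rendering to a file
engine.runAndWait()                              # actually produce the audio
```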
Humanity has been using writing for millennia. It’s a proven technology. Photographs and video don’t tend to last longer than the one institution or family that cares about them.
Plus writing dgaf if you get hit with a Carrington event.
Mostly due to previous physical constraints, I would argue. Thankfully there's less chance your hard drive is going to decompose into vinegar (the way old acetate film does) while sitting in your cupboard, and even if it does, it's likely not the only copy.
Photos and video are also more limited for current data because they're harder to parse and convert into other usable formats, but thankfully that will get better over time too.
I still prefer text-first data for various reasons, but let's not dismiss the huge potential video has for communication and archival value, both intentional and unintentional.
Perhaps think of it more as knowledge decentralization, a form of resiliency against unplanned network outages. Sometimes the Library of Alexandria just happens to catch fire, and it might be nobody's fault at all.
Besides, plenty of people grew up in families with a basic encyclopaedia or dictionary or a repair manual. This is essentially the same thing, just with less paper.
I'm particularly looking for anyone who already has a collection of arXiv and Sci-Hub papers. Please curate your collection and make it available here!
We also need a hashtag/topic/keyword for this project, something brief and catchy that we can also use for a GitHub search, etc. Anyone?
Is it possible to download an archive of Sci-Hub?
Sci-Hub is ENORMOUS, about 100 TB in total. If you want to help preserve it, you can torrent and seed one of its many ~100 GB chunks.
What a fantastic resource, this is exactly what is needed. I also found out about The Standard Template Construct library:
“Learn about how to access large corpus of high-quality scholarly texts using Python and use them in AI apps”
Super cool, I never knew about this. I've got probably 1-2 TB I can spare for the effort.
Does anyone know if an LLM has been trained on something like Sci-Hub?