

Damn Englishmen. With their…ways.




Built by an autist to give your LLM autism. No Tylenol needed.


I’ll look. I have no idea what that is.


My brother in virtual silicon: I run this shit on a $200 p.o.s. with 4 GB of VRAM.
If you can run an LLM at all, this will run. BONUS: because of the way “Vodka” operates, you can run with a smaller context window without eating shit from OOM errors. So…that means…if you could only run a 4B model before (the GGUF itself is ~3 GB without the overheads, and then you add in the drag from KV-cache accumulation), maybe you can now run the next size up…or enjoy no-slowdown chats with the model size you have.


Yes. Several reasons -
Focuses on making LOCAL LLMs more reliable. You can hitch it to OpenRouter or ChatGPT if you want to leak your personal deets everywhere, but that’s not what this is for. I built this to make local, self-hosted stuff BETTER.
The entire system operates on curating (and ticketing with provenance trails) local data…so you don’t need to fire YOLO requests thru god-knows-where to pull information.
In theory, you could automate a workflow that does this: poll SearXNG, grab whatever you want, make a .md summary, drop it into your KB folder, then tell your LLM “do the thing” (a rough sketch of that loop is just below). Or use Scrapy if you prefer: https://github.com/scrapy/scrapy
Your memory is stored on disk, at home, in a tamper-proof file that you can inspect. No one else can see it. It doesn’t get leaked by the LLM anywhere, because until you ask, it literally has no idea what facts you’ve stored. The contents of your KBs, memory stores, etc. are CLOSED OFF from the LLM.
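If you want a rough idea of what that SearXNG-to-KB loop could look like, here’s a minimal sketch. To be clear, none of this is from the repo: the instance URL, the kb/ folder, and the field names are assumptions, and your SearXNG instance needs the JSON output format turned on.

```python
# Hypothetical sketch of the "poll SearXNG -> drop a .md in your KB folder" loop.
# Nothing here is from the repo: the instance URL, kb/ folder, and result fields
# are assumptions. Your SearXNG instance must have the JSON output format enabled.
from pathlib import Path
import requests

SEARX_URL = "http://localhost:8080/search"   # your self-hosted SearXNG
KB_DIR = Path("kb")                          # wherever your KB folder lives

def pull_to_kb(query: str, max_results: int = 5) -> Path:
    resp = requests.get(SEARX_URL, params={"q": query, "format": "json"}, timeout=30)
    resp.raise_for_status()
    results = resp.json().get("results", [])[:max_results]

    # Build a plain .md summary: title, URL, and the engine-provided snippet.
    lines = [f"# {query}", ""]
    for r in results:
        lines += [f"## {r.get('title', 'untitled')}",
                  f"Source: {r.get('url', 'unknown')}",
                  r.get("content", ""), ""]

    KB_DIR.mkdir(parents=True, exist_ok=True)
    out = KB_DIR / f"{query.replace(' ', '_')}.md"
    out.write_text("\n".join(lines), encoding="utf-8")
    return out

# pull_to_kb("qdrant filtering syntax")  # then tell your LLM "do the thing"
```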


This is a quote from Deming, one of the fathers of modern data analysis. It basically means “I don’t trust you. You’re not god. Provide citations or retract your statement.”


Once again: I am a meat popsicle (with ASD), not AI. All errors and foibles are mine :)


LOL. Don’t do that. Wikipedia is THE noisiest source.
Would you like me to show you HOW and WHY the SUMM pathway works? I built it after I tried “YOLO Wikipedia in that shit - done, bby!”. It…ended poorly.


Correct. Curate your sources :)
I can’t LoRA stupid out of a model…but I can do this. If your model is at all obedient and non-stupid, and reasons from good sources, it will do well with the harness.
Would you like to see the benchmarks for the models I recommend in the “minimum recs” section? They are very strong…and not chosen at random.
Like the router, I bring receipts :)


Probably the latter. I unironically used “Obeyant” the other day, like a time-traveling barrister from the 1600s.
I have 2e ASD and my hyperfocus is language.


Good question.
It doesn’t “correct” the model after the fact. It controls what the model is allowed to see and use before it ever answers.
There are basically three modes, each stricter than the last. The default is “serious mode” (governed by serious.py). Low temp, punishes chattiness and inventiveness, forces it to state context for whatever it says.
Additionally, Vodka (made up of two sub-modules, “cut the crap” and “fast recall”) operates at all times. Cut the crap trims context so the model only sees a bounded, stable window. You can think of it like a rolling summary of what’s been said. That summary isn’t LLM-generated either - it’s concatenation (dumb text matching), so no made-up vibes.
Fast recall, OTOH, stores and recalls facts verbatim from disk, not from the model’s latent memory.
It writes what you tell it to a text file and then, when you ask about it, spits it back out verbatim (!! / ??).
And that’s the baseline.
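If it helps to see the flavour of it in code, here’s a toy sketch of those two ideas. This is NOT the repo’s implementation - the file name and the character bound are made up - it just shows the principle of “dumb concatenation + verbatim disk storage”.

```python
# Toy illustration of the two Vodka ideas above -- NOT the repo's implementation.
# "Cut the crap": the model only ever sees a bounded window built by dumb
# concatenation/truncation, so nothing in it can be made up.
# "Fast recall": facts live verbatim in a plain text file on disk.
from pathlib import Path

MAX_CHARS = 4_000                    # bound on what the model is allowed to see
FACTS_FILE = Path("facts.txt")       # plain text, at home, inspectable

def rolling_window(turns: list[str]) -> str:
    """Concatenate recent turns and trim to the bound (oldest text falls off)."""
    return "\n".join(turns)[-MAX_CHARS:]

def remember(fact: str) -> None:
    """The '!!' idea: write the fact down, word for word."""
    with FACTS_FILE.open("a", encoding="utf-8") as f:
        f.write(fact.strip() + "\n")

def recall(query: str) -> list[str]:
    """The '??' idea: dumb text matching against the file, no latent memory."""
    if not FACTS_FILE.exists():
        return []
    return [line for line in FACTS_FILE.read_text(encoding="utf-8").splitlines()
            if query.lower() in line.lower()]
```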
In KB mode, the LLM answers under all of the above settings, plus with reference to your docs ONLY (in the first instance).
When you >>attach <kb>, the router gets stricter again. Now the model is instructed to answer only from the attached documents.
Those docs can even get summarized via an internal prompt if you run >>summ new, so that extra details are stripped out and you are left with just baseline who-what-where-when-why-how.
The SUMM_*.md files come with SHA-256 provenance, so every claim can be traced back to a specific origin file (which gets moved to a subfolder).
TL;DR: If the answer isn’t in the KB, it’s told to say so instead of guessing.
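The provenance part is simple in principle. Something like this (a sketch only; the actual SUMM pathway lives in the repo, and the file/folder names below are just for illustration):

```python
# Sketch of the provenance idea only -- the real SUMM pathway lives in the repo.
# Stamp each summary with the SHA-256 of the file it came from, then park the
# origin in a subfolder. File and folder names here are made up for illustration.
import hashlib
import shutil
from pathlib import Path

def write_summ_with_provenance(src: Path, summary_text: str) -> Path:
    digest = hashlib.sha256(src.read_bytes()).hexdigest()

    summ = src.parent / f"SUMM_{src.stem}.md"
    summ.write_text(
        f"{summary_text}\n\n---\nsource: {src.name}\nsha256: {digest}\n",
        encoding="utf-8",
    )

    # Move the origin aside so every claim traces back to an exact, hashed file.
    archive = src.parent / "originals"
    archive.mkdir(exist_ok=True)
    shutil.move(str(src), str(archive / src.name))
    return summ
```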
Finally, Mentats mode (Vault / Qdrant). This is the “I am done with your shit” path.
It’s all three of the above PLUS a counter-factual sweep.
It runs ONLY on stuff you’ve promoted into the vault.
What it does is take your question and reframe it so that all of the particulars must be answered in order for there to BE an answer. Any part missing or not in context? No soup for you!
In step 1, it runs that past the thinker model. The answer is then passed to a “critic” model (a different LLM). That model’s job is to look at the thinker’s output and say “bullshit - what about xyz?”.
It sends that back to the thinker…who then answers and provides the final output. But if it CANNOT answer the critic’s questions from the stored info, it will tell you. No soup for you, again!
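Shape-wise, the loop looks roughly like this (a sketch, not the router’s actual code - ask(model, prompt) stands in for whatever client you use to talk to each local model):

```python
# Rough shape of the Mentats thinker/critic pass -- a sketch, not the router's code.
# `ask(model, prompt)` stands in for whatever client you use to hit each local model
# (llama.cpp server, an OpenAI-compatible endpoint, whatever you run at home).
from typing import Callable

def mentats(question: str, context: str, ask: Callable[[str, str], str]) -> str:
    # Step 1: the thinker answers ONLY from the promoted vault context.
    draft = ask("thinker", f"Answer ONLY from this context:\n{context}\n\nQ: {question}")

    # Step 2: a different model plays critic and hunts for unsupported or missing bits.
    critique = ask("critic",
                   f"Context:\n{context}\n\nDraft answer:\n{draft}\n\n"
                   "List anything unsupported or missing. Reply OK if there is nothing.")

    if critique.strip().upper().startswith("OK"):
        return draft

    # Step 3: the thinker must answer the critic's objections from the stored info,
    # or admit it cannot. No soup for you.
    return ask("thinker",
               f"Context:\n{context}\n\nYour draft:\n{draft}\n\nCritic says:\n{critique}\n\n"
               "Fix the answer using ONLY the context. If you cannot, say so plainly.")
```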
TL;DR:
The “corrections” happen by routing and constraint. The model never gets the chance to hallucinate in the first place, because it literally isn’t shown anything it’s not allowed to use. Basic premise - trust but verify (and I’ve given you all the tools I could think of to do that).
Does that explain it better? The repo has a FAQ but if I can explain anything more specifically or clearly, please let me know. I built this for people like you and me.


On the stuff you use the pipeline/s on? About 85-90% in my tests.
Just don’t GIGO (Garbage in, Garbage Out) your source docs…and don’t use a retarded LLM.
That’s why I recommend Qwen3-4B 2507 Instruct. It does what you tell it to (even the abliterated one I use).
Random Sexy-fun-bot900-HAVOK-MATRIX-1B.gguf? I couldn’t say :)


Thank you <3
Please let me know how it works…and enjoy the >>FR settings. If you’ve ever wanted to be trolled by Bender (or a host of other 1990s/2000s-era memes), you’ll love it.


Hmm. I dunno - never tried. I suppose if the wiki could be imported in a compatible format, it should be able to chew thru it just fine. Wikis are usually just gussied-up text files anyway :) Drop the contents of your wiki in there as .md files and see what it does.


Hell yes I can explain. What would you like to know?


The deed is done! Woot! I’ll do a longer post elsewhere, but you get to be the first cab off the rank :)
https://codeberg.org/BobbyLLM/llama-conductor
OR


Exactly. This is the worst kind of corporate “In these trying times, we care” theatre. I expected better from DDG.
Poorly scoped, poorly defined, taps into vague “AI = Bad!” fears.
What exactly am I voting on here? Vibes?
This doesn’t increase my trust index for DDG. If anything, it makes me wonder about DDG’s agenda.
Very Veridian Dynamics, DDG.
Fail.
PS: Self host SearXNG > DDG


Rinzler…what have you become?


Additionally, on Windows (Linux too?) one could use Moonlight / Sunshine to compute on the GPU and stream to a secondary device (either directly, say to a Chromecast, or via the iGPU to their monitor). Latency is quite small in most circumstances, and it allows for some interesting tricks (e.g., server GPUs let you split the GPU into multiple “mini-GPUs” - essentially, with the right card, you could host two+ entirely different, concurrent instances of GTA V on one machine, via one physical GPU).
A bit hacky, but it works.
Source: I bought a Tesla P4 for $100 and stuck it in a 1L case.
GPU goes brrr
o7