273 points by Dontizi 2 days ago | 33 comments
andai 1 day ago
The embedding model (bge-m3 in this case) has a sequence length of 8192 tokens, but rlama tries to embed the whole book, so Ollama can only fit the first few pages into the embedding request.
Then at retrieval time it returns the entire document instead of the relevant passage (because there is no chunking), but truncates it to the first 1000 characters, i.e. the first half-page of the table of contents.
As a result, when queried, the model says: "There is no direct mention of the Buddha in the provided documents." (The word Buddha appears 44,121 times in the documents I indexed.)
A better solution (and, as far as I can tell, what every other RAG does) is to split the document into chunks that actually fit within the embedding model's context, and then retrieve those chunks -- ideally with metadata about which part of the document each chunk came from.
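Roughly what I mean, as a sketch in Go (the type and function names here are made up for illustration, not rlama's actual code):

    package rag

    // Split a document into overlapping chunks small enough for the
    // embedding model, keeping metadata that points back to the source.
    // Assumes size > overlap.
    type Chunk struct {
        DocID string
        Index int
        Start int // rune offset into the original document
        Text  string
    }

    func SplitIntoChunks(docID, text string, size, overlap int) []Chunk {
        runes := []rune(text)
        var chunks []Chunk
        for start, i := 0, 0; start < len(runes); i++ {
            end := start + size
            if end > len(runes) {
                end = len(runes)
            }
            chunks = append(chunks, Chunk{DocID: docID, Index: i, Start: start, Text: string(runes[start:end])})
            if end == len(runes) {
                break
            }
            start = end - overlap
        }
        return chunks
    }

Each chunk then gets its own embedding; at query time you embed the question and return the top few chunks rather than the whole document.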
---
I'd also recommend showing the search results to the user (I think just having a vector search engine is already an extremely useful feature, even without the AI summary / question answering), and altering the prompt to provide references (e.g. based on chunk metadata like page number).
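For the references, the retrieved chunks' metadata can simply be carried into the prompt, something like this (again, hypothetical names, not rlama's API):

    package rag

    import (
        "fmt"
        "strings"
    )

    // One retrieved chunk plus the metadata needed for a citation.
    type Retrieved struct {
        Source string // e.g. file name
        Page   int    // page number carried in the chunk metadata
        Text   string
    }

    // BuildPrompt asks the model to emit [source p.N] markers, which can
    // then be rendered as references under the answer.
    func BuildPrompt(question string, hits []Retrieved) string {
        var b strings.Builder
        b.WriteString("Answer using only the excerpts below and cite them as [source p.N].\n\n")
        for _, h := range hits {
            fmt.Fprintf(&b, "[%s p.%d]\n%s\n\n", h.Source, h.Page, h.Text)
        }
        fmt.Fprintf(&b, "Question: %s\n", question)
        return b.String()
    }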
Dontizi 10 hours ago
simonw 23 hours ago
Models that respond really quickly to a short sentence prompt need vastly more RAM and CPU/GPU time for significantly longer inputs. I'm finding this really damages their utility for me.
rafaelmn 23 hours ago
Books have author-provided logical chunking in chapters. You can further split/summarize smaller sections and then do a hierarchical search (naive chunking kind of sucks, in my experience).
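A rough sketch of that two-level search in Go (types and names invented; assumes the embeddings are L2-normalized, so a plain dot product stands in for cosine similarity):

    package rag

    import "sort"

    type Section struct {
        Title string
        Text  string
        Vec   []float64 // embedding of the section text
    }

    type Chapter struct {
        Title      string
        SummaryVec []float64 // embedding of an LLM-written chapter summary
        Sections   []Section
    }

    func dot(a, b []float64) float64 {
        var s float64
        for i := range a {
            s += a[i] * b[i]
        }
        return s
    }

    // Rank chapters by their summary embeddings first, then rank only the
    // sections inside the best-matching chapters.
    func HierarchicalSearch(query []float64, chapters []Chapter, topChapters, topSections int) []Section {
        sort.Slice(chapters, func(i, j int) bool {
            return dot(query, chapters[i].SummaryVec) > dot(query, chapters[j].SummaryVec)
        })
        if topChapters > len(chapters) {
            topChapters = len(chapters)
        }
        var candidates []Section
        for _, ch := range chapters[:topChapters] {
            candidates = append(candidates, ch.Sections...)
        }
        sort.Slice(candidates, func(i, j int) bool {
            return dot(query, candidates[i].Vec) > dot(query, candidates[j].Vec)
        })
        if topSections > len(candidates) {
            topSections = len(candidates)
        }
        return candidates[:topSections]
    }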
elliot07 22 hours ago
danihh 22 hours ago
Wonky documentation (definitely released too early), but imo the best model-agnostic DIY solution out there.
mentalgear 1 day ago
- as an end user, some primary concerns re apps using the file system:
- who will be able to read it? does the app share data?
- I'm not thinking about a privacy policy, but a hard block that would not allow any internet access for the binary/app. Would rlama still work correctly?
- is the app able to modify/delete files?
- it should be ensured that there is no "full file system" access, i.e. just read permission
- code note: surprised that .ts (TypeScript) is not listed
- really crisp website: did you code it from scratch or is it template-based?
ImPostingOnHN 23 hours ago
Note that there are threat profiles for which this is not enough security.
foundzen 1 day ago
I like the fact that it is written in Go and small enough to skim over the weekend, but after repeatedly burning my time on dozens of LLM ecosystem tools, I'm careful about even exploring the code myself without seeing these basic disclosures upfront. I'm sure you'd see more people adopting your tool if you provide a high-level overview of the project's architecture (ideally in a visual manner).
Dontizi 1 day ago
but for now here is the stack used:

- Core Language: Go (chosen for performance, cross-platform compatibility, and single binary distribution)
- CLI Framework: Cobra (for command-line interface structure)
- LLM Integration: Ollama API (for embeddings and completions)
- Storage: Local filesystem-based storage (JSON files for simplicity and portability)
- Vector Search: Custom implementation of cosine similarity for embedding retrieval
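Simplified a lot, that last piece is a brute-force cosine scan over the stored chunk embeddings, roughly like this (a sketch, not the exact code from the repo):

    package rag

    import (
        "math"
        "sort"
    )

    type StoredChunk struct {
        Source string
        Text   string
        Vec    []float64
    }

    func cosine(a, b []float64) float64 {
        var dot, na, nb float64
        for i := range a {
            dot += a[i] * b[i]
            na += a[i] * a[i]
            nb += b[i] * b[i]
        }
        if na == 0 || nb == 0 {
            return 0
        }
        return dot / (math.Sqrt(na) * math.Sqrt(nb))
    }

    // TopK scores every stored chunk against the query embedding and
    // returns the k most similar ones.
    func TopK(query []float64, chunks []StoredChunk, k int) []StoredChunk {
        sort.Slice(chunks, func(i, j int) bool {
            return cosine(query, chunks[i].Vec) > cosine(query, chunks[j].Vec)
        })
        if k > len(chunks) {
            k = len(chunks)
        }
        return chunks[:k]
    }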
PhilippGille 1 day ago
dcreater 1 day ago
mentalgear 1 day ago
Xiol32 1 day ago
andai 1 day ago
https://github.com/DonTizi/rlama/blob/main/internal/service/...
smusamashah 1 day ago
To me, getting data out of my notes correctly is what matters most. I use AI tools for coding occasionally (which I can easily verify on my own); for anything else I can never bring myself to fully trust the output.
tarruda 1 day ago
I don't know about the OP's tool, but Open WebUI has its own document database which you can integrate with LLMs, and when answering questions it always cites the source with a link for you to verify.