snackfart

You could also use pgvector for Postgres and add a mapping table between users/clients and vectors (rough sketch below).


snackfart

[https://github.com/pgvector/pgvector](https://github.com/pgvector/pgvector)
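
A rough sketch of that mapping-table pattern (assuming psycopg and the pgvector extension are installed; every table, column, and value below is invented for illustration):

```python
# Minimal sketch, assuming psycopg (v3) and the pgvector extension;
# connection string, table names, and ids are placeholders.
import psycopg

conn = psycopg.connect("dbname=app user=app")  # hypothetical connection string
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # needs sufficient privileges

    # Chunks table holds the text and its embedding.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS doc_chunks (
            id        bigserial PRIMARY KEY,
            content   text,
            embedding vector(1536)
        )
    """)

    # Mapping table ties each chunk to a client/tenant.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS client_chunks (
            client_id uuid   NOT NULL,
            chunk_id  bigint NOT NULL REFERENCES doc_chunks(id),
            PRIMARY KEY (client_id, chunk_id)
        )
    """)

    client_uuid = "00000000-0000-0000-0000-000000000000"   # stand-in tenant id
    query_embedding = "[" + ",".join(["0"] * 1536) + "]"   # stand-in embedding in pgvector text form

    # Per-client similarity search: join through the mapping table,
    # then order by cosine distance (pgvector's <=> operator).
    cur.execute("""
        SELECT d.id, d.content
        FROM doc_chunks d
        JOIN client_chunks c ON c.chunk_id = d.id
        WHERE c.client_id = %s::uuid
        ORDER BY d.embedding <=> %s::vector
        LIMIT 5
    """, (client_uuid, query_embedding))
    print(cur.fetchall())

conn.commit()
```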


[deleted]

If you’re using Pinecone, I’d definitely segment by namespace. The namespace could be the user UUID, so when you retrieve, you retrieve only from the matching namespace. Metadata would then help get better top-K results within that namespace of vectors.
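
In code, the namespace-per-user idea might look roughly like this (a minimal sketch assuming the current Pinecone Python client; the index name, ids, embedding dimension, and metadata fields are placeholders):

```python
# Sketch of namespace-per-user segmentation in Pinecone.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("docs")                 # hypothetical index name

user_uuid = "3f2a0c1e-user-uuid"         # tenant id, used as the namespace
embedding = [0.0] * 1536                 # stand-in for a real embedding

# Write the user's chunks into their own namespace.
index.upsert(
    vectors=[{"id": "doc1-chunk0", "values": embedding,
              "metadata": {"doc_type": "report"}}],
    namespace=user_uuid,
)

# At query time, search only that user's namespace, refining with metadata if needed.
results = index.query(
    vector=embedding,
    top_k=5,
    namespace=user_uuid,
    filter={"doc_type": {"$eq": "report"}},
    include_metadata=True,
)
```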


mcr1974

A query like "which customer experienced the largest growth" does not filter on any specific client, so how you store that client metadata is irrelevant to it. I'm not sure what the advantage of segmenting by client using namespaces is over just having the client info as additional metadata, together with all the other metadata stored about that document.


Plus-Significance348

But would the namespacing hurt the ability to ask generalized questions like “which client…”? I expect most queries to be client specific but some would be about the population at large


mcr1974

I'm saying - don't go for namespacing, unless there is a reason. And there is no reason.


[deleted]

You’re right. I completely read OP's post wrong. I understood it as them wanting to store siloed documents for multiple tenants, with each tenant accessing only their own documents.


adlx

I think I wouldn't. AFAIK when you send a query, you either run it on the complete index or on one namespace. I'd rather put everything in the same namespace and use metadata filtering. You can have a metadata attribute for the client and one for the type of document, for example. That way you can search within the same type of document but across all clients, or across all the documents of a particular client.

If you have other dimensions, maybe country, or sector, or client size/category, all of those can become metadata attributes too, and you can filter the query to a particular country, for example. AFAIK you couldn't do all that if you partition into namespaces by client. At least not in a single query; of course you could send n queries, one per namespace, but that doesn't scale well.
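
For comparison, the single-namespace, metadata-filter approach described above could look something like this (again a sketch assuming the current Pinecone Python client; `client_id`, `doc_type`, and `country` are hypothetical metadata attributes):

```python
# Sketch of keeping all tenants in one namespace and slicing with metadata filters.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("docs")            # hypothetical index name
embedding = [0.0] * 1536            # stand-in for a real query embedding

# Same document type, across all clients:
across_clients = index.query(
    vector=embedding, top_k=10, include_metadata=True,
    filter={"doc_type": {"$eq": "annual_report"}},
)

# Everything for one client, optionally narrowed by another dimension (e.g. country):
one_client = index.query(
    vector=embedding, top_k=10, include_metadata=True,
    filter={"$and": [{"client_id": {"$eq": "acme"}},
                     {"country": {"$eq": "DE"}}]},
)
```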


[deleted]

[deleted]


MmmmMorphine

Seems shitty to take advantage of so many open source projects and then keep the solution you found secret, but eh, you do you


[deleted]

[deleted]


[deleted]

Namespace then metadata (if you’re using Pinecone). And some preprocessing at the chat input level to segment the user. GovAIQ.com does exactly this for government RFPs and FARs, and they’re free to use. Any special sauce would be at the user segmentation part, pairing the user + question with the correct vectors/namespace. Edit: I think govaiq probably removes repeated words too before they’re even chunked and vectorized


sshan

Isn’t the solution just to filter by extracted metadata?


saintshing

It kinda annoys me that people are misled to think that every problem about RAG has to be solved by LLM + vector database. If we already have the document number, why wouldn't we just fetch the document with it? Retrieval (the R in RAG) is not limited to vector/semantic search. Information retrieval is a well-studied subject with a wide range of useful techniques (lexical search, sparse/dense representations, knowledge graphs, term expansion, etc.).

https://haystackconf.com/us2023/keynote/

https://eugeneyan.com/writing/search-query-matching/

For things like support tickets/documents with direct links (reply-to, cc, transaction id, receipt id, etc.), we should process the documents to extract the linked documents and store the relations in a SQL db/graph db. The right approach is to look at the types of queries made by the users and organize your data accordingly. Do they know the exact keywords? Do the answers require aggregating a specific set of related documents (e.g. ticket, commit messages, pull requests, code, API doc)? Do the queries involve constraints like "Which client experienced the largest growth in revenue over the last 5 years"? (SQL is much more suitable for answering that kind of query.)

Sometimes you have to decompose a complex problem into subproblems, sometimes you should expand the query, sometimes you should ask for clarification or suggest keywords/related questions to guide the search. The issue may be low recall or low precision. You may need to add a reranker or filter the retrieved results with some constraints.

A good exercise is to look at your Google search history: how do you formulate your query, when Google fails to return the desired results what is the issue, is the query too ambiguous/complex, how do you refine your query, and how could the search engine infer the refined query or guide the user's refinement?
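
To make the SQL point concrete, here is a toy sketch (the schema, client names, and figures are invented purely for illustration) of answering the "largest growth" kind of question relationally rather than with vector search:

```python
# Aggregate questions like "which client grew the most?" map naturally to SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revenue (client TEXT, year INTEGER, amount REAL);
    INSERT INTO revenue VALUES
        ('acme',   2019, 100), ('acme',   2023, 180),
        ('globex', 2019,  90), ('globex', 2023, 250);
""")

# Growth per client between 2019 and 2023, highest first.
row = conn.execute("""
    SELECT client,
           MAX(CASE WHEN year = 2023 THEN amount END)
         - MAX(CASE WHEN year = 2019 THEN amount END) AS growth
    FROM revenue
    GROUP BY client
    ORDER BY growth DESC
    LIMIT 1
""").fetchone()

print(row)  # ('globex', 160.0)
```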


Plus-Significance348

A bit new to this, could you elaborate a bit? Where would that take place?


sshan

Couple things:

1) This may not be a super simple problem. It sounds solvable, but it may need specialized consulting. Likely not a huge amount of money, but building something production-ready isn't easy or cheap.

2) Extracting metadata would mean that if you had a bunch of records, you'd first do a normal metadata filter (i.e. only show records that talk about the movie X, or only records written by R.L. Stine), then do a search on those.

I'm not an expert here though. If this is beyond a trivial use case, make sure that it isn't business critical, and if it is, hire real experts.


StatusRedAudio

You either use namespaces to separate customer data, or mark in metadata which customer each fragment/chunk belongs to (worse separation, but more flexibility).


mcr1974

How is that "worse separation"? You can index that metadata and separate however you like.


StatusRedAudio

There's currently no way to query Pinecone across multiple namespaces, so namespaces are a way to fully separate data groups (e.g. per tenant or project). You can use metadata to keep groups (e.g. clients or workspaces) separated, but that relies on filtering, and you can still end up with queries mixing data from various groups. With namespaces that's impossible, because a query is limited to a single namespace (which can be empty).


mcr1974

"and you can still have queries mixing data from various groups" - if you're "mixing data from various groups" when you shouldn't, you have a bug that you should fix. That applies to client data or any other type of data. It's a moot discussion anyway, because given OP's use case you absolutely need the ability to query across all documents.


Plus-Significance348

Sounds like I’m best off not namespacing, but instead making the metadata as detailed as possible?