Commit 67db524

Merge feature/knowledge-base: Knowledge Base RAG pipeline (v0.3.3)
Add documentation-aware RAG pipeline for Elixir/OTP knowledge:
- knowledge_entries table with pgvector 768-dim HNSW index
- Knowledge context module with keyword-first hybrid search
- SkillAPI knowledge_read/knowledge_write permissions
- LLM resolve_api_key config fallback for Gemini/Anthropic
- HexDocs scraper dynamic skill (modules + guides via sidebar JSON)
- Chat RAG integration with context source selector
- 452 tests passing, 0 failures
2 parents e3a360e + 674f79f commit 67db524

13 files changed (+1067, -34 lines)

README.md

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@ AlexClaw monitors the world (RSS feeds, web sources, GitHub repositories, APIs),
 - **Telegram Gateway** — Bidirectional communication via long-polling. Command routing is deterministic pattern-matching — no LLM involved in dispatch.
 - **Runtime Configuration** — All settings (API keys, prompts, limits, personas) are stored in PostgreSQL, cached in ETS, and editable at runtime via the admin UI. No restart required for any config change.
 - **Persistent Memory with Semantic Search** — PostgreSQL + pgvector for knowledge storage. Deduplication by URL. Hybrid search combines vector cosine similarity and keyword matching — vector results are prioritized, keyword results fill gaps for exact matches. Embeddings are generated asynchronously via the LLM router (Gemini `gemini-embedding-001`, Ollama `nomic-embed-text`, or any OpenAI-compatible endpoint). 768-dimension vectors with HNSW index. All skills that store knowledge auto-embed in the background.
+- **Knowledge Base RAG** — Separate `knowledge_entries` table for documentation and reference material, isolated from news/conversation memory. Scraper skills fetch, chunk, and embed documentation from hexdocs.pm (API reference + official guides). Chat integrates both Knowledge and Memory search with a context source selector (Docs only / Memory only / Both / None). System prompt instructs the LLM to cite provided documentation over general knowledge. Currently covers 22 Elixir ecosystem packages including full Elixir stdlib and 53 official guides.
 - **Cron Scheduler** — Quantum-based. Jobs defined in config or DB.

 ### Skills
@@ -44,6 +45,7 @@ AlexClaw monitors the world (RSS feeds, web sources, GitHub repositories, APIs),
 | `google_calendar` | Fetch upcoming Google Calendar events |
 | `google_tasks` | Manage Google Tasks lists and items |
 | `web_automation` | Browser automation via headless Playwright sidecar (**experimental**) |
+| `hexdocs_scraper` | Scrape hexdocs.pm docs into knowledge base embeddings (dynamic) |

 ### Dynamic Skill Loading (**experimental**)

ROADMAP.md

Lines changed: 4 additions & 0 deletions
@@ -62,6 +62,10 @@ Additional notification/command channels beyond Telegram:

 ## Someday

+### ~~Knowledge Base RAG~~ ✅ Completed (v0.3.3)
+
+Separate `knowledge_entries` table with pgvector HNSW index for documentation and reference material. HexDocs scraper skill discovers modules via sidebar JSON, chunks by section/function, and embeds via local nomic-embed-text or Gemini. Chat RAG integration with context source selector (Docs/Memory/Both/None). Keyword-first hybrid search for precise documentation retrieval. LLM API key resolution falls back from provider record to config settings. Currently 22 packages scraped (4200+ chunks), including full Elixir stdlib and 53 official guides.
+
 ### ~~Semantic Search (Memory)~~ ✅ Completed (v0.2.1)

 Hybrid search combining pgvector cosine similarity and keyword matching. Embeddings generated asynchronously via Gemini `text-embedding-004`, Ollama `nomic-embed-text`, or any OpenAI-compatible endpoint. 768-dimension vectors with HNSW index. All skills auto-embed stored knowledge in the background. Batch re-embed support for model switching.

lib/alex_claw/knowledge.ex

Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
+defmodule AlexClaw.Knowledge do
+  @moduledoc """
+  Knowledge base store for documentation, guides, and reference material.
+  Separate from Memory (news/facts) to keep embeddings cleanly partitioned.
+  Supports hybrid search: pgvector cosine similarity + keyword fallback.
+  Embeddings are generated asynchronously under TaskSupervisor.
+  """
+  require Logger
+  import Ecto.Query
+  alias AlexClaw.Repo
+  alias AlexClaw.Knowledge.Entry
+
+  @type store_opts :: [source: String.t() | nil, metadata: map(), expires_at: DateTime.t() | nil]
+
+  @spec store(atom() | String.t(), String.t(), store_opts()) ::
+          {:ok, Entry.t()} | {:error, Ecto.Changeset.t()}
+  def store(kind, content, opts \\ []) do
+    source = Keyword.get(opts, :source)
+    metadata = Keyword.get(opts, :metadata, %{})
+    expires_at = Keyword.get(opts, :expires_at)
+
+    result =
+      %Entry{}
+      |> Entry.changeset(%{
+        kind: to_string(kind),
+        content: content,
+        source: source,
+        embedding: nil,
+        metadata: metadata,
+        expires_at: expires_at
+      })
+      |> Repo.insert()
+
+    case result do
+      {:ok, entry} ->
+        async_embed(entry)
+        {:ok, entry}
+
+      error ->
+        error
+    end
+  end
+
+  @spec search(String.t(), keyword()) :: [Entry.t()]
+  def search(query, opts \\ []) do
+    limit = Keyword.get(opts, :limit, 10)
+    kind = Keyword.get(opts, :kind)
+
+    keyword_results = keyword_search(query, kind, limit)
+
+    case AlexClaw.LLM.embed(query) do
+      {:ok, embedding} ->
+        vector_results = vector_search(embedding, kind, limit)
+        merge_results(keyword_results, vector_results, limit)
+
+      {:error, _} ->
+        keyword_results
+    end
+  end
+
+  @spec exists?(String.t()) :: boolean()
+  def exists?(source_url) do
+    Entry
+    |> where([e], e.source == ^source_url)
+    |> Repo.exists?()
+  end
+
+  @spec recent(keyword()) :: [Entry.t()]
+  def recent(opts \\ []) do
+    limit = Keyword.get(opts, :limit, 20)
+    kind = Keyword.get(opts, :kind)
+
+    Entry
+    |> maybe_filter_kind(kind)
+    |> order_by([e], desc: e.inserted_at)
+    |> limit(^limit)
+    |> Repo.all()
+  end
+
+  @spec count(atom() | String.t() | nil) :: non_neg_integer()
+  def count(kind \\ nil) do
+    Entry
+    |> maybe_filter_kind(kind)
+    |> Repo.aggregate(:count)
+  end
+
+  @spec reembed_all(keyword()) :: {:ok, non_neg_integer()}
+  def reembed_all(opts \\ []) do
+    batch_size = Keyword.get(opts, :batch_size, 20)
+    max_concurrency = Keyword.get(opts, :max_concurrency, 2)
+
+    entries =
+      Entry
+      |> where([e], is_nil(e.embedding))
+      |> Repo.all()
+
+    count = length(entries)
+
+    if count > 0 do
+      caller = self()
+
+      Task.Supervisor.start_child(AlexClaw.TaskSupervisor, fn ->
+        sandbox_allow(caller)
+        Logger.info("Re-embedding #{count} knowledge entries...")
+
+        entries
+        |> Enum.chunk_every(batch_size)
+        |> Enum.each(fn batch ->
+          batch
+          |> Task.async_stream(
+            fn entry -> embed_entry(entry) end,
+            max_concurrency: max_concurrency,
+            timeout: 30_000,
+            on_timeout: :kill_task
+          )
+          |> Stream.run()
+        end)
+
+        Logger.info("Re-embedding complete: processed #{count} knowledge entries")
+      end)
+    end
+
+    {:ok, count}
+  end
+
+  # --- Internal ---
+
+  defp async_embed(%Entry{} = entry) do
+    caller = self()
+
+    Task.Supervisor.start_child(AlexClaw.TaskSupervisor, fn ->
+      sandbox_allow(caller)
+      embed_entry(entry)
+    end)
+  end
+
+  defp sandbox_allow(caller) do
+    if Application.get_env(:alex_claw, AlexClaw.Repo)[:pool] == Ecto.Adapters.SQL.Sandbox do
+      Ecto.Adapters.SQL.Sandbox.allow(AlexClaw.Repo, caller, self())
+    end
+  end
+
+  defp embed_entry(%Entry{id: id, content: content}) do
+    case AlexClaw.LLM.embed(content) do
+      {:ok, vector} when is_list(vector) ->
+        case Repo.get(Entry, id) do
+          nil -> :ok
+          entry -> entry |> Entry.changeset(%{embedding: vector}) |> Repo.update()
+        end
+
+      {:error, reason} ->
+        Logger.warning("Embedding failed for knowledge entry #{id}: #{inspect(reason)}")
+        :ok
+    end
+  end
+
+  defp merge_results(keyword_results, vector_results, limit) do
+    # Keyword matches are more precise for documentation, so they go first.
+    # Then fill with vector results that weren't already found by keyword.
+    keyword_ids = MapSet.new(keyword_results, & &1.id)
+
+    new_vector =
+      vector_results
+      |> Enum.reject(fn e -> MapSet.member?(keyword_ids, e.id) end)
+
+    (keyword_results ++ new_vector)
+    |> Enum.take(limit)
+  end
+
+  defp vector_search(embedding, kind, limit) do
+    Entry
+    |> maybe_filter_kind(kind)
+    |> where([e], not is_nil(e.embedding))
+    |> order_by([e], fragment("embedding <=> ?", ^embedding))
+    |> limit(^limit)
+    |> Repo.all()
+  end
+
+  defp keyword_search(query, kind, limit) do
+    terms =
+      query
+      |> String.replace(~r/[?!.,;:()\[\]{}"']/, " ")
+      |> String.split(~r/\s+/, trim: true)
+      |> Enum.reject(fn t -> String.length(t) < 3 end)
+      |> Enum.reject(fn t -> String.downcase(t) in ~w(the and for how does what which with from that this are was were can) end)
+      |> Enum.take(5)
+
+    case terms do
+      [] ->
+        []
+
+      terms ->
+        Enum.reduce(terms, Entry |> maybe_filter_kind(kind), fn term, q ->
+          pattern = "%#{term}%"
+          where(q, [e], ilike(e.content, ^pattern))
+        end)
+        |> order_by([e], desc: e.inserted_at)
+        |> limit(^limit)
+        |> Repo.all()
+    end
+  end
+
+  defp maybe_filter_kind(queryable, nil), do: queryable
+  defp maybe_filter_kind(queryable, kind), do: where(queryable, [e], e.kind == ^to_string(kind))
+end
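
The keyword-first behaviour of `search/2` in the diff above can be tried without a database. The following is a minimal sketch, not the project's code: `KnowledgeSketch`, `extract_terms/1`, and `merge/3` are hypothetical names, and plain maps with an `:id` key stand in for `Entry` structs. The term extraction and merge logic are copied from `keyword_search/3` and `merge_results/3`.

```elixir
defmodule KnowledgeSketch do
  # Hypothetical stand-in for the private helpers in AlexClaw.Knowledge.
  # Works on plain maps so it runs without Ecto or pgvector.

  @stopwords ~w(the and for how does what which with from that this are was were can)

  # Same steps as keyword_search/3: replace punctuation with spaces,
  # drop words shorter than 3 characters and stopwords, keep at most 5 terms.
  def extract_terms(query) do
    query
    |> String.replace(~r/[?!.,;:()\[\]{}"']/, " ")
    |> String.split(~r/\s+/, trim: true)
    |> Enum.reject(fn t -> String.length(t) < 3 end)
    |> Enum.reject(fn t -> String.downcase(t) in @stopwords end)
    |> Enum.take(5)
  end

  # Same ordering as merge_results/3: keyword hits first, then vector hits
  # whose ids were not already returned by the keyword search.
  def merge(keyword_results, vector_results, limit) do
    keyword_ids = MapSet.new(keyword_results, & &1.id)
    new_vector = Enum.reject(vector_results, &MapSet.member?(keyword_ids, &1.id))
    Enum.take(keyword_results ++ new_vector, limit)
  end
end

terms = KnowledgeSketch.extract_terms("How does GenServer.call handle timeouts?")
IO.inspect(terms)

merged = KnowledgeSketch.merge([%{id: 1}, %{id: 2}], [%{id: 2}, %{id: 3}, %{id: 4}], 3)
IO.inspect(Enum.map(merged, & &1.id))
```

Two consequences of the code in the diff follow. The dot is in the punctuation class, so a query such as `GenServer.call` is split into two terms, and because each term adds another `where ... ilike` clause, an entry must contain every term to count as a keyword hit. Also, when the keyword search already returns `limit` rows, no vector result survives the final `Enum.take/2`.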

lib/alex_claw/knowledge/entry.ex

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+defmodule AlexClaw.Knowledge.Entry do
+  @moduledoc "Ecto schema for knowledge base entries with vector embeddings."
+
+  use Ecto.Schema
+  import Ecto.Changeset
+
+  schema "knowledge_entries" do
+    field :kind, :string
+    field :content, :string
+    field :source, :string
+    field :embedding, Pgvector.Ecto.Vector
+    field :metadata, :map, default: %{}
+    field :expires_at, :utc_datetime
+
+    timestamps(type: :utc_datetime)
+  end
+
+  @type t :: %__MODULE__{
+          id: integer() | nil,
+          kind: String.t(),
+          content: String.t(),
+          source: String.t() | nil,
+          embedding: list(float()) | nil,
+          metadata: map(),
+          expires_at: DateTime.t() | nil,
+          inserted_at: DateTime.t() | nil,
+          updated_at: DateTime.t() | nil
+        }
+
+  @spec changeset(t() | Ecto.Changeset.t(), map()) :: Ecto.Changeset.t()
+  def changeset(entry, attrs) do
+    entry
+    |> cast(attrs, [:kind, :content, :source, :embedding, :metadata, :expires_at])
+    |> validate_required([:kind, :content])
+  end
+end

lib/alex_claw/llm.ex

Lines changed: 19 additions & 3 deletions
@@ -169,7 +169,7 @@ defmodule AlexClaw.LLM do
   end

   defp call_embedding(%Provider{type: "gemini"} = p, text, model) do
-    api_key = p.api_key || ""
+    api_key = resolve_api_key(p)

     if api_key == "",
       do: {:error, :api_key_not_set},
@@ -194,6 +194,22 @@ defmodule AlexClaw.LLM do
     {:error, :anthropic_no_embeddings}
   end

+  # --- API Key Resolution ---
+
+  @config_key_map %{
+    "gemini" => "llm.gemini_api_key",
+    "anthropic" => "llm.anthropic_api_key"
+  }
+
+  defp resolve_api_key(%Provider{api_key: key}) when is_binary(key) and key != "", do: key
+
+  defp resolve_api_key(%Provider{type: type}) do
+    case Map.get(@config_key_map, type) do
+      nil -> ""
+      config_key -> AlexClaw.Config.get(config_key) || ""
+    end
+  end
+
   # --- Gemini Embeddings ---

   defp call_embedding_gemini(api_key, text, model) do
@@ -335,15 +351,15 @@ defmodule AlexClaw.LLM do
   # --- Provider Calls ---

   defp call_provider(%Provider{type: "gemini"} = p, prompt, system) do
-    api_key = p.api_key || ""
+    api_key = resolve_api_key(p)

     if api_key == "",
       do: {:error, :api_key_not_set},
       else: call_gemini(p.model, api_key, prompt, system)
   end

   defp call_provider(%Provider{type: "anthropic"} = p, prompt, system) do
-    api_key = p.api_key || ""
+    api_key = resolve_api_key(p)

     if api_key == "",
       do: {:error, :api_key_not_set},
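
The fallback order introduced by `resolve_api_key/1` (provider record first, then config, then an empty string) can be illustrated in isolation. This is a sketch under stated assumptions: `ApiKeySketch`, its nested `Provider` struct, and the `config` map argument are hypothetical replacements for the project's `Provider` schema and `AlexClaw.Config.get/1`, neither of which is shown in this diff.

```elixir
defmodule ApiKeySketch do
  # Hypothetical stand-in: a bare struct instead of the real Provider schema.
  defmodule Provider do
    defstruct [:type, :api_key]
  end

  # Same mapping as @config_key_map in the diff.
  @config_key_map %{
    "gemini" => "llm.gemini_api_key",
    "anthropic" => "llm.anthropic_api_key"
  }

  # A non-empty key on the provider record always wins.
  def resolve(%Provider{api_key: key}, _config) when is_binary(key) and key != "", do: key

  # Otherwise look up the per-type config key. `config` is a plain map here
  # (assumption); the real code calls AlexClaw.Config.get/1 instead.
  def resolve(%Provider{type: type}, config) do
    case Map.get(@config_key_map, type) do
      nil -> ""
      config_key -> Map.get(config, config_key) || ""
    end
  end
end

config = %{"llm.gemini_api_key" => "from-config"}

IO.inspect(ApiKeySketch.resolve(%ApiKeySketch.Provider{type: "gemini", api_key: "on-record"}, config))
IO.inspect(ApiKeySketch.resolve(%ApiKeySketch.Provider{type: "gemini", api_key: nil}, config))
IO.inspect(ApiKeySketch.resolve(%ApiKeySketch.Provider{type: "anthropic", api_key: ""}, config))
```

Returning `""` rather than `nil` keeps the existing `if api_key == ""` checks in `call_embedding/3` and `call_provider/3` unchanged, which appears to be why the diff only swaps the right-hand side of the `api_key =` assignments.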

lib/alex_claw/skills/skill_api.ex

Lines changed: 24 additions & 1 deletion
@@ -10,7 +10,7 @@ defmodule AlexClaw.Skills.SkillAPI do
   """
   require Logger

-  @known_permissions ~w(llm telegram_send memory_read memory_write web_read config_read resources_read skill_invoke)a
+  @known_permissions ~w(llm telegram_send memory_read memory_write knowledge_read knowledge_write web_read config_read resources_read skill_invoke)a

   def known_permissions, do: @known_permissions

@@ -78,6 +78,29 @@ defmodule AlexClaw.Skills.SkillAPI do
     end
   end

+  # --- Knowledge ---
+
+  @doc "Search knowledge base by semantic similarity. Opts: :limit, :kind"
+  def knowledge_search(skill_module, query, opts \\ []) do
+    with :ok <- check_permission(skill_module, :knowledge_read) do
+      {:ok, AlexClaw.Knowledge.search(query, opts)}
+    end
+  end
+
+  @doc "Check if a source URL already exists in knowledge base."
+  def knowledge_exists?(skill_module, source_url) do
+    with :ok <- check_permission(skill_module, :knowledge_read) do
+      {:ok, AlexClaw.Knowledge.exists?(source_url)}
+    end
+  end
+
+  @doc "Store a knowledge entry. Opts: :source, :metadata, :expires_at"
+  def knowledge_store(skill_module, kind, content, opts \\ []) do
+    with :ok <- check_permission(skill_module, :knowledge_write) do
+      AlexClaw.Knowledge.store(kind, content, opts)
+    end
+  end
+
   # --- HTTP ---

   @doc "HTTP GET. All Req options are passed through (headers, receive_timeout, params, etc)."
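
The three knowledge functions above share one pattern: a `with` clause gates the call on a permission and lets any non-`:ok` result fall through to the caller. The sketch below shows that pattern on its own. `PermissionSketch` is hypothetical, permissions are passed as a plain list, and the `{:error, {:permission_denied, permission}}` shape is an assumption, since the real `check_permission/2` is not part of this diff.

```elixir
defmodule PermissionSketch do
  # Assumption: the granted permissions arrive as a list of atoms and a denial
  # is an error tuple. The real SkillAPI.check_permission/2 takes a skill
  # module and may return a different error shape.
  def check_permission(granted, permission) do
    if permission in granted, do: :ok, else: {:error, {:permission_denied, permission}}
  end

  # Mirrors knowledge_search/3: run the search only when the permission check
  # returns :ok; otherwise `with` returns the error unchanged.
  def knowledge_search(granted, query, search_fun) do
    with :ok <- check_permission(granted, :knowledge_read) do
      {:ok, search_fun.(query)}
    end
  end
end

# The search function is a stub standing in for AlexClaw.Knowledge.search/2.
fake_search = fn q -> ["hit for " <> q] end

IO.inspect(PermissionSketch.knowledge_search([:knowledge_read], "GenServer", fake_search))
IO.inspect(PermissionSketch.knowledge_search([:memory_read], "GenServer", fake_search))
```

One detail of the diff worth keeping in mind when calling it from a skill: `knowledge_exists?/2` returns `{:ok, boolean}` rather than a bare boolean, so a caller needs to match on `{:ok, true}` instead of using the result directly in an `if`.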
