85% of 'Hard' word flags stem from unrecognized inflected forms — this PR surfaces a one-line surface→lemma explanation for the conjugations that existing color-band decomposition couldn't reach.
be55c77
Morphology bridge: explain inflected forms on a confused/missed mark
6/3/2026
When an Arabic learner taps a yellow 'missed/confused' mark for a word they've studied, the help panel shows a morphological color-band decomposition (clitics, derived forms). But an analysis of confusion-capture data revealed that ~85% of 'Hard' flags are form-recognition failures, and the existing decomposer only explained ~45% of inflected surfaces — it was silent on verb conjugations (present tense, past with gender/number) because these aren't stored in forms_json. Learners who knew the lemma "أَفْسَدَ" but failed to recognize its present-tense "يُفْسِدُ" got no help at all.
This PR introduces classify_surface_morphology() — a shared rule-based classifier that covers the full range of inflection types: verb present-tense (prefix heuristic), other verb conjugations, derived forms matched via forms_json, proclitics, enclitics, and a catch-all inflection category. For the verb-tense cases the color bands can't render, it generates a one-line explanation string (e.g. "present-tense form of «to spoil»"). The function is wired into three places: the analyze_confusion read path (exposed in ConfusionAnalysisOut), the submit-sentence write path (stored per-surface in variant_stats_json so confusion is queryable, not re-derived), and the confusion_help interaction log (records morph_category per yellow tap).
The result is a morphology field flowing from backend service to Pydantic schema to TypeScript type to a new morphBridge UI widget in WordInfoCard. The old _match_surface_form private helper in sentence_review_service is removed — its narrower logic is fully superseded by classify_surface_morphology. 11 new unit tests cover edge cases including the Form-IV past-not-present guard (the lemma "أَفْسَدَ" starts with أ, which is also a present-tense prefix — the classifier must not misidentify the dictionary form itself as present tense).
confusion_service.py:820-830 — Walk through the Form-IV present-tense guard: not lemma_bare.startswith(core[0]). Trace it with lemma 'افسد' and surface 'افسد' (identity, caught earlier), then surface 'يفسد' (core[0]='ي', lemma starts with 'ا' ≠ 'ي', fires correctly), then surface 'أفعل' for a Form-IV lemma starting with أ (core[0]='أ', lemma_bare starts with 'أ', guard suppresses — correct).
The guard at confusion_service.py (if core and core[0] in _PRESENT_PREFIXES and not lemma_bare.startswith(core[0])) works correctly for all three cases. The test test_form_iv_past_is_identity_not_present confirms the identity case returns None, and test_verb_present_not_in_forms confirms 'يفسد' with lemma_bare='افسد' (starts with 'ا' ≠ 'ي') fires correctly. For a Form-IV lemma starting with 'أ', core[0]='أ' and lemma_bare.startswith('أ') would be True, so the guard suppresses — correct.
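The three traces above can be reproduced with a small standalone sketch. Only the guard condition is quoted from the PR; the contents of _PRESENT_PREFIXES and the helper name looks_like_present are assumptions made for illustration, and the real classifier does more work around this check.

```python
# Hypothetical reconstruction of the Form-IV present-tense guard.
# ASSUMPTION: the prefix set below (ya, ta, nun, hamza-alif, bare alif);
# the real _PRESENT_PREFIXES in confusion_service.py may differ.
_PRESENT_PREFIXES = frozenset({"\u064a", "\u062a", "\u0646", "\u0623", "\u0627"})


def looks_like_present(core: str, lemma_bare: str) -> bool:
    """Return True when `core` starts with a present-tense prefix that the
    lemma itself does not start with (the guard quoted in the review)."""
    return (
        bool(core)
        and core[0] in _PRESENT_PREFIXES
        and not lemma_bare.startswith(core[0])
    )


# Readable aliases for the strings used in the trace.
YUFSID = "\u064a\u0641\u0633\u062f"        # يفسد
AFSADA_BARE = "\u0627\u0641\u0633\u062f"   # افسد (bare alif)
AFSADA_HAMZA = "\u0623\u0641\u0633\u062f"  # أفسد (hamza kept)
```

With these definitions, looks_like_present(YUFSID, AFSADA_BARE) fires, while a core that shares its first letter with the lemma (AFSADA_BARE against itself, or a hamza-initial core against AFSADA_HAMZA) is suppressed.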
confusion_service.py:888-892 — Verify the new has_morph logic: decomposition is not None or bool(morphology and morphology.get('explanation')). This means verb_other also counts as morphological (it always has an explanation). Is that intentional? A past-tense conjugation of a verb the learner knows should probably count as morphological confusion, but confirm this is the desired classification.
The code at confusion_service.py sets has_morph = decomposition is not None or bool(morphology and morphology.get('explanation')). Both verb_present and verb_other always return a non-None explanation, so both count as morphological. The comment in the diff explicitly states 'A verb-tense explanation counts as morphological even when decompose_surface returned nothing', confirming this is intentional design.
frontend/lib/review/WordInfoCard.tsx:564-578 — Check whether morphBridge can render simultaneously with a non-None decomposition. If a surface has both clitics (decomposition set) and is also a verb tense (explanation set), both the color bands and the bridge text would appear. Based on the classifier logic this shouldn't happen (explanation only set when form_key is None, but decomp can still be set for clitic cases)... verify with a proclitic + verb-present surface like 'ليفسد'.
In classify_surface_morphology, when the surface is a verb with a proclitic (e.g. 'ليفسد'), the code strips the proclitic from core and then checks if core[0] is a present prefix. If it matches, it returns {category: 'verb_present', explanation: '...'}. Meanwhile, decompose_surface may also find the 'ل' proclitic and return a decomposition. In analyze_confusion, both decomposition and morphology.explanation could be non-None simultaneously, causing both the color bands AND the morphology bridge text to render in WordInfoCard. The classifier logic does not suppress explanation when a decomposition exists.
backend/tests/test_sentence_review.py:373 — The test seeds surface_form='يُفْسِدُ' but asserts on ulk.variant_stats_json['يفسد']. Confirm strip_diacritics is called before the variant_stats key lookup in submit_sentence_review — if it's not, the key lookup will fail silently (missing key returns empty dict, tests may pass but for wrong reasons).
The diff does not show the key lookup code in submit_sentence_review where vstats[surface_bare] is set. The variable is named surface_bare suggesting diacritics are stripped before use, but the actual stripping call is not visible in the diff (it's in unchanged code). The test would need to pass with the diacritized surface 'يُفْسِدُ' being stripped to 'يفسد' before the vstats key is set.
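To make the implicit dependency concrete, here is a minimal sketch of the behavior the test relies on. strip_diacritics_sketch is a stand-in written for this note, not the project's strip_diacritics (which lives in app.services.sentence_validator and may handle more cases), and record_confusion is a hypothetical reduction of the write path.

```python
# ASSUMPTION: diacritics are the Arabic harakat U+064B..U+0652 plus the
# dagger alif U+0670. Hamza carriers are left untouched, in line with the
# test comment that strip_diacritics does not normalize hamza.
_HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics_sketch(text: str) -> str:
    """Drop short-vowel marks, tanwin, shadda, sukun and dagger alif."""
    return "".join(ch for ch in text if ch not in _HARAKAT)


def record_confusion(vstats: dict, surface_form: str) -> dict:
    """Key the per-surface stats by the bare (undiacritized) surface."""
    surface_bare = strip_diacritics_sketch(surface_form)
    entry = dict(vstats.get(surface_bare, {}))
    entry["confused"] = entry.get("confused", 0) + 1
    vstats[surface_bare] = entry
    return vstats
```

If the stripping step were skipped, the entry would be stored under the diacritized key and a lookup by the bare form would miss it.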
backend/app/services/sentence_review_service.py:352-363 — The new code calls classify_surface_morphology on every missed/confused surface. This runs decompose_surface() internally. Check the performance profile: is decompose_surface() called twice now (once in the read path when building confusion help, and once here in the write path)? For the submit-sentence write path this is expected, but confirm it's not accidentally called in a hot loop.
The write path in submit_sentence_review calls classify_surface_morphology once per missed/confused surface word when updating vstats. This is a separate code path from the read path (analyze_confusion in confusion_service.py). The call is inside the loop over sentence words but only executes for missed/confused surfaces, which is expected and acceptable for a write path.
frontend/lib/types.ts:1104-1110 - The category union is a compile-time contract on the frontend only. The backend declares category as a plain str, so adding a category there will not produce a TypeScript error unless the frontend types are regenerated from the API schema; the two definitions must be updated together by hand. Consider adding a comment cross-referencing the backend constant to make this dependency explicit for future maintainers.
The SurfaceMorphology interface in frontend/lib/types.ts has a strict union type for category with no cross-reference comment pointing to the backend schemas.py or confusion_service.py. The backend comment in schemas.py says # verb_present | verb_other | derived_form | proclitic | enclitic | inflection but neither file references the other, making this a maintenance risk.
docs/data-model.md — The updated variant_stats_json description says 'each entry also stores a category'. This is only true going forward — historical rows written by _match_surface_form have no category field. If any analytics or product code reads this field without a default, it could fail on historical data. Verify all reads of variant_stats_json['category'] use .get('category') or equivalent.
The diff only shows the write path in sentence_review_service.py:352-363 setting entry['category'] = morph['category'], which uses dict assignment (not an issue). However, any read-side code consuming variant_stats_json entries and accessing ['category'] directly (without .get('category')) would fail on historical rows — but no read-side code is visible in this diff to verify.
With the classifier in place, the next step is integrating it into the analyze_confusion() read path so it flows through to the API response. Before this change, analyze_confusion called decompose_surface() and used has_morph = decomposition is not None to decide whether the confusion was 'morphological'. The problem: for verb conjugations (no forms_json entry), decomposition is None, so these were silently classified as non-morphological even though the learner's confusion was entirely form-based.
The change adds a call to classify_surface_morphology() after the decomposition step, then broadens the has_morph logic: a verb-tense explanation (even without a decomposition) now counts as morphological. This matters for the confusion_type field — 'morphological', 'visual', 'both', or None — which controls how the UI frames the analysis.
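The effect of the broadened condition can be sketched as a pure function. The has_morph and has_visual expressions are taken from the diff; the branches after `if has_morph and has_visual:` are not visible in the hunk, so the mapping to 'morphological', 'visual' and None below is an assumption based on the documented values of confusion_type.

```python
from typing import Optional


def confusion_type(
    decomposition: Optional[dict],
    morphology: Optional[dict],
    similar_words: list,
) -> Optional[str]:
    """Derive the confusion type from the three analysis results."""
    # Quoted from the PR: an explanation counts even without a decomposition.
    has_morph = decomposition is not None or bool(
        morphology and morphology.get("explanation")
    )
    has_visual = len(similar_words) > 0

    # ASSUMPTION: branch layout inferred from the documented value set.
    if has_morph and has_visual:
        return "both"
    if has_morph:
        return "morphological"
    if has_visual:
        return "visual"
    return None
```

Under this sketch, a verb_present result with no decomposition and no similar words yields 'morphological', whereas a proclitic result (explanation is None) with no decomposition yields None.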
The morphology dict is also added to the returned result dict, and the Pydantic schema ConfusionAnalysisOut gains a corresponding SurfaceMorphology | None field. The new SurfaceMorphology schema is a proper class (not an inline dict), so Pydantic validates the field types and the OpenAPI spec documents the shape. Note that category is declared as a plain str: the six allowed values are listed only in a comment and are not enforced by validation.
The broadened has_morph means some confusions previously classified as None or visual will now become morphological or both. Verify that the frontend handles a confusion_type of morphological when decomposition is still None (only morphology.explanation is set) — the WordInfoCard might render an empty decomposition section.
By giving SurfaceMorphology its own Pydantic model rather than leaving it as a raw dict in the response, the OpenAPI schema now documents the exact shape. This is the correct pattern for any data structure that crosses the API boundary — it makes the frontend TypeScript types authoritative rather than inferred.
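The schema in this PR types category as str and lists the allowed values in a comment. If the value set were meant to be enforced, one option would be a Literal type. The sketch below uses a stdlib dataclass so it runs without dependencies; the names MorphCategory and SurfaceMorphologySketch are hypothetical, and in a Pydantic model the equivalent would be annotating the field with the Literal directly.

```python
from dataclasses import dataclass
from typing import Literal, Optional, get_args

# The six values from the schema comment in this PR.
MorphCategory = Literal[
    "verb_present",
    "verb_other",
    "derived_form",
    "proclitic",
    "enclitic",
    "inflection",
]


@dataclass(frozen=True)
class SurfaceMorphologySketch:
    """Illustrative stand-in for the SurfaceMorphology schema."""

    category: str
    form_key: Optional[str] = None
    explanation: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject any category outside the documented set at construction time.
        if self.category not in get_args(MorphCategory):
            raise ValueError(f"unknown morphology category: {self.category!r}")
```

Whether stricter validation is desirable here is a design choice: it would turn a classifier typo into a response error instead of passing it through to the client.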
@@ -815,8 +886,13 @@ def analyze_confusion(
     # 3. Prefix disambiguation hint
     prefix_hint = _build_prefix_hint(surface_bare, lemma_bare, lemma.root, decomposition)
 
+    # 4. Morphology classification — coarse category + a one-line surface->lemma
+    # explanation for the verb-tense cases the color-band decomposition can't show.
+    morphology = classify_surface_morphology(surface_bare, lemma)
+
-    # Determine confusion type
+    # Determine confusion type. A verb-tense explanation counts as morphological
+    # even when decompose_surface returned nothing (so the bridge data still flows).
-    has_morph = decomposition is not None
+    has_morph = decomposition is not None or bool(morphology and morphology.get("explanation"))
     has_visual = len(similar_words) > 0
 
     if has_morph and has_visual:
@@ -835,6 +911,7 @@ def analyze_confusion(
         "lemma_ar": lemma.lemma_ar,
         "gloss_en": lemma.gloss_en,
         "decomposition": decomposition,
+        "morphology": morphology,
         "similar_words": similar_words,
         "phonetic_similar": phonetic_similar,
         "prefix_hint": prefix_hint,
Replaces the two-line 'Determine confusion type' block. Previously has_morph was purely decomposition is not None. Now it also considers whether morphology carries an explanation — bridging the verb-conjugation gap where decompose_surface returns nothing but the morphology classifier still identifies the form type.
Adds morphology to the result dict returned by analyze_confusion. Previously this key was absent entirely — any caller inspecting result.get('morphology') would get None.
@@ -914,6 +914,12 @@ class PrefixHint(BaseModel):
     root_meaning: str | None = None
     hint_text: str
 
+class SurfaceMorphology(BaseModel):
+    category: str  # verb_present | verb_other | derived_form | proclitic | enclitic | inflection
+    form_key: str | None = None
+    explanation: str | None = None
+
+
 class ConfusionAnalysisOut(BaseModel):
     confusion_type: str | None  # "morphological" | "visual" | "both" | None
     surface_form: str
@@ -921,6 +927,7 @@ class ConfusionAnalysisOut(BaseModel):
     lemma_ar: str
     gloss_en: str | None = None
     decomposition: MorphDecomposition | None = None
+    morphology: SurfaceMorphology | None = None
     similar_words: list[SimilarWord] = []
     phonetic_similar: list[PhoneticSimilarWord] = []
     prefix_hint: PrefixHint | None = None
New Pydantic model for the morphology shape. Previously this data didn't exist in the schema layer at all — it was an ad-hoc dict inside confusion_service. Promoting it to a named schema makes the API contract explicit and enables TypeScript code-gen.
Adds the morphology field to ConfusionAnalysisOut. Without this field, the schema would silently drop the morphology key when serializing the result dict, since Pydantic ignores unexpected keys by default. Now it's an explicit optional field.
The submit_sentence_review() function in sentence_review_service.py maintains variant_stats_json on each UserLemmaKnowledge row — a per-surface-form accounting of how many times a word was seen, missed, or confused. Before this PR, when a surface was missed or confused, the code called _match_surface_form() to find if that surface was a known forms_json entry, and if so stored form_key and form_label for future querying.
_match_surface_form() was a 20-line private helper that performed essentially the same forms_json lookup that classify_surface_morphology() now includes as one step, but without the verb-tense heuristic, without clitics, and without the inflection catch-all. It was a partial solution to the same problem.
This PR deletes _match_surface_form() entirely and replaces the call site with classify_surface_morphology(). The replacement stores the richer category field on every entry (not just forms_json matches), and stores form_key/form_label when available (derived_form case). The result: per-form confusion data is now queryable at the category level — you can ask 'how many present-tense confusions does this user have for أَفْسَدَ?' without re-running the classifier on read.
Existing variant_stats_json entries in the database were written by the old _match_surface_form() logic — they may have form_key/form_label but no category. Any analytics query on category must handle NULL/missing for historical rows. This is expected but worth documenting in the migration runbook.
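A read-side consumer that tolerates historical rows could look like the sketch below. The function name count_by_category and the "unclassified" bucket are hypothetical; the entry shape (missed, confused, optional category) follows the write path shown in this PR.

```python
from collections import Counter


def count_by_category(variant_stats: dict) -> Counter:
    """Sum missed + confused counts per morphology category.

    Entries written before this PR have no "category" key, so they are
    grouped under a fallback bucket instead of raising KeyError.
    """
    counts: Counter = Counter()
    for entry in variant_stats.values():
        category = entry.get("category") or "unclassified"
        counts[category] += entry.get("missed", 0) + entry.get("confused", 0)
    return counts
```

Using .get with a fallback keeps old and new rows in the same report, which also makes the share of unclassified history visible rather than hidden.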
Storing category at write time is a meaningful data model improvement. Previously, understanding why a word was confused required re-running classifier logic on read. Now the category is a stored fact in the JSON column, so research queries like 'group confused surfaces by morphological category' become straightforward.
sentence_review_service now imports from confusion_service. This couples the two services — a change to classify_surface_morphology could affect the write path. The alternative (duplicating the logic or keeping _match_surface_form) was worse, but reviewers should be aware the classifier is now on a hot path (every sentence submission that involves a missed/confused word).
@@ -20,30 +20,11 @@
     SentenceWord,
     UserLemmaKnowledge,
 )
+from app.services.confusion_service import classify_surface_morphology
 from app.services.fsrs_service import STATE_MAP, parse_json_column, submit_review
 from app.services.grammar_service import record_grammar_exposure
 from app.services.sentence_validator import strip_diacritics, _is_function_word
 
-_FORM_METADATA_KEYS = {"gender", "verb_form", "pattern", "notes"}
-
-
-def _match_surface_form(surface_bare: str, lemma: Lemma | None) -> dict | None:
-    """Return the forms_json key matching a tracked surface, when known."""
-    if not lemma:
-        return None
-    forms = parse_json_column(lemma.forms_json)
-    if not isinstance(forms, dict):
-        return None
-    surface_no_al = surface_bare[2:] if surface_bare.startswith("ال") else surface_bare
-    for key, value in forms.items():
-        if key in _FORM_METADATA_KEYS or not isinstance(value, str) or not value:
-            continue
-        form_bare = strip_diacritics(value)
-        form_no_al = form_bare[2:] if form_bare.startswith("ال") else form_bare
-        if surface_bare in (form_bare, form_no_al) or surface_no_al in (form_bare, form_no_al):
-            return {"form_key": key, "form_label": key.replace("_", " ")}
-    return None
-
 
 def submit_sentence_review(
     db: Session,
@@ -371,9 +352,12 @@ def submit_sentence_review(
                 entry["missed"] = entry.get("missed", 0) + 1
             elif is_confused:
                 entry["confused"] = entry.get("confused", 0) + 1
-
+            morph = classify_surface_morphology(surface_bare, canonical_lemma_obj)
-            if
+            if morph:
+                entry["category"] = morph["category"]
-
+                if morph.get("form_key"):
+                    entry["form_key"] = morph["form_key"]
+                    entry["form_label"] = morph["form_key"].replace("_", " ")
             vstats[surface_bare] = entry
             knowledge.variant_stats_json = vstats
 
Adds an import of classify_surface_morphology from confusion_service. No import is removed, because the deleted _match_surface_form was defined locally. This creates a new inter-service dependency: sentence_review_service now imports from confusion_service.
Replaces the 3-line _match_surface_form call with a call to classify_surface_morphology. The old code stored form_key/form_label only when forms_json matched. The new code stores category whenever the classifier returns a result (so every classified inflected form is recorded) and stores form_key/form_label only when form_key is set, which is the derived_form case. This is a superset: existing form_key/form_label storage is preserved, and verb conjugation entries now gain a category field.
The confusion_help router endpoint already logs interaction telemetry when a user opens the help panel for a yellow mark. This PR adds morph_category to that log entry, drawn from (result.get('morphology') or {}).get('category'). The value is the category string from the new classifier, or None if the surface is the dictionary form or trivially definite.
This is a small change (one line) but it closes a data loop: the original confusion-capture analysis that motivated this PR was done by parsing interaction logs. Adding morph_category to future logs means subsequent analyses will be able to directly group 'what kinds of inflected forms are causing confusion help to be opened' without re-running the classifier retroactively. The .get('morphology') or {} guard is defensive — it handles the case where morphology is None (the classifier returned None) without raising an AttributeError.
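The difference between the `or {}` guard and a plain default argument is easy to miss, so here is a short sketch. Both helper names are hypothetical; the first body is the expression used in the PR.

```python
from typing import Optional


def morph_category(result: dict) -> Optional[str]:
    """Expression from the PR: safe for a missing key and for a None value."""
    return (result.get("morphology") or {}).get("category")


def morph_category_naive(result: dict) -> Optional[str]:
    """Looks equivalent, but the default only applies when the key is absent.

    analyze_confusion always sets "morphology" (possibly to None), so this
    version would raise AttributeError on exactly the common case.
    """
    return result.get("morphology", {}).get("category")
```

Because analyze_confusion now always includes the morphology key, the None-valued case is the one that matters in practice.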
The PR was motivated by an analysis of confusion captures. Logging morph_category means the next analysis can directly measure whether the bridge is reducing confusion rates per morphological category — without needing to re-run the classifier on historical interaction data.
@@ -985,6 +985,7 @@ def confusion_help(
         ],
         phonetic_lemma_ids=[w.get("lemma_id") for w in result.get("phonetic_similar", [])],
         has_decomposition=result.get("decomposition") is not None,
+        morph_category=(result.get("morphology") or {}).get("category"),
     )
 
     return result
Adds morph_category to the confusion_help interaction log call. Previously absent — this field was not logged at all. The (result.get('morphology') or {}).get('category') pattern handles both the case where the key is missing and where its value is None.
The final leg of the pipeline is the frontend. The ConfusionAnalysis interface in frontend/lib/types.ts mirrors the Pydantic ConfusionAnalysisOut schema — it's the TypeScript type that WordInfoCard receives as confusionData. Before this PR, the interface had no morphology field, so even if the backend sent it, TypeScript would reject any access to it.
The SurfaceMorphology interface types category as a string-literal union (not a bare string), so frontend code that branches on category is checked at compile time against the six known values. ConfusionAnalysis gains morphology: SurfaceMorphology | null.
In WordInfoCard.tsx, the RevealedView component already reads decomp, similarWords, etc. from confusionData. This PR adds morphExplanation = confusionData?.morphology?.explanation. The UI renders a new morphBridge view only when morphExplanation is non-null: a faintly blue-tinted pill with a git-compare icon and the explanation text. Importantly, this widget appears between the color-band decomposition and the forms strip, so it's visible in exactly the gap where the bands were previously silent.
The morphBridge renders after the decomposition bands and before the FormsStrip. Verify this placement makes sense when BOTH decomposition AND morphology.explanation are present — can a surface form simultaneously have color-band decomposition AND a verb-tense explanation? The classifier logic suggests not (explanation is only set for verb cases where form_key is None, so decomposition would be None too), but it's worth confirming the two never render together.
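Typing the category as a TypeScript string literal union (rather than string) is the right pattern here. A future refactor that renames or removes a category in the union will produce compile errors wherever the frontend references the old value, which makes the schema boundary explicit. The union does not track the backend automatically, so it has to be kept in sync with the Pydantic schema.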
@@ -1101,6 +1101,18 @@ export interface PrefixHint {
   hint_text: string;
 }
 
+export interface SurfaceMorphology {
+  category:
+    | "verb_present"
+    | "verb_other"
+    | "derived_form"
+    | "proclitic"
+    | "enclitic"
+    | "inflection";
+  form_key: string | null;
+  explanation: string | null;
+}
+
 export interface ConfusionAnalysis {
   confusion_type: "morphological" | "visual" | "both" | null;
   surface_form: string;
@@ -1108,6 +1120,7 @@ export interface ConfusionAnalysis {
   lemma_ar: string;
   gloss_en: string | null;
   decomposition: MorphDecomposition | null;
+  morphology: SurfaceMorphology | null;
   similar_words: SimilarWordItem[];
   phonetic_similar: PhoneticSimilarItem[];
   prefix_hint: PrefixHint | null;
New SurfaceMorphology interface with a union-typed category field. Previously this type did not exist, so the morphology data would have been any or ignored. The union makes TypeScript reject frontend code that uses a category outside the six listed values; it does not by itself detect a category added only on the backend.
Adds morphology to ConfusionAnalysis. Previously absent — accessing confusionData?.morphology would have been a TypeScript error (or typed as any if the interface was loose).
@@ -337,6 +337,7 @@ function RevealedView({
   });
 
   const decomp = confusionData?.decomposition;
+  const morphExplanation = confusionData?.morphology?.explanation;
   const similarWords = confusionData?.similar_words;
   const phoneticSimilar = confusionData?.phonetic_similar;
   const prefixHint = confusionData?.prefix_hint;
@@ -563,6 +564,15 @@ function RevealedView({
             );
           })()}
 
+          {/* Morphology bridge — one-line surface->lemma link for verb-tense forms the
+              color bands can't decompose (e.g. "present-tense form of «to spoil»"). */}
+          {morphExplanation && (
+            <View style={styles.morphBridge}>
+              <Ionicons name="git-compare-outline" size={13} color={colors.accent} />
+              <Text style={styles.morphBridgeText}>{morphExplanation}</Text>
+            </View>
+          )}
+
           {/* Forms strip */}
           <FormsStrip
             pos={result.pos}
@@ -1223,6 +1233,21 @@ const styles = StyleSheet.create({
     fontSize: 11,
     color: colors.textSecondary,
   },
+  morphBridge: {
+    flexDirection: "row",
+    alignItems: "center",
+    gap: 6,
+    backgroundColor: "rgba(100, 140, 180, 0.06)",
+    borderRadius: 10,
+    paddingHorizontal: 12,
+    paddingVertical: 8,
+  },
+  morphBridgeText: {
+    flex: 1,
+    fontSize: 13,
+    color: colors.text,
+    fontFamily: fontFamily.translitRegular,
+  },
 });
 
 const cfStyles = StyleSheet.create({
Extracts morphExplanation from confusionData alongside the existing decomp/similarWords destructuring. Previously this variable didn't exist. The optional-chaining pattern matches the nullable field type.
New conditional morphBridge widget inserted between the decomposition color bands and the FormsStrip. Previously this area showed nothing for verb conjugations — the learner who failed to recognize يُفْسِدُ got no bridge to أَفْسَدَ. Now they see 'present-tense form of "to spoil" (أَفْسَدَ)'. The git-compare-outline icon is a deliberate choice — it visually suggests a transformation/mapping relationship.
New StyleSheet entries for the morphBridge widget. The rgba(100, 140, 180, 0.06) background is a very subtle blue tint — distinct from the decomposition bands but not competing for attention. Using fontFamily.translitRegular is appropriate since the explanation text is Latin-script transliteration context.
The test suite grows in two places. TestClassifySurfaceMorphology in test_confusion_service.py tests the new function in isolation using a _mk_lemma MagicMock factory — this avoids needing a database and keeps the tests fast. The 11 cases cover: identity (returns None), bare definite (suppressed), present-tense detection, present-tense after a proclitic (li + present), the critical Form-IV false-positive guard, past-tense conjugation (verb_other), derived form from forms_json, broken plural from forms_json, proclitic noun, unknown inflection catch-all, and None-safety.
TestVariantStatsMorphology in test_sentence_review.py tests the write path end-to-end: it seeds a real verb lemma with a present-tense surface in a sentence, runs submit_sentence_review with that verb as confused, and asserts that variant_stats_json on the resulting UserLemmaKnowledge record contains category == 'verb_present'. A second test covers the derived_form + form_key case with a plural. These integration tests give confidence that the classifier is correctly called (not just correct in isolation) and that the database serialization round-trip preserves the data.
test_confused_present_verb_records_category seeds the sentence with surface_form='يُفْسِدُ' (diacritized) but the variant_stats_json key is looked up as 'يفسد' (bare). Verify that the surface_bare stripping in submit_sentence_review handles this correctly — this is an implicit dependency on strip_diacritics being called before the variant_stats lookup.
Using MagicMock for the unit tests and real DB fixtures for the integration tests is a good layered strategy. The unit tests run fast and cover combinatorial cases; the integration tests prove the wiring is correct. Reviewers should check that the MagicMock's getattr(lemma, 'forms_json', None) path is exercised — it's used inside classify_surface_morphology to safely access forms_json on mock objects.
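One detail worth keeping in mind when reviewing the getattr path: on a bare MagicMock, unset attributes are auto-created, so a getattr default is never used. The sketch below shows why the _mk_lemma factory needs its explicit m.forms_json = forms assignment; the helper names are hypothetical.

```python
from unittest.mock import MagicMock


def forms_json_of(lemma) -> object:
    """Mirror the defensive access pattern described in the review."""
    return getattr(lemma, "forms_json", None)


def make_bare_mock() -> MagicMock:
    """A mock with no attributes configured."""
    return MagicMock()


def make_lemma_mock(forms=None) -> MagicMock:
    """A mock with forms_json set explicitly, as _mk_lemma does."""
    m = MagicMock()
    m.forms_json = forms
    return m
```

forms_json_of(make_bare_mock()) returns an auto-created child mock rather than None, whereas forms_json_of(make_lemma_mock()) returns None as intended. A test that forgot the explicit assignment would feed a MagicMock into the forms_json parsing step.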
@@ -509,3 +511,80 @@ def test_no_phonetic_for_visual_match(self):
             db, 1, "كلب", {10}, candidates=[(word, "known")],
         )
         assert len(results) == 0
+
+
+def _mk_lemma(lemma_ar, bare, pos, gloss="x", forms=None):
+    m = MagicMock()
+    m.lemma_ar = lemma_ar
+    m.lemma_ar_bare = bare
+    m.pos = pos
+    m.gloss_en = gloss
+    m.forms_json = forms
+    return m
+
+
+class TestClassifySurfaceMorphology:
+    def test_identity_returns_none(self):
+        lem = _mk_lemma("سَيّارة", "سيارة", "noun", "car")
+        assert classify_surface_morphology("سيارة", lem) is None
+
+    def test_definite_only_suppressed(self):
+        # ال + stem == lemma is trivial; no bridge.
+        lem = _mk_lemma("سَيّارة", "سيارة", "noun", "car")
+        assert classify_surface_morphology("السيارة", lem) is None
+
+    def test_verb_present_not_in_forms(self):
+        # يفسد is the present of أفسد; forms_json lacks it -> still classified.
+        lem = _mk_lemma("أَفْسَدَ", "افسد", "verb", "to spoil")
+        out = classify_surface_morphology("يفسد", lem)
+        assert out["category"] == "verb_present"
+        assert "present-tense" in out["explanation"]
+        assert "to spoil" in out["explanation"]
+
+    def test_verb_present_after_proclitic(self):
+        # لِيُعْطِيَ -> li + present of أعطى
+        lem = _mk_lemma("أَعْطَى", "اعطى", "verb", "to give")
+        out = classify_surface_morphology("ليعطي", lem)
+        assert out["category"] == "verb_present"
+
+    def test_form_iv_past_is_identity_not_present(self):
+        # The lemma itself (أفسد) must never classify as its own present tense.
+        lem = _mk_lemma("أَفْسَدَ", "افسد", "verb", "to spoil")
+        assert classify_surface_morphology("افسد", lem) is None  # identity
+
+    def test_verb_other_conjugation(self):
+        lem = _mk_lemma("وَقَعَ", "وقع", "verb", "to happen")
+        out = classify_surface_morphology("وقعت", lem)  # past 3fs
+        assert out["category"] == "verb_other"
+        assert out["explanation"] is not None
+
+    def test_derived_form_matches_forms_json(self):
+        lem = _mk_lemma("خَطَّط", "خطط", "verb", "to plan", {"masdar": "تَخْطِيط"})
+        out = classify_surface_morphology("التخطيط", lem)
+        assert out["category"] == "derived_form"
+        assert out["form_key"] == "masdar"
+        # bands render this case, so no redundant explanation line
+        assert out["explanation"] is None
+
+    def test_plural_derived_form(self):
+        lem = _mk_lemma("وَرَقَة", "ورقة", "noun", "paper", {"plural": "أَوْرَاق"})
+        # surface bare keeps hamza (strip_diacritics does not normalize it), matching forms_json
+        out = classify_surface_morphology(strip_diacritics("أَوْرَاق"), lem)
+        assert out["category"] == "derived_form"
+        assert out["form_key"] == "plural"
+
+    def test_proclitic_noun(self):
+        lem = _mk_lemma("نُور", "نور", "noun", "light")
+        out = classify_surface_morphology("بنور", lem)  # bi- + light
+        assert out["category"] == "proclitic"
+        assert out["explanation"] is None
+
+    def test_inflection_unknown(self):
+        # Surface differs, decompose can't explain, not a verb -> bridge-less category.
+        lem = _mk_lemma("أَبْيَض", "ابيض", "adj", "white")
+        out = classify_surface_morphology("بيضاء", lem)
+        assert out["category"] == "inflection"
+        assert out["explanation"] is None
+
+    def test_none_lemma_safe(self):
+        assert classify_surface_morphology("xyz", None) is None
New _mk_lemma helper — creates a MagicMock Lemma with controllable fields. Previously absent; the existing test class used real DB fixtures. This lightweight factory enables fast in-memory unit tests for the new classifier.
TestClassifySurfaceMorphology: 11 test cases covering the full decision tree of the new classifier. The test_form_iv_past_is_identity_not_present case is particularly important: it is the edge case the PR description calls out explicitly, and it validates the guard that keeps a lemma which itself begins with a present-tense prefix letter (such as "أَفْسَدَ") from being classified as present tense.
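The Form-IV guard can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration (the names `PRESENT_PREFIXES`, `strip_marks`, and `looks_present`, and the exact rule); the real `classify_surface_morphology` may implement the prefix heuristic differently.

```python
import unicodedata

# Assumed set of present-tense prefix letters (ya, ta, nun, hamza-on-alif).
PRESENT_PREFIXES = ("ي", "ت", "ن", "أ")


def strip_marks(text):
    """Drop combining diacritics; precomposed hamza carriers are kept."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def looks_present(surface, lemma_bare):
    """Heuristic: does `surface` look like a present-tense form of the lemma?"""
    surface_bare = strip_marks(surface)
    lemma_bare = strip_marks(lemma_bare)
    if surface_bare == lemma_bare:
        return False  # the dictionary form itself is never "present tense"
    if not surface_bare.startswith(PRESENT_PREFIXES):
        return False
    # Guard: when the lemma starts with the same letter, that letter belongs
    # to the lemma (e.g. Form IV past), so it is not a conjugation prefix.
    return not lemma_bare.startswith(surface_bare[0])


print(looks_present("يُفْسِدُ", "أفسد"))  # present tense of the Form-IV verb
print(looks_present("أَفْسَدَ", "أفسد"))  # the dictionary form itself
```

A rule this simple trades recall for safety: a first-person present form that is spelled like the lemma once diacritics are stripped is treated as the dictionary form rather than as present tense.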
@@ -345,6 +345,78 @@ def capture_log(**kwargs):
         assert events[-1]["confusion_candidate_lemma_ids"] == {2: [1]}


+class TestVariantStatsMorphology:
+    """The confused/missed write path classifies the surface form and stores
+    category + form_key on the canonical ULK's variant_stats_json (PR: morphology bridge)."""
+
+    def _seed_verb_sentence(self, db, surface):
+        # primary noun + a verb whose sentence surface is an inflected form
+        _seed_word(db, 1, "كتاب", "book")
+        verb = Lemma(
+            lemma_id=2, lemma_ar="أَفْسَدَ", lemma_ar_bare="أفسد",
+            pos="verb", gloss_en="to spoil",
+        )
+        db.add(verb)
+        db.flush()
+        db.add(UserLemmaKnowledge(
+            lemma_id=2, knowledge_state="learning", fsrs_card_json=_make_card(),
+            introduced_at=datetime.now(timezone.utc) - timedelta(days=10),
+            last_reviewed=datetime.now(timezone.utc) - timedelta(hours=1),
+            source="study",
+        ))
+        sent = Sentence(id=1, arabic_text="x", english_translation="x",
+                        target_lemma_id=1, mappings_verified_at=datetime.now(timezone.utc))
+        db.add(sent)
+        db.flush()
+        db.add(SentenceWord(sentence_id=1, position=0, surface_form=surface, lemma_id=2))
+        db.add(SentenceWord(sentence_id=1, position=1, surface_form="كتاب", lemma_id=1))
+        db.flush()
+        db.commit()
+
+    def test_confused_present_verb_records_category(self, db_session):
+        self._seed_verb_sentence(db_session, "يُفْسِدُ") # present tense of أفسد
+        submit_sentence_review(
+            db_session, sentence_id=1, primary_lemma_id=1,
+            comprehension_signal="partial", confused_lemma_ids=[2], session_id="t",
+        )
+        ulk = db_session.query(UserLemmaKnowledge).filter_by(lemma_id=2).first()
+        entry = ulk.variant_stats_json["يفسد"]
+        assert entry["confused"] == 1
+        assert entry["category"] == "verb_present"
+
+    def test_derived_form_records_form_key(self, db_session):
+        # plural in forms_json -> derived_form with form_key
+        _seed_word(db_session, 1, "كتاب", "book")
+        noun = Lemma(
+            lemma_id=2, lemma_ar="وَرَقَة", lemma_ar_bare="ورقة", pos="noun",
+            gloss_en="paper", forms_json={"plural": "أَوْرَاق"},
+        )
+        db_session.add(noun)
+        db_session.flush()
+        db_session.add(UserLemmaKnowledge(
+            lemma_id=2, knowledge_state="learning", fsrs_card_json=_make_card(),
+            introduced_at=datetime.now(timezone.utc) - timedelta(days=10),
+            source="study",
+        ))
+        sent = Sentence(id=1, arabic_text="x", english_translation="x",
+                        target_lemma_id=1, mappings_verified_at=datetime.now(timezone.utc))
+        db_session.add(sent)
+        db_session.flush()
+        db_session.add(SentenceWord(sentence_id=1, position=0, surface_form="أَوْرَاق", lemma_id=2))
+        db_session.add(SentenceWord(sentence_id=1, position=1, surface_form="كتاب", lemma_id=1))
+        db_session.flush()
+        db_session.commit()
+
+        submit_sentence_review(
+            db_session, sentence_id=1, primary_lemma_id=1,
+            comprehension_signal="partial", confused_lemma_ids=[2], session_id="t",
+        )
+        ulk = db_session.query(UserLemmaKnowledge).filter_by(lemma_id=2).first()
+        entry = ulk.variant_stats_json["أوراق"]
+        assert entry["category"] == "derived_form"
+        assert entry["form_key"] == "plural"
+
+
 class TestNoIdea:
     def test_all_words_get_rating_1(self, db_session):
         _seed_word(db_session, 1, "كتاب", "book")
TestVariantStatsMorphology: two integration tests for the write path. They verify that classify_surface_morphology is wired in correctly (not just that it returns the right value in isolation) and that variant_stats_json is persisted with the category and, for derived forms, the form_key. Previously the _match_surface_form logic had no tests at this integration level.
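What these tests assert can be pictured with a minimal in-memory sketch of the per-surface update. The function name `record_confusion` and the initial counter layout are assumptions; the keys `confused`, `category`, and `form_key` are the ones the tests read back, and `seen`/`missed` follow the schema docs. The actual write path in `submit_sentence_review` is not shown in this section and may differ.

```python
def record_confusion(variant_stats, surface_bare, morphology):
    """Hypothetical helper: bump the confused count and attach morphology."""
    entry = variant_stats.setdefault(
        surface_bare, {"seen": 0, "missed": 0, "confused": 0}
    )
    entry["confused"] += 1
    if morphology:
        entry["category"] = morphology["category"]
        # Only derived forms carry a form_key; skip it when absent or None.
        if morphology.get("form_key"):
            entry["form_key"] = morphology["form_key"]
    return entry


stats = {}
record_confusion(stats, "يفسد", {"category": "verb_present", "form_key": None})
record_confusion(stats, "أوراق", {"category": "derived_form", "form_key": "plural"})
```

Storing the classification at write time is what lets later analysis group confusions by category without re-running the decomposition.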
These files were changed in the PR but not featured in the walkthrough above.
@@ -20,7 +20,7 @@ Full endpoint list. See `backend/app/routers/` for implementation.
 | POST | `/api/review/submit-sentence` | Submit sentence review. Schedulable content lemmas get FSRS/acquisition credit; function words and proper-name lemmas are lookup-only and ignored for scheduling/review credit. Accepts `missed_lemma_ids`, `confused_lemma_ids`, optional `confusion_candidate_lemma_ids` telemetry from the yellow-tap help panel, and optional `confusion_captures` (array of `{failed_lemma_id, capture_method: 'suggested_pick' \| 'free_text', confused_with_lemma_id?, confused_with_text?, candidates_shown}`) — explicit user-picked confusions persisted to the `confusion_captures` table for later analysis. Optional `parent_card_type` (`"passage"`/`"sentence"`/`"wrapup"`) tags the review with its parent card so analytics can split passage-internal reviews from standalone ones. |
 | POST | `/api/review/undo-sentence` | Undo a sentence review — restores pre-review FSRS state, deletes logs |
 | GET | `/api/review/word-lookup/{lemma_id}` | Word detail + root family + forms_translit (computed on-the-fly if not stored) + pattern_examples + etymology_json for review lookup |
-| GET | `/api/review/confusion-help/{lemma_id}?surface_form=...` | Confusion analysis for "did not recognize" words — morphological decomposition (clitics/forms) + form-aware visual similarity (surface/form edit distance, rasm, short-verb ranking) + phonetic similarity |
+| GET | `/api/review/confusion-help/{lemma_id}?surface_form=...` | Confusion analysis for "did not recognize" words — morphological decomposition (clitics/forms) + `morphology` `{category, form_key, explanation}` surface→lemma bridge (incl. verb-tense forms the band decomposition can't show) + form-aware visual similarity (surface/form edit distance, rasm, short-verb ranking) + phonetic similarity |
 | POST | `/api/review/sync` | Bulk sync offline reviews |
 | POST | `/api/review/reintro-result` | Submit re-introduction quiz result |
 | POST | `/api/review/experiment-intro-ack` | Acknowledge experiment intro card was shown (sets `experiment_intro_shown_at` for dedup + rescue cooldown) |
@@ -60,7 +60,7 @@ All services in `backend/app/services/`.
 - `morphology.py` — CAMeL Tools analyzer. Hamza normalized at comparison time only (preserved in storage). Falls back to stub if not installed.
 - `transliteration.py` — Deterministic Arabic→ALA-LC romanization from diacritized text. Handles long vowels, shadda, hamza carriers, alif madda/wasla, sun letter assimilation, tāʾ marbūṭa, nisba ending. **Uthmani diacritics**: recognizes U+06E1 (small high dotless head of khaa / Uthmani sukun), U+06DF (small high rounded zero), U+06E2 (small high meem) so Quranic text transliterates correctly. **Long-vowel inference for partially-vocalized text** (fixed 2026-05-04): bare ya/waw following a vowelless consonant infers long ī/ū (e.g. `حَديقة` → `ḥadīqa`, `إيجار` → `ījār`), mirroring the existing bare-alif → long ā logic. Word-initial hamza-carriers (إ ا أ ٱ) handle long ī/ū the same way. **Consonant-glide disambiguation**: a ya/waw is treated as a consonant — not a long-vowel marker — when (a) it carries its own short vowel (e.g. `سِيَاسَة` → `siyāsa`, not `sīāsa`) or (b) it's immediately followed by alif/maqsura (e.g. `حَالِياً` → `ḥāliyā`, not `ḥālīā`), since Arabic phonotactics disallow two adjacent long vowels. `transliterate_lemma()` for dictionary form (strips tanwīn + case vowels). `transliterate_forms()` iterates forms_json values and produces parallel ALA-LC transliterations (skips metadata keys like "gender", "verb_form").
 - `variant_detection.py` — Three-layer variant detection: (1) CAMeL candidates with root_id validation (rejects different-root pairs), (2) Gemini Flash LLM confirmation with VariantDecision cache, (3) display fix in sentence_selector uses original lemma_id. Used by ALL import paths. Graceful fallback if LLM unavailable.
-- `confusion_service.py` — Rule-based confusion analysis for "did not recognize" (yellow) words. Four analysis types: (1) **morphological** — decomposes surface form into prefix clitics + stem + suffix clitics using PROCLITICS/ENCLITICS lists, matches stem against lemma and forms_json entries; (2) **visual/form-aware** — finds similar-looking words in user's vocabulary (including encountered and suspended leech words) by comparing the target dictionary form and exposed surface form against candidate dictionary forms and `forms_json` entries, then ranks by edit distance, rasm skeleton distance, same-root signal, short-verb priority, **adjacent transposition** (metathesis, e.g. جرح↔جحر — same letters reordered, which plain Levenshtein scores as distance 2; reason "letters swapped"), and **shared rime** (same final letters, different onset — e.g. نام/صام, حرث/ورث; reason "rhymes" — pulls the rhyme cohort above equidistant dot-variants so the user's near-miss isn't truncated; added 2026-06-01 after free-text capture analysis showed these confusions were in vocab but ranked out of the list). Rasm groups map letters differing only by dots to same skeleton (ب/ت/ث/ن → same base). The response includes `match_reason`, `matched_form`, and matched form key for diagnostics; (3) **phonetic** — finds words that sound similar to learners but look different via `PHONETIC_MAP` (emphatic→plain: ص→س, ض→د, ط→ت, ظ→ذ; pharyngeal: ح→ه, ع→ا; interdental: ث→س, ذ→ز; uvular: غ→خ). Catches confusions like سبع↔صباح. Only surfaces words NOT already in visual results; (4) **prefix disambiguation** — when a word starts with و/ف/ب/ل/ك, hints whether it's a proclitic prefix or part of the root (uses `lemma.root` relationship). All rule-based, no LLM. Endpoint: `GET /api/review/confusion-help/{lemma_id}?surface_form=...`.
+- `confusion_service.py` — Rule-based confusion analysis for "did not recognize" (yellow) words. Four analysis types: (1) **morphological** — decomposes surface form into prefix clitics + stem + suffix clitics using PROCLITICS/ENCLITICS lists, matches stem against lemma and forms_json entries; (2) **visual/form-aware** — finds similar-looking words in user's vocabulary (including encountered and suspended leech words) by comparing the target dictionary form and exposed surface form against candidate dictionary forms and `forms_json` entries, then ranks by edit distance, rasm skeleton distance, same-root signal, short-verb priority, **adjacent transposition** (metathesis, e.g. جرح↔جحر — same letters reordered, which plain Levenshtein scores as distance 2; reason "letters swapped"), and **shared rime** (same final letters, different onset — e.g. نام/صام, حرث/ورث; reason "rhymes" — pulls the rhyme cohort above equidistant dot-variants so the user's near-miss isn't truncated; added 2026-06-01 after free-text capture analysis showed these confusions were in vocab but ranked out of the list). Rasm groups map letters differing only by dots to same skeleton (ب/ت/ث/ن → same base). The response includes `match_reason`, `matched_form`, and matched form key for diagnostics; (3) **phonetic** — finds words that sound similar to learners but look different via `PHONETIC_MAP` (emphatic→plain: ص→س, ض→د, ط→ت, ظ→ذ; pharyngeal: ح→ه, ع→ا; interdental: ث→س, ذ→ز; uvular: غ→خ). Catches confusions like سبع↔صباح. Only surfaces words NOT already in visual results; (4) **prefix disambiguation** — when a word starts with و/ف/ب/ل/ك, hints whether it's a proclitic prefix or part of the root (uses `lemma.root` relationship). All rule-based, no LLM. Endpoint: `GET /api/review/confusion-help/{lemma_id}?surface_form=...`. **`classify_surface_morphology(surface_bare, lemma)`** (2026-06-03) is the shared classifier behind the morphology bridge: returns `{category, form_key, explanation}` (None for the dictionary form or a bare definite article). `category` ∈ verb_present/verb_other/derived_form/proclitic/enclitic/inflection. `explanation` is a one-line surface→lemma bridge ("present-tense form of «to spoil»") populated only for the verb-tense cases `decompose_surface` can't render as color bands — closing the ~55% of inflected confusions (esp. conjugations absent from `forms_json`) the bands missed. `analyze_confusion` returns it under `morphology`, the `submit-sentence` write path stores `category`/`form_key` on `variant_stats_json`, and `WordInfoCard` renders the `explanation` line on a yellow mark.
 - `grammar_service.py` — 49 features, 8 tiers. Comfort score: 60% log-exposure + 40% accuracy, decayed by recency.
 - `grammar_tagger.py` — LLM-based grammar feature tagging.
 - `grammar_lesson_service.py` — LLM-generated grammar lessons, cached in DB.
@@ -7,7 +7,7 @@ SQLAlchemy models in `backend/app/models.py`. Pydantic schemas in `backend/app/s
 - `pattern_info` — Morphological pattern metadata: wazn (PK, e.g. "fa'il"), wazn_meaning, enrichment_json (LLM-generated: explanation, how_to_recognize, semantic_fields, example_derivations, register_notes, fun_facts, related_patterns)
 - `lemmas` — Dictionary forms: root FK, pos, gloss, frequency_rank, cefr_level, grammar_features_json, forms_json, example_ar/en, transliteration, audio_url, canonical_lemma_id (variant FK), source_story_id, word_category (NULL=standard, proper_name, onomatopoeia), thematic_domain, etymology_json, memory_hooks_json, wazn (morphological pattern e.g. "fa'il", "maf'ul", "form_2", indexed), wazn_meaning (human-readable pattern description), forms_translit_json (ALA-LC transliteration per forms_json key, e.g. {"present": "yaktub", "plural": "kutub"}), gates_completed_at (timestamp set by `run_quality_gates()` — NULL means ungated, session builder rejects), decomposition_note (nullable JSON audit metadata from lemma-decomposition audit: `{mle_misanalysis: bool, reason, source_artifact, tagged_at, phase}` — stamped by Step 4b+ on orphan compounds whose CAMeL MLE decomposition proved wrong; query: `json_extract(decomposition_note, '$.mle_misanalysis') = 1`)
 - `frequency_core_entries` — Weighted high-frequency curriculum ranks. `core_rank` is a continuous teachable-content rank; `lemma_id` links to an Alif lemma when mapped and stays NULL for honest missing-from-DB gaps. Stores source evidence (`camel_rank/count`, `buckwalter_rank`, `artenten_rank`, `kelly_rank/cefr`, `hindawi_rank`, `news_rank`, `islamic_rank`, `broad_source_count`, `confidence_tier`, `gap_status`, `source_flags_json`) plus display/gloss fields for stats.
-- `user_lemma_knowledge` — Per-lemma SRS state: knowledge_state (encountered/acquiring/new/learning/known/lapsed/suspended), fsrs_card_json, times_seen, times_correct, times_heard (passive listening count, incremented by mark-story-heard), total_encounters, source (study/duolingo/textbook_scan/book/story_import/frequency_core/auto_intro/collateral/leech_reintro — preserved through acquisition, not overwritten), variant_stats_json (diagnostic per-surface seen/missed/confused counts;
+- `user_lemma_knowledge` — Per-lemma SRS state: knowledge_state (encountered/acquiring/new/learning/known/lapsed/suspended), fsrs_card_json, times_seen, times_correct, times_heard (passive listening count, incremented by mark-story-heard), total_encounters, source (study/duolingo/textbook_scan/book/story_import/frequency_core/auto_intro/collateral/leech_reintro — preserved through acquisition, not overwritten), variant_stats_json (diagnostic per-surface seen/missed/confused counts; each entry also stores a `category` — verb_present/verb_other/derived_form/proclitic/enclitic/inflection, from `confusion_service.classify_surface_morphology` — plus `form_key`/`form_label` when the surface matches a `forms_json` form; lets per-form confusion be queried instead of re-decomposed; never an independent scheduling unit), acquisition_box (1/2/3), acquisition_next_due, entered_acquiring_at (when word entered Leitner pipeline), graduated_at, leech_suspended_at, leech_count, experiment_group (nullable, `intro_ab_card` for standard card-first acquisition; legacy `textbook_preserve_intro` rows may exist but no longer generate cards), experiment_intro_shown_at (nullable, timestamp when intro card was shown — prevents re-showing)

 ## Sentences & Reviews
 - `sentences` — Generated/imported: arabic_text (fully diacritized — all pipelines store the voweled form; callers needing plain text strip diacritics at query time), english_translation, transliteration, target_lemma_id, story_id (FK to stories, for book-extracted sentences), source (llm/book/corpus/michel_thomas/tatoeba/manual), times_shown, last_reading_shown_at/last_listening_shown_at, last_reading_comprehension/last_listening_comprehension, is_active, max_word_count, created_at, page_number (for book sentences), mappings_verified_at (nullable DateTime — NULL=never verified, timestamp=when last verified by batch LLM check)