[#44] Normalize refs table to deduplicate property strings by SkowronskiAndrew · Pull Request #80 · Unity-Technologies/UnityDataTools · GitHub
Skip to content

[#44] Normalize refs table to deduplicate property strings#80

Merged
SkowronskiAndrew merged 4 commits into
mainfrom
issue44-reduce-ref-table-size
Jul 3, 2026
Merged

[#44] Normalize refs table to deduplicate property strings#80
SkowronskiAndrew merged 4 commits into
mainfrom
issue44-reduce-ref-table-size

Conversation

@SkowronskiAndrew

@SkowronskiAndrew SkowronskiAndrew commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes #44.

On large builds the refs table dominates the analyze database — it repeated the property_path and property_type strings on every one of its (often millions of) rows. This normalizes those strings into two lookup tables so refs stores only integer ids, and exposes a refs_view that rejoins them so existing string-based queries barely change.

On a large customer project (5.8M refs rows) this reduces the database from 515 MB to 351 MB (-32%) with negligible impact on analyze time.

⚠️ Breaking schema change — how to adapt existing queries

This changes the refs table schema, so any tooling or SQL that reads refs.property_path / refs.property_type directly needs a small update. Those two columns are no longer text — they now hold integer ids into the new property_names / property_types tables.

Recommended fix: query refs_view instead of refs. The view has the same column names and the same text values the old refs table had (object, referenced_object, property_path, property_type), so in most cases it's a one-word change:

-- before
SELECT object, referenced_object, property_path, property_type
FROM refs
WHERE property_type = 'Texture2D';

-- after
SELECT object, referenced_object, property_path, property_type
FROM refs_view                    -- was: refs
WHERE property_type = 'Texture2D';

If you need to join refs to other tables yourself (e.g. for performance), resolve the strings through the lookup tables:

SELECT r.object, r.referenced_object, pn.name AS property_path, pt.name AS property_type
FROM refs r
JOIN property_names pn ON r.property_path = pn.id
JOIN property_types pt ON r.property_type = pt.id;

Queries that only use object / referenced_object (e.g. dependency traversal) are unaffected.

Databases now carry a schema version: PRAGMA user_version returns 1 for this format and 0 for databases produced by older builds. Tools can check it to fail fast (as find-refs now does) rather than hitting confusing errors. There is no in-place migration — regenerate the database by re-running analyze.

Changes

Schema (Analyzer/Resources/)

  • New property_names and property_types lookup tables (id, name). refs.property_path / property_type change from TEXT to INTEGER ids into those tables.
  • New refs_view reproduces the pre-change shape (object, referenced_object, property_path, property_type as text) so consumers read the strings without joining by hand.
  • Added PRAGMA user_version = 1 — the first formal schema version (older databases report 0).
  • view_material_shader_refs, view_material_texture_refs and script_object_view now read via refs_view.

Populating during analyze

  • SerializedFileSQLiteWriter assigns string ids with two IdProviders and writes each lookup row once (new AddPropertyName / AddPropertyType commands, following the AddType pattern).

find-refs

  • Reads references through refs_view; resolves the m_Script path to its id once per run rather than per object.
  • Rejects databases older than schema version 1 with a clear "unsupported schema version — re-run analyze" message and non-zero exit, instead of an obscure SQL error.

Bonus fix

  • view_material_texture_refs used double-quoted string literals ("Texture", "Material") that SQLite parses as identifiers, so the view errored. This was pre-existing (it failed on old databases too); fixed to single-quoted literals while rewriting the view.

Testing

  • dotnet test — full suite green (UnityFileSystem 425, Analyzer 62, UnityDataTool 234).
  • Fixed the one impacted test (BuildReportTestsrefs_view).
  • Added FindRefsTests coverage: refs_view round-trip, lookup-table consistency, and the unsupported-schema-version rejection.
  • Manually verified on the large customer project: 32% smaller DB, find-refs renders property paths correctly (e.g. m_SavedProperties.m_TexEnvs[2].second.m_Texture), and running find-refs against a pre-change database exits 1 with the friendly message.

Major version bump because the analyze schema will get some changes.
The refs table repeated the property_path/property_type strings on every
row, making it by far the largest part of the analyze database. Store the
strings once in new property_names/property_types lookup tables and keep only
integer ids in refs. A refs_view rejoins them to expose the original strings.

On a large customer project this reduces the database from 515 MB to 351 MB
(-32%) with negligible impact on analyze time.

Also:
- Add PRAGMA user_version = 1; find-refs now rejects older databases with a
  clear "unsupported schema version" message instead of an obscure SQL error.
- Update the material/monoscript views and find-refs queries to read the
  strings via refs_view (or resolve ids once).
- Fix view_material_texture_refs, which used double-quoted string literals
  that SQLite parsed as identifiers (pre-existing, failed on old databases too).
@SkowronskiAndrew SkowronskiAndrew requested a review from Copilot July 3, 2026 13:04
@SkowronskiAndrew SkowronskiAndrew marked this pull request as ready for review July 3, 2026 13:06

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Normalizes the analyzer SQLite schema to significantly reduce database size by deduplicating refs.property_path and refs.property_type strings into lookup tables, while preserving query ergonomics via a refs_view. Updates find-refs, tests, and docs to use the new view and introduces schema versioning to gate incompatible databases.

Changes:

  • Normalized refs by introducing property_names / property_types lookup tables and a refs_view to preserve the pre-change shape for consumers.
  • Updated analyze writer to populate lookup tables once per unique string and store integer ids in refs.
  • Updated find-refs, SQL views, tests, and documentation; added PRAGMA user_version = 1 and rejection of older schemas in find-refs.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

Show a summary per file
File Description
UnityDataTool/UnityDataTool.csproj Bumps tool version to 2.0.0 to reflect the schema/API change.
UnityDataTool.Tests/FindRefsTests.cs Adds coverage for refs_view, lookup-table consistency, and old-schema rejection behavior.
UnityDataTool.Tests/BuildReportTests.cs Updates BuildReport reference assertions to query refs_view for string columns.
ReferenceFinder/ReferenceFinderTool.cs Switches find-refs to read from refs_view, resolves m_Script path id once, and rejects unsupported schema versions.
Documentation/contentlayout.md Fixes Unity manual link target for build report history page.
Documentation/analyzer.md Documents the normalized refs schema, lookup tables, and preferred querying via refs_view.
Documentation/analyze-examples.md Updates example query to use refs_view instead of refs.
Analyzer/SQLite/Writers/SerializedFileSQLiteWriter.cs Adds id providers + one-time inserts to populate property_names / property_types and writes ids into refs.
Analyzer/SQLite/Commands/SerializedFile/AddReference.cs Updates parameter types for refs.property_path / property_type to integers.
Analyzer/SQLite/Commands/SerializedFile/AddPropertyType.cs Adds SQLite command for inserting rows into property_types.
Analyzer/SQLite/Commands/SerializedFile/AddPropertyName.cs Adds SQLite command for inserting rows into property_names.
Analyzer/Resources/MonoScript.sql Updates script_object_view to join via refs_view.
Analyzer/Resources/Init.sql Adds lookup tables, refs_view, updates dependent views to use refs_view, fixes SQLite string literal quoting, and sets user_version = 1.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread Analyzer/Resources/Init.sql Outdated
Comment on lines 139 to 141

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed at the source in 2cc7e92 rather than with IFNULL: refs_view now uses INNER JOIN, so property_path can never be NULL (every refs row's ids are always populated and their lookup rows written in the same transaction). That makes GetString(1) here unconditionally safe. I avoided IFNULL deliberately — it would silently return '' for a genuinely corrupt/orphaned row instead of failing loudly, which would be worse for a read-only analysis DB.

The refs property_path/property_type ids are always populated and their lookup
rows written in the same transaction, so the joins always match. INNER JOIN
expresses that invariant and removes the theoretical NULL that LEFT JOIN implied,
making find-refs' property_path read unconditionally safe.
@SkowronskiAndrew SkowronskiAndrew merged commit 739d383 into main Jul 3, 2026
5 checks passed
@SkowronskiAndrew SkowronskiAndrew deleted the issue44-reduce-ref-table-size branch July 3, 2026 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce size of refs table

2 participants