Comparing StaticAnalysisTools:master...google:master · StaticAnalysisTools/codesearch · GitHub
base repository: StaticAnalysisTools/codesearch
base: master
head repository: google/codesearch
compare: master
  • 12 commits
  • 19 files changed
  • 1 contributor

Commits on May 26, 2024

  1. index: adjust code in preparation for large indexes

    The old code was written with care to avoid problems
    with >2GB indexes on 32-bit machines, using uint32
    offsets throughout. But then it mmap'ed the index file,
    which wouldn't work with a >2GB index anyway, so none
    of that code was really useful in the end.
    
    Move the code to use int for offsets and values throughout.
    This will still not work well with >2GB indexes on 32-bit
    machines, but again the code already doesn't work in that case.
    Now it may at least diagnose the problem a bit better.
    
    The old uint32 code did work well for up to 4GB indexes
    on 64-bit machines. This is in preparation for >4GB indexes.
    rsc committed May 26, 2024
    950529d
  2. index: add 64-bit index support

    Writing is controlled (for testing) by a global bool.
    Reading automatically adapts to the input file.
    rsc committed May 26, 2024
    7752803
  3. index: use a single temp index file instead of one per sorted segment

    Writing a new temp file for each sorted segment makes very little sense.
    The only possible benefit would be if there was
    memory fragmentation that meant individual 64MB
    mmaps would fit but one giant one would not.
    That's a rather specific condition that doesn't merit
    the complexity. Also, some systems have a low fd limit,
    so keeping one file open per segment might run into that
    for large indexes.
    rsc committed May 26, 2024
    cb34a47
  4. index: write smaller temp files to disk

    The old code was writing out the posting lists as raw uint64 slices,
    but the entropy on these is quite low. Indexing the Linux 5.4 tree,
    the old code wrote 12 segments of 64 MB each = 768 MB.
    The new code writes 12 segments of ~9.2 MB each = 110 MB.
    The factor of almost exactly 7 is consistent with other trees too,
    although I can't explain why 7.
    
    If the index ends up being 10 GB, not writing 70 GB of temp data
    is a good thing on systems without tons of free disk space.
    rsc committed May 26, 2024
    e0ecb01
  5. index: γ-encode posting lists in 64-bit index

    Since the 64-bit indexes will be unintelligible to
    32-bit csearch anyway, we can take this opportunity
    to make other file format adjustments.
    
    Change the encoding of posting lists from byte-level
    varint encoding to bit-level γ-encoding. The minimum
    number of bits for each fileid delta drops from 8 to 3.
    Indexing the Linux 5.4 tree, the 12 sorted segments that
    were reduced from 768 MB to 110 MB by varint-encoding
    are now reduced further to 77 MB by gamma-encoding.
    The index itself is reduced from 119 MB to 87 MB.
    
    The time required to index the tree is up 30%, from
    about 14 seconds to about 18 seconds.
    
    The time required for csearch -l '^printf' in the Linux index
    is perhaps up 2%, from 0.46 to 0.47 seconds.
    rsc committed May 26, 2024
    9c4247d
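
The temp-file savings described in commit e0ecb01 come from replacing fixed-width entries with a byte-level varint encoding. The following is a minimal sketch of that general idea, assuming a sorted list of file IDs stored as gaps from the previous ID. The names `encodeDeltas` and `decodeDeltas` and the byte layout are hypothetical and are not taken from the codesearch source, which stores combined trigram and file ID entries.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeDeltas writes a strictly increasing list of file IDs as
// varint-encoded gaps. Hypothetical helper, not the codesearch format.
func encodeDeltas(ids []uint32) []byte {
	var out []byte
	var buf [binary.MaxVarintLen64]byte
	prev := uint32(0)
	for _, id := range ids {
		// The first gap is measured from 0, so it equals the ID itself.
		n := binary.PutUvarint(buf[:], uint64(id-prev))
		out = append(out, buf[:n]...)
		prev = id
	}
	return out
}

// decodeDeltas reverses encodeDeltas by summing the gaps.
func decodeDeltas(data []byte) ([]uint32, error) {
	var ids []uint32
	prev := uint32(0)
	for len(data) > 0 {
		delta, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, errors.New("malformed varint")
		}
		data = data[n:]
		prev += uint32(delta)
		ids = append(ids, prev)
	}
	return ids, nil
}

func main() {
	ids := []uint32{3, 4, 10, 300}
	enc := encodeDeltas(ids)
	fmt.Printf("%d ids: %d bytes as varint deltas, %d bytes as raw uint64\n",
		len(ids), len(enc), 8*len(ids))
	dec, err := decodeDeltas(enc)
	fmt.Println(dec, err)
}
```

Small gaps fit in a single byte, which is why low-entropy posting lists shrink so much compared with 8 bytes per entry.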

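Commit 9c4247d moves from byte-level varints to bit-level γ-encoding. As a rough illustration, here is a sketch of plain Elias γ coding, in which a value v >= 1 is written as floor(log2 v) zero bits followed by the binary digits of v. The `bitWriter` and `bitReader` types are hypothetical, and the sketch does not reproduce how codesearch maps deltas or the value 0 onto coded values.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitWriter accumulates bits MSB-first into a byte slice.
type bitWriter struct {
	buf  []byte
	nbit int // total number of bits written
}

func (w *bitWriter) writeBit(b uint) {
	if w.nbit%8 == 0 {
		w.buf = append(w.buf, 0)
	}
	if b != 0 {
		w.buf[len(w.buf)-1] |= byte(0x80) >> uint(w.nbit%8)
	}
	w.nbit++
}

// writeGamma appends the Elias γ code for v, which must be >= 1:
// n zero bits followed by the n+1 bits of v, where n = floor(log2(v)).
func (w *bitWriter) writeGamma(v uint32) {
	n := bits.Len32(v) - 1
	for i := 0; i < n; i++ {
		w.writeBit(0)
	}
	for i := n; i >= 0; i-- {
		w.writeBit(uint(v>>uint(i)) & 1)
	}
}

// gammaLen reports the number of bits in the γ code for v (v >= 1).
func gammaLen(v uint32) int {
	return 2*bits.Len32(v) - 1
}

// bitReader reads bits MSB-first. This sketch does no bounds checking,
// so malformed input will panic.
type bitReader struct {
	buf []byte
	pos int
}

func (r *bitReader) readBit() uint {
	b := uint(r.buf[r.pos/8]>>(7-uint(r.pos%8))) & 1
	r.pos++
	return b
}

// readGamma counts leading zero bits, then reads that many payload bits.
func (r *bitReader) readGamma() uint32 {
	n := 0
	for r.readBit() == 0 {
		n++
	}
	v := uint32(1)
	for i := 0; i < n; i++ {
		v = v<<1 | uint32(r.readBit())
	}
	return v
}

func main() {
	vals := []uint32{1, 2, 3, 4, 100}
	var w bitWriter
	for _, v := range vals {
		w.writeGamma(v)
	}
	fmt.Printf("%d values in %d bits (%d bytes)\n", len(vals), w.nbit, len(w.buf))
	r := bitReader{buf: w.buf}
	for range vals {
		fmt.Print(r.readGamma(), " ")
	}
	fmt.Println()
}
```

In plain γ coding the value 1 takes 1 bit and the values 2 and 3 take 3 bits each, whereas a varint never uses fewer than 8 bits. The extra bit-level work on every value is one plausible reason for the longer indexing time the commit reports.
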
Commits on Jun 19, 2024

  1. cindex: index content in zip files when using -zip

    A zip file z.zip containing file f is named "z.zip#f" in messages.
    A more natural syntax was z.zip:f but the : seemed to get lost
    in large lists of file names, especially compared to its importance.
    Worse, with "z.zip:f:match" it was unclear whether "z.zip" contained
    "f:match" or "z.zip" contained file "f" which contained "match".
    "z.zip#f:match" is clearer and evokes HTML URL anchor fragments.
    rsc committed Jun 19, 2024
    8c6afba
  2. index: compact 64-bit index format

    Prefix-compress root and name lists in blocks of 16 paths.
    The prefix compression saves 4X, even more with long names.
    The blocking preserves random access to the name list and
    also reduces the name index size by 16X.
    
    Improve γ-coding of posting lists by inserting 0 at 16 instead
    of inserting it at 1. Since 1 is the most common coded value,
    it should keep the 1-bit representation. Empirically, mapping 0
    into the 4-bit values is the best choice for space savings.
    
    Varint-encode the posting list index entries, saving 2-3X.
    This matters most for small indexes; in large indexes the
    posting lists themselves dominate the posting list index.
    rsc committed Jun 19, 2024
    fad9cf8
  3. csearch, cgrep: add -A, -B, -C, -v (cgrep only) flags

    Also start on HTML output for a possible web interface.
    
    cmd/cgrep: add -v flag
    rsc committed Jun 19, 2024
    e53428b
  4. fde726f
  5. 8aad6cc
  6. csearch: add -html flag

    Experimental and not very useful yet.
    rsc committed Jun 19, 2024
    d481e2c
  7. csweb: very basic web interface

    rsc committed Jun 19, 2024
    b34f2a0