Parquet File Metadata caching implementation by shiyer7474 · Pull Request #541 · Altinity/ClickHouse · GitHub
Skip to content

Parquet File Metadata caching implementation#541

Closed
shiyer7474 wants to merge 10 commits into
project-antalyafrom
metadata_cache_for_parquet
Closed

Parquet File Metadata caching implementation#541
shiyer7474 wants to merge 10 commits into
project-antalyafrom
metadata_cache_for_parquet

Conversation

@shiyer7474

Copy link
Copy Markdown

Implements Parquet Metadata caching.

More details and documentation in link.

Changelog category (leave one):

  • New Feature

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/


Modify your CI run:

NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • Add your option here

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • Add your option here

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

@altinity-robot

altinity-robot commented Dec 1, 2024

Copy link
Copy Markdown
Collaborator

@svb-alt

svb-alt commented Dec 6, 2024

Copy link
Copy Markdown
Collaborator

resolves #480
resolves #481

{
if (!use_metadata_cache || !metadata_cache_key.length())
{
ProfileEvents::increment(ProfileEvents::ParquetMetaDataCacheMisses);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want metadata cache miss metrics being incremented when use_metadata_cache=false?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I coded the increment in the first pass, but removed it so that users need not see a new alarming "miss" metric if they are unaware/disabled the in-memory cache feature (and satisfied with the existing local disk caching feature). Let me know what you think.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, I did not quite follow your explanation. As far as I can tell (by reading the code), whenever initializeIfNeeded() is called (and it is called once per file, iirc), it'll try to fetch metadata either from cache or from input source.

If cache is disabled, it'll still increment the cache miss. It doesn't look right to me at a first glance, but if there has been a discussion and this was the chosen design, that's ok.

Am I missing something?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I mixed up my revisions and reasoning. I have removed the increment if cache is disabled. Thanks!

Comment thread src/Processors/Formats/Impl/ParquetBlockInputFormat.cpp
@@ -0,0 +1,28 @@
-- Tags: no-parallel, no-fasttest

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add a few tests for parquet_metadata_cache_max_entries?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would have to be an integration test, maybe with 10s or 100's of parquet files. I can add it in another PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If local files also benefited from metadata cache, an integration test wouldn't be needed I suppose. But doesn't look like we want to do it

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For local Parquet files, OS file cache will be in effect.


private:
ParquetFileMetaDataCache(UInt64 max_cache_entries);
static std::mutex mutex;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to me CacheBase already has a mutex https://github.com/ClickHouse/ClickHouse/blob/94709d7a5a1ede65a9eed92ff9cbf73e28e62561/src/Common/CacheBase.h#L245.

Why do we need another one? And why is it static?

I think I am missing something. ParquetFileMetaDataCache is a static instance, shared among all instances. Why do you need to make this mutex static as well?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, basic double checked locking pattern of singleton initialization.

I will replace this with the modern equivalent that uses std::once_only.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check my comment in https://github.com/Altinity/ClickHouse/pull/541/files#r1875950244. I might be missing something, tho

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread src/Processors/Formats/Impl/ParquetBlockInputFormat.h Outdated
Comment thread src/Processors/Formats/Impl/ParquetBlockInputFormat.h Outdated
{
std::lock_guard lock(mutex);
static std::once_flag once;
std::call_once(once, [&] {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's take a step back. IIRC, static initialization in c++ 11 or greater is guaranteed to happen only once and it is thread safe.

From the cpp standard:

such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. [...] If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.

So, wouldn't something like the below suffice?:

static ParquetFileMetaDataCache instance(max_cache_entries);
return instance;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread src/Core/Settings.h
Comment thread src/Core/ServerSettings.h

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned it into a server settings, makes more sense as it can't be changed at runtime

@Enmk Enmk deleted the branch project-antalya February 20, 2025 15:23
@Enmk Enmk closed this Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants