Multithreaded replication WIP by meiji163 · Pull Request #1454 · github/gh-ost · GitHub

Draft
meiji163 wants to merge 67 commits into master from meiji163/parallel-repl

@meiji163
Contributor

@meiji163 meiji163 commented Oct 4, 2024

Description

This PR introduces multi-threaded replication for applying DML queries to the ghost table. The goal is to be able to migrate tables with a high rate of DML queries (e.g. >5k rows/s). Currently gh-ost lags behind in these situations, taking a very long time to complete or not completing at all.

Similar to MySQL replication threads, gh-ost will stream binlog events from the source and group them into transactions. It then submits the transactions to a pool of workers to apply the transactions concurrently on the ghost table. We ensure that dependent transactions are applied in a consistent order (equivalent to MySQL multi-threaded replication with replica_parallel_type=LOGICAL_CLOCK and replica_preserve_commit_order=0).

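With WRITESET dependency tracking enabled on the source (binlog_transaction_dependency_tracking=WRITESET), the binlog marks far more transactions as independent of each other, which allows a large amount of parallelism in the transaction applier.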

Changes

TODO

Performance tests

TODO

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

mhamza15 previously approved these changes Apr 1, 2025
* Remove error return value since we don't use it.

* Lock the mutex whenever we plan to update the low watermark to avoid a race condition.

* Check for data races in our unit tests.

* Still return an error from ProcessEventsUntilDrained but actually check it in our code.

* Make coordinator_test.go check the err from ProcessEventsUntilDrained again

* Remove unreachable return in ProcessEventsUntilDrained
hugodorea previously approved these changes Apr 9, 2025
…ark (#1531)

* Notify waiting channels on completed transaction, not just the watermark.

* Add checksum validation to coordinator test

* Use errgroup to perform transactions concurrently in coordinator_test.go

* Configure concurrency separate from total number of transactions.

* Run similar number of txs to previous test and ignore context.

* Have at least 1 child in a transaction.

* Notify waiting channels for the current sequence number.
hugodorea previously approved these changes Apr 10, 2025
@meiji163
Contributor Author

meiji163 commented Apr 12, 2025

@TomKnaepen

Hi @meiji163, my team has been running into the exact limits you describe in this PR, so we were very excited to see the progress already made here. However, over the last 2 weeks I've run some tests with this PR and still notice a data inconsistency issue.

We use gh-ost on AWS Aurora, and my testing has been with a production-like setup. On the table we're trying to migrate, the only interesting queries (not SELECTs) are fairly simple: we insert rows, then update some of them shortly afterwards to remove a timestamp value.

The inconsistency I see is caused by some UPDATEs not being applied to the ghost table. This happens infrequently but consistently throughout the tests (50 rows affected out of ~2 million).
Here's an example problematic row:

#250516 13:06:46 server id 1951482084  end_log_pos 26539240 CRC32 0x364830a9   Anonymous_GTID  last_committed=1822 sequence_number=1824    rbr_only=yes    original_committed_timestamp=1747393606597218   immediate_commit_timestamp=1747393606597218 transaction_length=3944
# at 26539538
### INSERT INTO `DB_ACTIVE`.`message`
### SET
###   @1=39181433

--

#250516 13:06:46 server id 1951482084  end_log_pos 26589750 CRC32 0x2b5cf4fe   Anonymous_GTID  last_committed=1867 sequence_number=1868    rbr_only=yes    original_committed_timestamp=1747393606648774   immediate_commit_timestamp=1747393606648774 transaction_length=7447
# at 26590057
### UPDATE `DB_ACTIVE`.`message`
### WHERE
###   @1=39181433
###   @52=1747480006593
### SET
###   @1=39181433
###   @52=NULL

--

#250516 13:06:46 server id 1951482084  end_log_pos 28141453 CRC32 0xd8e3f39c   Anonymous_GTID  last_committed=2029 sequence_number=2030    rbr_only=yes    original_committed_timestamp=1747393606747371   immediate_commit_timestamp=1747393606747371 transaction_length=3949
# at 28141756
### INSERT INTO `DB_ACTIVE`.`_message_gho`
### SET
###   @1=39181433

It doesn't quite look like the same issue you ran into and described above because the dependency graph looked fairly linear:

last_committed=1822 sequence_number=1824 -- INSERT
last_committed=1824 sequence_number=1825
last_committed=1825 sequence_number=1826
last_committed=1826 sequence_number=1828
last_committed=1828 sequence_number=1832
last_committed=1832 sequence_number=1836
last_committed=1836 sequence_number=1837
last_committed=1837 sequence_number=1840
last_committed=1840 sequence_number=1844
last_committed=1844 sequence_number=1854
last_committed=1854 sequence_number=1856
last_committed=1856 sequence_number=1857
last_committed=1857 sequence_number=1858
last_committed=1858 sequence_number=1862
last_committed=1862 sequence_number=1863
last_committed=1863 sequence_number=1866
last_committed=1866 sequence_number=1867
last_committed=1867 sequence_number=1868 -- UPDATE
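
To make "fairly linear" concrete: in the list above, each entry's last_committed equals the sequence_number of the entry before it, so these transactions form a single dependency chain from the INSERT to the UPDATE. A small Go check of that property, as a sketch only; `Interval` and `IsLinearChain` are hypothetical names and not part of gh-ost:

```go
package main

import "fmt"

// Interval is a hypothetical holder for the logical-clock pair that
// mysqlbinlog prints on each GTID event.
type Interval struct {
	LastCommitted  int64
	SequenceNumber int64
}

// IsLinearChain reports whether every entry depends directly on the entry
// before it, i.e. chain[i].LastCommitted == chain[i-1].SequenceNumber.
func IsLinearChain(chain []Interval) bool {
	for i := 1; i < len(chain); i++ {
		if chain[i].LastCommitted != chain[i-1].SequenceNumber {
			return false
		}
	}
	return true
}

func main() {
	// The pairs listed above, from the INSERT (1824) to the UPDATE (1868).
	chain := []Interval{
		{1822, 1824}, {1824, 1825}, {1825, 1826}, {1826, 1828},
		{1828, 1832}, {1832, 1836}, {1836, 1837}, {1837, 1840},
		{1840, 1844}, {1844, 1854}, {1854, 1856}, {1856, 1857},
		{1857, 1858}, {1858, 1862}, {1862, 1863}, {1863, 1866},
		{1866, 1867}, {1867, 1868},
	}
	fmt.Println(IsLinearChain(chain)) // prints true
}
```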

Do you have any ideas what could be the issue, and is there anything else I can provide?

Thank you!

@meiji163
Contributor Author

Thanks for reporting this, @TomKnaepen. I also ran into more data consistency issues after the latest changes, but I haven't had time to investigate further. We thought it might be related to the binlogsyncer connection being closed and reopened. Is there any error in your gh-ost output?
