tiezhu edited this page Aug 15, 2022
·
8 revisions
ChunJun
Introduction
ChunJun (formerly known as FlinkX) is a data integration framework based on Flink. It is stable, easy to use, and efficient, and it integrates with the DataStream/DataSet API. It synchronizes and computes data across heterogeneous data sources. So far, ChunJun has been deployed and runs stably at thousands of companies.
ChunJun abstracts different databases into reader/source plugins, writer/sink plugins and lookup plugins, and it has the following features:
Built on the real-time computing engine Flink; tasks can be configured with JSON templates or SQL scripts, and the SQL scripts are compatible with Flink SQL syntax;
Supports distributed operation and several submission methods, including flink-standalone, yarn-session, and yarn-per-job;
Supports one-click Docker deployment, and can be deployed and run on k8s;
Supports synchronization and calculation across more than 20 heterogeneous data sources, such as MySQL, Oracle, SQLServer, Hive, and Kudu;
Easy to extend and highly flexible: a newly added data source plugin can interoperate with existing data source plugins immediately, and plugin developers do not need to care about the code logic of other plugins;
Supports full synchronization as well as incremental synchronization and interval polling;
Supports offline synchronization and calculation, and is also compatible with real-time scenarios;
Supports dirty data storage and provides metrics monitoring;
Works with the Flink checkpoint mechanism to resume from breakpoints and recover tasks after failures;
Supports synchronizing DML data as well as DDL statements such as 'CREATE TABLE' and 'ALTER COLUMN'.
Build And Compilation
Get the code
Use git to clone the ChunJun source code:
git clone https://github.com/DTStack/chunjun.git
Build
Execute the following command in the project directory:
./mvnw clean package -DskipTests
Or execute:
sh build/build.sh
Multi-platform compatibility
ChunJun currently supports the TDH and open-source Hadoop platforms. Each platform must be packaged with a different Maven command.
| Hadoop Platform | Command | Comment |
| --- | --- | --- |
| tdh | mvn clean package -DskipTests -P default,tdh | Packages the inceptor plugin and the plugins supported by default |
| default | mvn clean package -DskipTests -P default | Packages all plugins except the inceptor plugin |
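The platform-to-profile mapping in the table above could be wrapped in a small helper so that a build script picks the right profiles. This is only a sketch: the function name `chunjun_profiles` and the `PLATFORM` variable are illustrative and not part of ChunJun.

```shell
# Hypothetical helper (not shipped with ChunJun): map a Hadoop platform
# name to the Maven profiles listed in the table above.
chunjun_profiles() {
  case "$1" in
    tdh)     echo "default,tdh" ;;
    default) echo "default" ;;
    *)       echo "unknown platform: $1" >&2; return 1 ;;
  esac
}

# Print the Maven command for the chosen platform.
# PLATFORM is an assumed variable; it falls back to "default" when unset.
PLATFORM="${PLATFORM:-default}"
if profiles=$(chunjun_profiles "$PLATFORM"); then
  echo "mvn clean package -DskipTests -P $profiles"
fi
```

Run it with `PLATFORM=tdh` to get the TDH variant of the command.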
Common problems
1. Cannot find dependencies
Solution: Some driver packages are in the directory '$CHUNJUN_HOME/jars'. You can install these dependencies manually or execute the command below:
## windows
%CHUNJUN_HOME%\bin\install_jars.bat
## unix
$CHUNJUN_HOME/bin/install_jars.sh
2. Compiling the module 'ChunJun-core' throws 'Failed to read artifact descriptor for com.google.errorprone:javac-shaded'
The following table shows which Flink version each ChunJun branch corresponds to. If the versions are not aligned, tasks may fail with problems such as 'Serialization Exceptions' or 'NoSuchMethod Exception'.
| Branches | Flink version |
| --- | --- |
| master | 1.12.7 |
| 1.12_release | 1.12.7 |
| 1.10_release | 1.10.1 |
| 1.8_release | 1.8.3 |
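A version mismatch is easier to catch before submitting a task than after a serialization error. The sketch below encodes the branch table as a lookup and compares it with a Flink version you supply. The function names are illustrative and not part of ChunJun.

```shell
# Hypothetical helper: return the Flink version expected by a ChunJun
# branch, following the branch table above.
expected_flink_version() {
  case "$1" in
    master|1.12_release) echo "1.12.7" ;;
    1.10_release)        echo "1.10.1" ;;
    1.8_release)         echo "1.8.3" ;;
    *)                   echo "unknown branch: $1" >&2; return 1 ;;
  esac
}

# Compare the expected version for a branch ($1) with the Flink
# version you actually deploy ($2).
check_flink_version() {
  expected=$(expected_flink_version "$1") || return 1
  if [ "$expected" = "$2" ]; then
    echo "ok: branch $1 matches Flink $2"
  else
    echo "mismatch: branch $1 expects Flink $expected, found $2"
    return 1
  fi
}

check_flink_version master 1.12.7
```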
ChunJun supports running tasks in multiple modes. Each mode depends on a different environment and follows different steps, as described below.
Local
Local mode depends on neither a Flink environment nor a Hadoop environment. It starts a JVM process in the local environment to run the task.
Steps
Go to the 'chunjun-dist' directory and execute the command below:
sh bin/chunjun-local.sh -job $SCRIPT_PATH
The "$SCRIPT_PATH" parameter is the path where the task script is located.
After you execute the command, the task runs locally.
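Because a wrong "$SCRIPT_PATH" is an easy mistake, a small pre-flight check can help. The sketch below verifies that the job script exists and then prints the launch command rather than running it. The function name `local_command` is illustrative and not part of ChunJun.

```shell
# Hypothetical pre-flight check for local mode: verify that the job
# script exists, then print the launch command instead of running it.
local_command() {
  script_path="$1"
  if [ ! -f "$script_path" ]; then
    echo "job script not found: $script_path" >&2
    return 1
  fi
  echo "sh bin/chunjun-local.sh -job $script_path"
}
```

For example, calling `local_command chunjun-examples/json/stream/stream.json` from the 'chunjun-dist' directory prints the command only if that file is present.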
Standalone
Standalone mode depends on a Flink Standalone environment and does not depend on a Hadoop environment.
Steps
1. Start Flink Standalone Cluster
sh $FLINK_HOME/bin/start-cluster.sh
After the cluster starts, the Flink web UI listens on port 8081 by default; you can change the port in 'flink-conf.yaml'. Open port 8081 on the current machine to reach the web UI of the standalone cluster.
2. Submit task
Go to the 'chunjun-dist' directory and execute the command below:
sh bin/chunjun-standalone.sh -job chunjun-examples/json/stream/stream.json
After the command executes successfully, you can observe the task status in the Flink web UI.
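Before submitting, it can be useful to confirm from the command line that the standalone cluster is reachable. The sketch below queries Flink's REST '/overview' endpoint with curl. The function names are illustrative, and the host and port default to localhost and 8081 on the assumption that 'flink-conf.yaml' keeps the default web port.

```shell
# Build the URL of the Flink REST overview endpoint.
# $1 = host (default: localhost), $2 = port (default: 8081).
flink_overview_url() {
  echo "http://${1:-localhost}:${2:-8081}/overview"
}

# Return success if the cluster answers on its REST endpoint.
# -s silences progress output, -f makes curl fail on HTTP errors.
flink_is_up() {
  curl -sf "$(flink_overview_url "$@")" >/dev/null
}
```

A submission can then be guarded with, for example, `flink_is_up && sh bin/chunjun-standalone.sh -job chunjun-examples/json/stream/stream.json`.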
Yarn Session
Yarn-session mode depends on the Flink jars and a Hadoop environment, and the yarn-session must be started before the task is submitted.
Steps
1. Start yarn-session environment
Yarn-session mode depends on the Flink and Hadoop environments. Set $HADOOP_HOME and $FLINK_HOME in advance, and upload 'chunjun-dist' with the yarn-session '-t' parameter.
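As a sketch of this step, the helper below checks the two environment variables and prints a yarn-session start command that ships a directory with '-t'. The function name is illustrative, the path you pass is your own 'chunjun-dist' location, and the '-d' (detached) flag is an added assumption about how you want to run the session, not something this page prescribes.

```shell
# Hypothetical helper: print the command that starts a yarn-session and
# ships the given directory ($1, your chunjun-dist path) with '-t'.
# '-d' starts the session detached; drop it to keep it in the foreground.
yarn_session_command() {
  if [ -z "${FLINK_HOME:-}" ] || [ -z "${HADOOP_HOME:-}" ]; then
    echo "FLINK_HOME and HADOOP_HOME must be set" >&2
    return 1
  fi
  echo "$FLINK_HOME/bin/yarn-session.sh -t $1 -d"
}
```

With both variables exported, `yarn_session_command /path/to/chunjun-dist` prints the command to review before you run it.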
2. Submit task
Get the application id $SESSION_APPLICATION_ID of the yarn-session from the YARN web UI, then go to the 'chunjun-dist' directory and execute the command below:
sh ./bin/chunjun-yarn-session.sh -job chunjun-examples/json/stream/stream.json -confProp {\"yarn.application.id\":\"SESSION_APPLICATION_ID\"}
'yarn.application.id' can also be set in 'flink-conf.yaml'.
After the submission succeeds, you can observe the task status in the YARN web UI.
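The escaped quotes in the '-confProp' argument are easy to get wrong by hand. The sketch below builds the same JSON value with printf and prints the full submit command with the value in single quotes, which the shell passes through as the same single argument. The function name `conf_prop` is illustrative and the application id is a made-up placeholder. Besides the YARN web UI, `yarn application -list` is another way to look up the id.

```shell
# Hypothetical helper: build the JSON value passed to -confProp
# from a YARN application id ($1).
conf_prop() {
  printf '{"yarn.application.id":"%s"}\n' "$1"
}

# Placeholder id for illustration only; replace it with the id of
# your running yarn-session.
SESSION_APPLICATION_ID="application_1660000000000_0001"

# Print the submit command; single quotes keep the JSON as one argument.
echo "sh ./bin/chunjun-yarn-session.sh -job chunjun-examples/json/stream/stream.json -confProp '$(conf_prop "$SESSION_APPLICATION_ID")'"
```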