Issue Description
I have trained an LSTM network with two LSTM layers of 64 units each and a softmax output layer (model attached: LSTMModel.zip).
During training I use a data set of 5 million example sequences and a batch size of 128. An epoch takes about 400 minutes on my machine, i.e. an average of roughly 5 milliseconds per sequence, including feed-forward, back-propagation, and evaluation against the test set (100K samples).
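The per-sequence figure is just the epoch time divided by the data set size; as a sanity check (values taken from the numbers above):

```java
// Re-derives the ~5 ms per-sequence training time quoted above.
public class ThroughputCheck {
    public static void main(String[] args) {
        long sequences = 5_000_000L;           // training set size
        double epochMs = 400.0 * 60_000.0;     // 400-minute epoch in milliseconds
        double msPerSequence = epochMs / sequences;
        System.out.println(msPerSequence + " ms per sequence"); // prints "4.8 ms per sequence"
    }
}
```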
I saved the model and I am now using it for predictions, but a single call to output() takes about 90 milliseconds on average. That is 15-20 times slower than the per-sequence time during training (and effectively more than that, considering the training figure also includes back-propagation), so prediction (feed-forward only) is an order of magnitude slower than training.
I have wrapped the call to output() with timing code, so I am sure the slowdown is not due to ETL bottlenecks or other parts of the code (which are, in any case, the same ones used for training). Test code is attached: Test.java.zip.
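For reference, the measurement is essentially the sketch below. The real call is net.output(features) on the loaded DL4J MultiLayerNetwork; a Runnable stands in here so the snippet compiles without DL4J on the classpath, and the names are illustrative. The warm-up loop matters because the first call after loading a model pays one-off costs (CUDA context creation, cuDNN algorithm selection, workspace setup) that should not be counted in the average.

```java
// Minimal sketch of a timing wrapper around a prediction call.
// In the real test, the Runnable body is () -> net.output(features).
public class PredictTiming {

    static double averageMillis(Runnable call, int warmup, int measured) {
        // Warm-up runs: exclude one-off initialization costs from the average.
        for (int i = 0; i < warmup; i++) {
            call.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            call.run();
        }
        // Convert nanoseconds to milliseconds and average over the runs.
        return (System.nanoTime() - start) / 1e6 / measured;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the actual output() call.
        double avg = averageMillis(() -> Math.sqrt(42.0), 100, 1_000);
        System.out.println("average call time: " + avg + " ms");
    }
}
```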
Version Information
I am using Windows 10 with CUDA and cuDNN, as confirmed by the debug output:
10:07:17.345 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
10:07:19.400 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
10:07:19.441 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 10]
10:07:19.441 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Cores: [12]; Memory: [4,0GB];
10:07:19.441 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
10:07:19.466 [main] INFO o.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 11.6.55
10:07:19.467 [main] INFO o.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce GTX 1070]; cc: [6.1]; Total memory: [8589803520]
10:07:19.468 [main] INFO o.nd4j.linalg.jcublas.JCublasBackend - Backend build information:
MSVC: 192930146
STD version: 201402L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
HAVE_CUDNN
Below is an extract of my pom.xml with the relevant configuration:
<!-- Core -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-core</artifactId>
    <version>1.0.0-M2.1</version>
</dependency>
<!-- CUDA engine (I have CUDA 11.8 and cuDNN 9.6 installed on my machine) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-11.6</artifactId>
    <version>1.0.0-M2.1</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-11.6</artifactId>
    <version>1.0.0-M2.1</version>
    <classifier>windows-x86_64-cudnn</classifier>
</dependency>
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>cuda-platform-redist</artifactId>
    <version>11.6-8.3-1.5.7</version>
</dependency>
Additional Information
See attached model, test code, and dependencies.