
MultiLayerNetwork.output() 20 times slower than fit() for LSTM #10168

@mzattera

Issue Description

I have trained an LSTM network with 2 LSTM layers of 64 units and a softmax output layer (model attached: LSTMModel.zip).
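For context, a configuration equivalent to the described topology would look roughly like this (a sketch only; nIn, nOut, and the updater are placeholders, the actual model is the attached LSTMModel.zip):

    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.LSTM;
    import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.learning.config.Adam;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    int nIn = 10, nOut = 5; // placeholders; the real values come from the saved model

    // Two LSTM layers of 64 units followed by a softmax output layer
    MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Adam()) // placeholder updater
        .list()
        .layer(new LSTM.Builder().nIn(nIn).nOut(64).activation(Activation.TANH).build())
        .layer(new LSTM.Builder().nIn(64).nOut(64).activation(Activation.TANH).build())
        .layer(new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX).nIn(64).nOut(nOut).build())
        .build();

    MultiLayerNetwork net = new MultiLayerNetwork(conf);
    net.init();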

During training, I use a data set of 5 million example sequences and a batch size of 128. An epoch takes about 400 minutes on my machine, i.e. an average of roughly 5 milliseconds per sequence, including the forward pass, back-propagation, and evaluation against a test set of 100K samples.
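For reference, the per-sequence figure follows directly from the epoch time: 400 min × 60,000 ms/min ÷ 5,000,000 sequences ≈ 4.8 ms per sequence.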

I saved the model and am now using it for predictions, but a single call to output() takes about 90 milliseconds on average. That is 15-20 times slower than the per-sequence time during training (and probably double that, considering the training time also includes back-propagation), so prediction (feed forward alone) is an order of magnitude slower than training.

I have wrapped the call to output() with timing code, so I am sure the slowdown is not due to ETL bottlenecks or other parts of the code (which are, in any case, the same ones used for training). Test code is attached: Test.java.zip.
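For illustration, the measurement is equivalent to a loop of this shape (a sketch, not the attached test code; hasNextSequence() and nextFeatures() are hypothetical helpers standing in for my ETL):

    import java.io.File;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.api.ndarray.INDArray;

    MultiLayerNetwork net = MultiLayerNetwork.load(new File("LSTMModel.zip"), false);

    long totalNanos = 0;
    int count = 0;
    while (hasNextSequence()) {              // hypothetical ETL helper
        INDArray features = nextFeatures();  // shape [1, nIn, timeSteps]
        long t0 = System.nanoTime();
        INDArray out = net.output(features); // only the forward pass is inside the timer
        totalNanos += System.nanoTime() - t0;
        count++;
    }
    System.out.printf("average output() time: %.1f ms%n", totalNanos / 1e6 / count);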

Version Information

I am using Windows 10, with CUDA and cuDNN, as confirmed by the debug output:

10:07:17.345 [main] INFO  org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
10:07:19.400 [main] INFO  org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
10:07:19.441 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 10]
10:07:19.441 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Cores: [12]; Memory: [4,0GB];
10:07:19.441 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
10:07:19.466 [main] INFO  o.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 11.6.55
10:07:19.467 [main] INFO  o.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce GTX 1070]; cc: [6.1]; Total memory: [8589803520]
10:07:19.468 [main] INFO  o.nd4j.linalg.jcublas.JCublasBackend - Backend build information:
 MSVC: 192930146
 STD version: 201402L
 DEFAULT_ENGINE: samediff::ENGINE_CUDA
 HAVE_FLATBUFFERS
 HAVE_CUDNN

Below is an extract of my pom.xml with the relevant configuration:

	<!-- Core -->
	<dependency>
	    <groupId>org.deeplearning4j</groupId>
	    <artifactId>deeplearning4j-core</artifactId>
	    <version>1.0.0-M2.1</version>
	</dependency>

	<!-- CUDA engine (I have CUDA 11.8 and cuDNN 9.6 installed on my machine) -->
	<dependency>
	    <groupId>org.nd4j</groupId>
	    <artifactId>nd4j-cuda-11.6</artifactId>
	    <version>1.0.0-M2.1</version>
	</dependency>
	<dependency>
	    <groupId>org.nd4j</groupId>
	    <artifactId>nd4j-cuda-11.6</artifactId>
	    <version>1.0.0-M2.1</version>
	    <classifier>windows-x86_64-cudnn</classifier>
	</dependency>
	<dependency>
	    <groupId>org.bytedeco</groupId>
	    <artifactId>cuda-platform-redist</artifactId>
	    <version>11.6-8.3-1.5.7</version>
	</dependency>

Additional Information

See attached model, test code, and dependencies.
