Issue Description
I have trained an LSTM network with two LSTM layers of 64 units each and a softmax output layer (model attached: LSTMModel.zip).
During training I use a data set of 5 million example sequences and a batch size of 128. An epoch takes about 400 minutes on my machine, i.e. an average of roughly 5 milliseconds per sequence, including feed-forward, back-propagation, and evaluation against the test set (100K samples).
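The per-sequence figure is just the epoch time divided by the data set size; as a sanity check (values taken from the numbers above):

```java
// Re-derives the ~5 ms per-sequence training time quoted above.
public class ThroughputCheck {
    public static void main(String[] args) {
        long sequences = 5_000_000L;           // training set size
        double epochMs = 400.0 * 60_000.0;     // 400-minute epoch in milliseconds
        double msPerSequence = epochMs / sequences;
        System.out.println(msPerSequence + " ms per sequence"); // prints "4.8 ms per sequence"
    }
}
```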
I saved the model and I am now using it for predictions, but a single call to output() takes about 90 milliseconds on average. That is 15-20 times slower than the per-sequence time during training (and effectively more than that, considering the training figure also includes back-propagation), so prediction (feed-forward only) is an order of magnitude slower than training.
I have wrapped the call to output() with timing code, so I am sure the slowdown is not due to ETL bottlenecks or other parts of the code (which are, in any case, the same ones used for training). Test code is attached: Test.java.zip.
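For reference, the measurement is essentially the sketch below. The real call is net.output(features) on the loaded DL4J MultiLayerNetwork; a Runnable stands in here so the snippet compiles without DL4J on the classpath, and the names are illustrative. The warm-up loop matters because the first call after loading a model pays one-off costs (CUDA context creation, cuDNN algorithm selection, workspace setup) that should not be counted in the average.

```java
// Minimal sketch of a timing wrapper around a prediction call.
// In the real test, the Runnable body is () -> net.output(features).
public class PredictTiming {

    static double averageMillis(Runnable call, int warmup, int measured) {
        // Warm-up runs: exclude one-off initialization costs from the average.
        for (int i = 0; i < warmup; i++) {
            call.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            call.run();
        }
        // Convert nanoseconds to milliseconds and average over the runs.
        return (System.nanoTime() - start) / 1e6 / measured;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the actual output() call.
        double avg = averageMillis(() -> Math.sqrt(42.0), 100, 1_000);
        System.out.println("average call time: " + avg + " ms");
    }
}
```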
Version Information
I am using Windows 10 with CUDA and cuDNN, as confirmed by the debug output:
10:07:17.345 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
10:07:19.400 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
10:07:19.441 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 10]
10:07:19.441 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Cores: [12]; Memory: [4,0GB];
10:07:19.441 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
10:07:19.466 [main] INFO o.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 11.6.55
10:07:19.467 [main] INFO o.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce GTX 1070]; cc: [6.1]; Total memory: [8589803520]
10:07:19.468 [main] INFO o.nd4j.linalg.jcublas.JCublasBackend - Backend build information:
MSVC: 192930146
STD version: 201402L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
HAVE_CUDNN
Below is an extract of my pom.xml with the relevant configuration:
<!-- Core -->
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-core</artifactId>
    <version>1.0.0-M2.1</version>
</dependency>
<!-- CUDA engine (I have CUDA 11.8 and cuDNN 9.6 installed on my machine) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-11.6</artifactId>
    <version>1.0.0-M2.1</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-11.6</artifactId>
    <version>1.0.0-M2.1</version>
    <classifier>windows-x86_64-cudnn</classifier>
</dependency>
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>cuda-platform-redist</artifactId>
    <version>11.6-8.3-1.5.7</version>
</dependency>
Additional Information
See attached model, test code, and dependencies.