This test checks both the performance and the correctness of SHARP operations.
This test supports both HOST and CUDA buffers

Build
=====

To build the tests, just type `make`.

If SHARP is not installed in /opt/mellanox/sharp, you can specify SHARP_HOME.


```shell
$ make SHARP_HOME=/path/to/sharp
```
For MPI support:
```shell
$ make MPI=1 MPI_HOME=/path/to/mpi SHARP_HOME=/path/to/sharp
```

Usage
=====

Single process non-MPI usage:
```shell
   $ SHARP_COLL_LOG_LEVEL=3 sharp_coll_test <test options>
```

Multi-process MPI usage
```shell
   $mpirun <mpi_options>  -x SHARP_COLL_LOG_LEVEL=3 sharp_coll_test <test options>
```
Example:

1. Run allreduce barrier perf test on 2 hosts using port mlx5_0
```shell
	$ mpirun -np 2 -H host1,host2 --map-by node -x SHARP_COLL_LOG_LEVEL=3 ./sharp_coll_test  -d mlx5_0:1 --iters 100 --skip 10 --mode perf --collectives allreduce,barrier
```

2. Run allreduce perf test on 2 hosts using port mlx5_0 with CUDA buffers
```shell
	$ mpirun -np 2 -H host1,host2 --map-by node -x SHARP_COLL_LOG_LEVEL=3 ./sharp_coll_test  -d mlx5_0:1 --iters 100 --skip 10 --mode perf --collectives allreduce -M cuda
```

3. Run allreduce perf test on 2 hosts using port mlx5_0 with Streaming aggregation from 4B to 512MB with cuda buffers
```shell
	$ mpirun -np 2 -H host1,host2 --map-by node -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_ENABLE_SAT=1 ./sharp_coll_test  -d mlx5_0:1 --iters 100 --skip 10 --mode perf --collectives allreduce -s 4:536870912 -M cuda
```

For more test options: 
```shell
	$sharp_coll_test -h
```

Test ENV options:

ENABLE_SHARP_COLL=0/1  : To enable toggle between the SHARP and MPI interface to run the test
