This page explains how to transfer a large amount of data to and from the SCITAS clusters.
Installed tools
To transfer data to and from the data nodes, you can use the following tools:
- GridFTP (partially supported, see below)
- rsync
- aspera
...and, of course, the good old but not recommended:
- scp
- sftp
Basic concepts
- Servers to be used for the file transfer:
fdata1.epfl.ch - The fdata servers are connected to the /home and /work shared filesystems and to Fidis' /scratch
The preferred methods for transferring files with the fdata servers are rsync and GridFTP
- The following symbols mean:
- $ → User prompt
- # → root prompt (full power user, in case you need to install something on your workstation)
Access to the service
Only users of the SCITAS clusters who are also members of the group hpc-datamovers can use the data transfer nodes. You can add yourself to this group through the "Register in this group" link (log in with your GASPAR account).
Access is by SSH key only; password authentication is disabled. See below for a detailed how-to.
By joining the above group, your GASPAR login can be used from the internet to connect to these nodes on the standard SSH port using an SSH key. In practice this means there could be attempts to brute-force weak SSH keys. While we have taken security measures to limit such attacks, we cannot stop them. We therefore suggest a key length of 4096 bits for RSA or 521 bits for EC.
Additionally, you can remove yourself from the group once you no longer need remote access.
How to generate an SSH key
On your local machine, generate an SSH key, ensuring a good key length and type
ssh-keygen -b 4096 -t rsa
Copy the public key from your computer
cat ~/.ssh/id_rsa.pub
Connect to a cluster frontend node (helvetios, fidis, izar) and paste the public key in the authorized_keys file
mkdir -p .ssh
chmod 700 .ssh
echo 'ssh-rsa AAAAXXX[...]' >> .ssh/authorized_keys
You should now be able to log in without a password on all the clusters, including the data nodes.
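To confirm that a key meets the recommended length, `ssh-keygen -l` prints the bit size and fingerprint of a public key. The sketch below is only an illustration: it creates a throwaway ECDSA key in a temporary directory (the file name `demo_key` is hypothetical) so that nothing under ~/.ssh is touched. To inspect your real key, you would point `ssh-keygen -l -f` at ~/.ssh/id_rsa.pub instead.

```shell
# Create a throwaway 521-bit ECDSA key in a temporary directory.
# "demo_key" is a hypothetical file name used only for this illustration.
tmpdir=$(mktemp -d)
ssh-keygen -q -t ecdsa -b 521 -N "" -f "$tmpdir/demo_key"

# -l prints "<bits> <fingerprint> <comment> (<type>)" for a public key.
key_info=$(ssh-keygen -l -f "$tmpdir/demo_key.pub")
key_bits=${key_info%% *}   # keep only the first field (the bit length)
echo "Key length: $key_bits bits"

# Remove the throwaway key.
rm -rf "$tmpdir"
```

The empty passphrase (`-N ""`) is acceptable here only because the key is discarded immediately; a real key should be protected with a passphrase.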
Suggested tools
screen
When copying a large amount of data from one site to another (for example from CSCS to SCITAS), it is convenient to use the screen utility to keep the connection open even if, for example, your laptop loses its network connection.
You can read the screen man page or look up some quick tutorials to learn how to take advantage of this tool.
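As a starting point, one possible way to launch a transfer inside screen is sketched below. The helper name `run_in_screen` is hypothetical, not something installed on the clusters; it relies on the `-d -m` options of screen, which start a session that is already detached, and `-S`, which names it.

```shell
# Hypothetical helper: run a command inside a detached screen session.
# Usage: run_in_screen SESSION_NAME COMMAND [ARGS...]
run_in_screen() {
    session_name=$1
    shift
    # -d -m: start the session detached; -S: give it a name for later reattaching.
    screen -dmS "$session_name" "$@"
}
```

For example, `run_in_screen transfer rsync -azH MySims john@fdata1.epfl.ch:/home/john` would start the copy in a session called `transfer`; `screen -ls` lists running sessions and `screen -r transfer` reattaches to it. The session ends when the command finishes, so a session that has disappeared from `screen -ls` means the transfer has stopped, successfully or not.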
GridFTP
GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. We use the implementation provided by the Globus Toolkit. GridFTP is widely used in large computing facilities, for example at CSCS.
For better performance, prefer GridFTP: the difference between scp and GridFTP becomes very apparent when transferring data over WAN (i.e. high-latency) networks.
Installation
On Debian/Ubuntu systems:
$ sudo apt install globus-gass-copy-progs
Testing the connection to fdata servers
A simple test is to list remote directories. To do that, use this command:
$ globus-url-copy -list sshftp://$USER@fdata1.epfl.ch/work/
You should get the content of Fidis' work directory:
sshftp://fdata1.epfl.ch/work/
    .mmLockDir/
    .mmbackupShadow.1.shome.filesys
    .mmbackupShadow.1.shome.filesys.old
    .mmbackupShadow.1.shome.old
    .snapshots/
    aprl/
    c3mp/
    cosmo/
    csea/
    ctmc/
    fsl/
    gr-fe/
    ...
Transferring files with GridFTP
For further details of using this tool, such as tuning the transfer, please consult this page.
Copy a file from the data nodes
$ globus-url-copy [options] SOURCE-URL DESTINATION-URL
$ globus-url-copy sshftp://$USER@fdata1.epfl.ch/home/john/sim_result.dat sim_result.dat
$ globus-url-copy -g2 -cc 2 -tcp-bs 4M sshftp://$USER@fdata1.epfl.ch/home/john/sim_result.dat sim_result.dat
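The tuning options used above are not self-explanatory, so the following sketch assembles a tuned command from named variables and prints it for review instead of running it. The user name, file path and numeric values are illustrative assumptions, not recommended settings; check `globus-url-copy -help` for the options your installed version supports.

```shell
# Assemble a tuned globus-url-copy command and print it for review.
# User name, path and tuning values are illustrative assumptions.
remote_user="john"
remote_file="/home/john/sim_result.dat"
streams=4          # -p: parallel TCP streams per transfer
connections=2      # -cc: concurrent connections (mainly useful with many files)
tcp_buffer="4M"    # -tcp-bs: TCP buffer size

# -vb reports the number of bytes transferred and the transfer rate.
cmd="globus-url-copy -vb -p $streams -cc $connections -tcp-bs $tcp_buffer sshftp://$remote_user@fdata1.epfl.ch$remote_file sim_result.dat"
echo "$cmd"
```

Once the printed command looks right, it can be pasted at the prompt. Larger values are not always faster; the best settings depend on the network path between the two sites.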
Copy a remote directory from the data nodes
Do not forget the trailing slashes ( / )
$ globus-url-copy -r -cd sshftp://$USER@fdata1.epfl.ch/home/john/Documents/ mydoc/
Copy a file to the data nodes
$ globus-url-copy -fast sim_result.dat sshftp://$USER@fdata1.epfl.ch/home/john/
Copy a directory to the data nodes
Do not forget the trailing slashes ( / )
$ globus-url-copy -fast -r MySims/ sshftp://$USER@fdata1.epfl.ch/home/john/MySims/
RSYNC
rsync (Remote SYNChronization) is a utility for efficiently transferring and synchronizing files across computer systems, by checking the timestamp and size of files.
It functions as both a file synchronization and a file transfer program. It is particularly useful because it copies only the differences between the source and the destination, which can save a lot of time.
Using it efficiently, however, requires some familiarity with its options.
Behind the scenes it uses SSH as its transport. In the examples below, -a (archive mode) recurses into directories and preserves permissions, timestamps and symbolic links, -H preserves hard links, and -z compresses the data during the transfer.
The rsync command is installed by default on most Mac and Linux machines. We suggest you read the man page for options beyond those described here.
For Windows, you have to download and install a specific tool.
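The "only the differences" behaviour can be observed without touching the clusters. The sketch below works purely on temporary local directories (the file names are made up for the illustration): after an initial copy, one file is modified and `--dry-run` is used to preview what the next run would transfer.

```shell
# Local demonstration: after a first sync, only the changed file is pending.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
echo "first"  > "$demo/src/a.txt"
echo "second" > "$demo/src/b.txt"

# Initial copy: both files are transferred.
rsync -a "$demo/src/" "$demo/dst/"

# Change one file, then preview the next run without copying anything.
# --out-format='%n' prints just the name of each item that would be updated.
echo "more data" >> "$demo/src/b.txt"
pending=$(rsync -a --dry-run --out-format='%n' "$demo/src/" "$demo/dst/")
echo "Would transfer: $pending"

rm -rf "$demo"
```

The same `--dry-run` option works with the remote examples below and is a cheap way to check a command before launching a long transfer.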
Copy a file to the data nodes
- Open a terminal
Synchronize (copy) your local file:
$ rsync -az MySim.data john@fdata1.epfl.ch:/home/john
Copy a directory to the data nodes
- Open a terminal
Synchronize (copy) your local directory:
$ rsync -azH MySims john@fdata1.epfl.ch:/home/john
Be careful not to type a trailing slash ( / ) after your directory name, as in MySims/.
Doing so copies only the contents of the directory, not the directory itself.
Copy a file from the data nodes
- Open a terminal
Synchronize (copy) your remote file:
$ rsync -az john@fdata1.epfl.ch:/home/john/MySim.data /home/michael
Copy a directory from the data nodes
- Open a terminal
Synchronize (copy) your remote directory:
$ rsync -azH john@fdata1.epfl.ch:/home/john/MySims /home/michael/Documents
Be careful not to type a trailing slash ( / ) after your directory name, as in MySims/.
Doing so copies only the contents of the directory, not the directory itself.
aspera
Aspera is a tool made available in the data nodes mainly for downloading datasets from NIH servers. If you don't know what it is then don't worry!
It allows connections to remote servers running a specific server software.
The necessary components of the sra-toolkit and aspera-cli are already installed on the fdata nodes, and you have access to the prefetch and ascp commands.
Glossary
- ascp is a command-line FASP transfer program.
For information on the ascp program, see ascp: Transferring from the Command Line
- ascp4 is a FASP transfer program similar to ascp that has been optimized for sending large sets of individual files and can support UDP multicast through Aspera FASPStream.
For information on A4, see Transferring with ascp4
- prefetch is a command-line tool which is part of the NCBI SRA Toolkit.
Goal of this file transfer method
The purpose of this method is to download files into our clusters. It is not possible to copy data from the fdata nodes using this method.
In particular, it is used to access sequencing data in the NCBI Sequence Read Archive using the prefetch command of the SRA Toolkit.
Transferring files using prefetch + ascp
In this example we use the prefetch command from the SRA Toolkit to fetch a specific dataset: SRR2761461.
These commands need to be executed on the fdata nodes themselves, so you need to start by logging into one of the nodes.
$ ssh fdata1.epfl.ch
user@fdata1 $ export ASPERAKEY=/etc/asperaweb_id_dsa.openssh
user@fdata1 $ prefetch -v -t ascp -a "$(which ascp)|$ASPERAKEY" SRR2761461
By default, the prefetch command copies the data into a fixed directory structure under your $HOME directory.
user@fdata1 $ ls $HOME/ncbi/public/sra/SRR2761461.sra
/home/user/ncbi/public/sra/SRR2761461.sra
Other available tools
You may be used to scp and sftp for transferring files; however, we discourage their use and suggest the methods described above.
SCP
SCP (Secure Copy Protocol) is an older protocol, but it is almost universally supported on Unix-like platforms as part of the SSH protocol suite.
This part describes how to transfer files to/from a remote server by using the scp command.
scp is installed by default on all Mac and Linux machines.
For Windows, you have to download and install a specific tool. We recommend WinSCP.
Copy a file to the data nodes
- Open a terminal
Type the scp command as described:
$ scp path_to_my_file username@server_name:/path_to_destination
Example:
$ scp sim_result.dat john@fdata1.epfl.ch:/scratch/john
Copy a directory to the data nodes
- Open a terminal
Copy your directory on the remote server:
$ scp -r MySims john@fdata1.epfl.ch:/home/john/
Copy a remote file locally
- Open a terminal
Type the scp command as described:
$ scp username@server_name:/path_to_remote_file path_to_copy_file_locally
Example:
$ scp john@fdata1.epfl.ch:/scratch/john/sim_result.dat /home/michael/Documents
Copy a remote directory locally
- Open a terminal
Copy your remote directory locally:
$ scp -r john@fdata1.epfl.ch:/scratch/john/MySims /home/michael/Documents
SFTP
The SSH File Transfer Protocol (SFTP) is a network protocol that provides file access, file transfer, and file management over a secure connection.
SFTP is more elaborate than SCP, and allows interactive commands to do things like creating/deleting directories and files.
The sftp command is installed by default on all Mac and Linux machines.
For Windows, you have to download and install a specific tool.
We recommend WinSCP.
Copy a local file to the data nodes
- Open a terminal
Open a session on the remote server:
$ sftp john@fdata1.epfl.ch
Copy your file to fdata:
sftp> put MySim.data
Copy a local directory to the data nodes
First, create a destination directory on the remote server with the same name as the directory you want to transfer:
sftp> mkdir MySims
Copy your directory:
sftp> put -r MySims
Copy a remote file locally
sftp> get myfile.dat
Copy a remote directory locally
sftp> get -r MySims
Annexes
GridFTP Client installation
You'll find here examples of installation for the following Operating Systems (more to come):
- Ubuntu 16.04.4 LTS
- Red Hat Enterprise Linux Server release 7.4 (Maipo)
Installation procedure
- First, download the repository package.
Select the latest link (http://toolkit.globus.org/toolkit/downloads/6.0/ in this example) and fetch
globus-toolkit-repo_latest_all.deb (Ubuntu/Debian) or globus-toolkit-repo-latest.noarch.rpm (RedHat/CentOS)
Install the repository:
# dpkg -i globus-toolkit-repo_latest_all.deb (Ubuntu/Debian)
or
# rpm -Uvh globus-toolkit-repo-latest.noarch.rpm (RedHat/CentOS)
Install the client:
# apt install globus-data-management-client (Ubuntu/Debian)
or
# yum install globus-data-management-client (RedHat/CentOS)