
This page explains how to transfer a large amount of data to and from the SCITAS clusters.

Installed tools

To transfer data to and from the data nodes, you can use the following tools:

  • GridFTP (currently unavailable, see below)
  • rsync
  • aspera

...and, of course, the older tools below, which we do not recommend:

  • scp
  • sftp



Basic concepts

  • Servers to be used for the file transfer:
    fdata1.epfl.ch
  • The fdata servers are connected to the /home and /work shared filesystems and to Fidis' /scratch  
  • The preferred methods for transferring files with the fdata servers are rsync and, when it is available, GridFTP

  • The following symbols mean:
    • $  →  User prompt
    • #  →  root prompt (full power user, in case you need to install something on your workstation)


Access to the service

Only users of the SCITAS clusters who are also members of the group hpc-datamovers can use the data transfer nodes.

Access is by SSH key only; password authentication is disabled. See below for detailed instructions.

By joining the above group, your Gaspar login can be used from the internet to connect to these nodes on the standard SSH port with an SSH key (password login is disabled). In practice this means that there could be attempts to brute-force weak SSH keys. While we have taken security measures to limit such attacks, we cannot stop them. We therefore suggest a key length of 4096 bits for RSA or 521 bits for ECDSA.

Additionally, you can remove yourself from the group once you no longer need remote access.

How to generate an SSH key

On your local machine, generate an SSH key, ensuring a good key length and type

ssh-keygen -b 4096 -t rsa

Copy the public key from your computer

cat ~/.ssh/id_rsa.pub

Connect to a cluster frontend node (helvetios, fidis, deneb1) and paste the public key into the authorized_keys file:

mkdir -p ~/.ssh
chmod 700 ~/.ssh
echo 'ssh-rsa AAAAXXX[...]' >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

You should now be able to log in without a password on all the clusters, including the data nodes.
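The key installation steps above can also be wrapped in a small helper that is safe to run more than once. This is only a sketch: the function name add_authorized_key and its optional second argument (the target directory) are hypothetical and not part of any SCITAS tooling.

```shell
# Hypothetical helper: add a public key to authorized_keys only if it is not
# already there, and apply the conventional permissions (700 / 600).
add_authorized_key() {
    key="$1"                      # full public key line, e.g. the content of id_rsa.pub
    ssh_dir="${2:-$HOME/.ssh}"    # target directory, defaults to ~/.ssh

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"

    # -x: match the whole line, -F: treat the key as a fixed string, -q: quiet
    if ! grep -qxF -- "$key" "$ssh_dir/authorized_keys"; then
        printf '%s\n' "$key" >> "$ssh_dir/authorized_keys"
    fi
}

# Example call on a frontend node (the key string is a placeholder):
#   add_authorized_key 'ssh-rsa AAAAXXX[...]'
```

Running the helper twice with the same key leaves a single entry in the file, which avoids the duplicates that repeated echo >> commands would create.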



Suggested tools

screen

When copying a large amount of data from one site to another (for example from CSCS to SCITAS), it is convenient to use the screen utility to keep the connection open even if, for example, your laptop loses its network connection.

You can read the screen man page or look up some quick tutorials to learn how to take advantage of this tool.
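As a rough illustration of the screen workflow, the sketch below starts a transfer inside a detached, named session so that it keeps running after you disconnect. The function name start_transfer and the session name bigcopy are hypothetical; run it on a machine that stays online (for example a data node), not on the laptop that may lose its connection.

```shell
# Hypothetical helper: start a detached screen session called "$1"
# that runs the remaining arguments as a command.
start_transfer() {
    name="$1"
    shift
    # -d -m: start the session detached; -S: give it a name
    screen -dmS "$name" "$@"
}

# Usage (commented out because it needs screen and a reachable server):
#   start_transfer bigcopy rsync -azH MySims john@fdata1.epfl.ch:/home/john
#   screen -ls           # list running sessions
#   screen -r bigcopy    # reattach; press Ctrl-a then d to detach again
```

The session ends when the command finishes. If the transfer needs you to type something (a key passphrase, for instance), start an interactive session with screen -S bigcopy instead and run the command inside it.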

GridFTP

GridFTP is currently not available on the data transfer nodes. Please contact us if you would like to use it.


GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. We use the implementation provided by the Globus Toolkit. GridFTP is widely used in large computing facilities, for example at CSCS.


For better performance, use this tool: the difference between scp and GridFTP becomes very apparent when transferring data over WAN (i.e. high-latency) networks.

Testing the connection to fdata servers

A simple test is to list remote directories. To do that, use this command:

$ globus-url-copy -list sshftp://fdata1.epfl.ch/work/

You should get the content of fidis' work directory:

sshftp://fdata1.epfl.ch/work/
    .mmLockDir/
    .mmbackupShadow.1.shome.filesys 
    .mmbackupShadow.1.shome.filesys.old 
    .mmbackupShadow.1.shome.old 
    .snapshots/
    aprl/
    c3mp/
    cosmo/
    csea/
    ctmc/
    fsl/
    gr-fe/
...

Transferring files with GridFTP

For further details of using this tool, such as tuning the transfer, please consult this page.

Copy a file from the data nodes

Synopsis
$ globus-url-copy [options] SOURCE-URL DESTINATION-URL


Example
$ globus-url-copy sshftp://fdata1.epfl.ch/home/john/sim_result.dat sim_result.dat


For better performance
$ globus-url-copy -g2 -cc 2 -tcp-bs 4M sshftp://fdata1.epfl.ch/home/john/sim_result.dat sim_result.dat

Copy a remote directory from the data nodes

Do not forget the trailing slashes ( / )


Example
$ globus-url-copy -r -cd sshftp://fdata1.epfl.ch/home/john/Documents/ mydoc/

Copy a file to the data nodes

Example
$ globus-url-copy -fast sim_result.dat sshftp://fdata1.epfl.ch/home/john/

Copy a directory to the data nodes

Do not forget the trailing slashes ( / )

Example
$ globus-url-copy -fast -r MySims/ sshftp://fdata1.epfl.ch/home/john/MySims/




RSYNC

rsync (Remote SYNChronization) is a utility for efficiently transferring and synchronizing files across computer systems, by checking the timestamp and size of files.

It functions as both a file synchronization and a file transfer program. Because it can copy only the differences between the source and the destination, it can save a lot of time on repeated transfers.
Using it efficiently does, however, require some familiarity with its options.

For remote transfers rsync runs over SSH (it does not use the SCP protocol). In the examples below, -a (archive mode) recurses into directories and preserves permissions, timestamps and symbolic links, -H preserves hard links, and -z compresses the data in transit.


The rsync command is installed by default on most Mac and Linux machines. We suggest reading the man page for more options than are described here.


For Windows, you have to download and install a specific tool.
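To see the incremental behaviour described above without touching the network, the following sketch synchronizes two local temporary directories twice. The directory and file names (src, dst, a.dat, b.dat) are made up for the illustration; the same options apply when the destination is john@fdata1.epfl.ch:/home/john.

```shell
# Local demonstration of incremental copies (no network needed).
work=$(mktemp -d)
mkdir -p "$work/src" "$work/dst"
echo "first"  > "$work/src/a.dat"
echo "second" > "$work/src/b.dat"

# First run: both files are new, so both are copied.
rsync -a "$work/src/" "$work/dst/"

# Change only b.dat; its size changes, so rsync's size/timestamp check notices it.
echo "more data" >> "$work/src/b.dat"

# Second run: --itemize-changes lists what rsync actually updates.
changes=$(rsync -a --itemize-changes "$work/src/" "$work/dst/")
printf '%s\n' "$changes"
```

On the second run only b.dat should be listed, because a.dat still has the same size and timestamp on both sides. Adding --dry-run (-n) to a real transfer shows the same list without copying anything.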

Copy a file to the data nodes

  1. Open a terminal
  2. Synchronize (copy) your local file:

    $ rsync -az MySim.data john@fdata1.epfl.ch:/home/john

Copy a directory to the data nodes

  1. Open a terminal

  2. Synchronize (copy) your local directory:

    $ rsync -azH MySims john@fdata1.epfl.ch:/home/john

Be careful not to type a trailing slash ( / ) after your directory name, as in MySims/ .
Doing so copies only the contents of the directory, not the directory itself.
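The trailing-slash rule can be checked locally before starting a large transfer. The sketch below uses temporary directories with made-up names (without_slash, with_slash, run1.dat) and copies the same MySims directory both ways.

```shell
# Local demonstration of rsync's trailing-slash rule (no network needed).
demo=$(mktemp -d)
mkdir -p "$demo/MySims" "$demo/without_slash" "$demo/with_slash"
echo "data" > "$demo/MySims/run1.dat"

# No trailing slash on the source: the directory itself is copied.
rsync -a "$demo/MySims" "$demo/without_slash/"

# Trailing slash on the source: only the directory contents are copied.
rsync -a "$demo/MySims/" "$demo/with_slash/"

# Show where the file ended up in each case.
find "$demo/without_slash" "$demo/with_slash" -type f
```

The first destination ends up containing MySims/run1.dat, while the second contains run1.dat directly.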

Copy a file from the data nodes

  1. Open a terminal
  2. Synchronize (copy) your remote file:

    $ rsync -az john@fdata1.epfl.ch:/home/john/MySim.data /home/michael

Copy a directory from the data nodes

  1. Open a terminal
  2. Synchronize (copy) your remote directory:

    $ rsync -azH john@fdata1.epfl.ch:/home/john/MySims /home/michael/Documents

Be careful not to type a trailing slash ( / ) after your directory name, as in MySims/ .
Doing so copies only the contents of the directory, not the directory itself.




aspera 


Aspera is a tool made available in the data nodes mainly for downloading datasets from NIH servers. If you don't know what it is then don't worry!

It allows connections to remote servers running a specific server software.

The necessary components of the sra-toolkit and aspera-cli are already installed on the fdata nodes, and you have access to the prefetch and ascp commands.

Glossary

  • ascp is a command-line FASP transfer program.
    For information on the ascp program, see ascp: Transferring from the Command Line
  • ascp4 is a FASP transfer program similar to ascp that has been optimized for sending large sets of individual files and can support UDP multicast through Aspera FASPStream.
    For information on ascp4, see Transferring with ascp4
  • prefetch is a command-line tool which is part of the NCBI SRA Toolkit.

Goal of this file transfer method

The purpose of this method is to download files to our clusters. It is not possible to copy data from the fdata nodes using this method.

In particular, it is used to access sequencing data in the NCBI Sequence Read Archive using the prefetch command of the SRA Toolkit.

Transferring files using prefetch + ascp

In this example we use the prefetch command from the SRA Toolkit to fetch a specific dataset: SRR2761461.

These commands need to be executed on the fdata nodes themselves, so you need to start by logging into one of the nodes.

Example
$ ssh fdata1.epfl.ch
user@fdata1 $ export ASPERAKEY=/etc/asperaweb_id_dsa.openssh
user@fdata1 $ prefetch -v -t ascp -a "$(which ascp)|$ASPERAKEY" SRR2761461

The prefetch command will by default copy the data into a fixed directory structure under your $HOME directory.

Where to find the data
user@fdata1 $ ls $HOME/ncbi/public/sra/SRR2761461.sra 
/home/user/ncbi/public/sra/SRR2761461.sra




Other available tools

You may be used to scp and sftp for transferring files; however, we discourage their use and suggest the methods described above.

SCP 

SCP (Secure Copy Protocol) is an older protocol, but it is almost universally supported on Unix-like platforms as part of the SSH protocol suite.


This part describes how to transfer files to/from a remote server by using the scp command.

scp is installed by default on all Mac and Linux machines.

For Windows, you have to download and install a specific tool. We recommend using the program WinSCP.

Copy a file to the data nodes

  1. Open a terminal
  2. Type the scp command as described:

    $ scp path_to_my_file  username@server_name:/path_to_destination

    Example:

    $ scp sim_result.dat john@fdata1.epfl.ch:/scratch/john

Copy a directory to the data nodes

  1. Open a terminal
  2. Copy your directory on the remote server:

    $ scp -r MySims john@fdata1.epfl.ch:/home/john/

Copy a remote file locally

  1. Open a terminal
  2. Type the scp command as described:

    $ scp username@server_name:/path_to_remote_file  path_to_copy_file_locally 

    Example:

    $ scp john@fdata1.epfl.ch:/scratch/john/sim_result.dat /home/michael/Documents

Copy remote directory locally

  1. Open a terminal
  2. Copy locally your remote directory:

    $ scp -r john@fdata1.epfl.ch:/scratch/john/MySims /home/michael/Documents


SFTP

The SSH File Transfer Protocol (SFTP) is a network protocol that provides file access, file transfer, and file management functionalities over a secure connection.

SFTP is more elaborate than SCP, and allows interactive commands to do things like creating/deleting directories and files.


The sftp command is installed by default on all Mac and Linux machines.

For Windows, you have to download and install a specific tool. We recommend using the program WinSCP.

Copy local file to the data nodes

  1. Open a terminal
  2. Open a session on the remote server:

    $ sftp john@fdata1.epfl.ch
  3. Copy your file to fdata:

    sftp> put MySim.data 

Copy local directory to the data nodes

  1. First, create a destination directory on the remote server with the same name as the directory you want to transfer:

    sftp> mkdir MySims
  2. Copy your directory:

    sftp> put -r MySims

Copy a remote file locally

sftp> get myfile.dat

Copy a remote directory locally

sftp> get -r MySims




Annexes

GridFTP Client installation

You'll find here examples of installation for the following Operating Systems (more to come):

  • Ubuntu 16.04.4 LTS
  • Red Hat Enterprise Linux Server release 7.4 (Maipo)

Installation procedure

  • First of all, download the repository from here
  • Select the latest link (http://toolkit.globus.org/toolkit/downloads/6.0/ in this example)

    globus-toolkit-repo_latest_all.deb (Ubuntu/Debian)
    or
    globus-toolkit-repo-latest.noarch.rpm (RedHat/CentOS)
  • Install the repository:

    # dpkg -i globus-toolkit-repo_latest_all.deb (Ubuntu/Debian)
    or
    # rpm -Uvh globus-toolkit-repo-latest.noarch.rpm (RedHat/CentOS)
  • Install the client:

    # apt install globus-data-management-client (Ubuntu/Debian)
    or
    # yum install globus-data-management-client (RedHat/CentOS)

