CALCULA Setup

Welcome to the Calcula Computing Service

As you begin your work, you will quickly find that the computational demands of training models and processing large datasets exceed what a standard laptop can handle. To support your research, we provide Calcula, a high-performance shared computing cluster.

Because Calcula is a multiuser environment managed by a system administrator, your experience here will differ from a personal computer. You do not have "administrative" or "root" privileges; you cannot install software or libraries globally. Instead, you have your own home folder with a limited disk space quota. You must use specific tools to manage your own software and hardware access within these constraints.

Security & Access: To protect our resources, the cluster is not directly open to the internet. The very first step to using Calcula is connecting via VPN. We provide you with a .ovpn configuration file and credentials; you must have this connection active before you can log in to the system.
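As a sketch, on macOS or Linux the VPN can typically be started from a terminal with the OpenVPN client (the profile filename below is illustrative; use the one we send you):

```shell
# Start the VPN using the provided profile (keep this running while you work)
sudo openvpn --config calcula.ovpn
```

On Windows, importing the .ovpn file into a graphical client such as the official OpenVPN GUI and clicking Connect achieves the same result.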

To access the cluster, we use the SSH (Secure Shell) protocol. If you are a Windows user, we recommend using MobaXterm, as it provides an integrated terminal along with a graphical file browser to easily manage your data. macOS and Linux users can connect directly through their built-in system terminal. Refer to the connection instructions for the specific research group you are working with for details on the ssh command.
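For illustration, a typical connection from a macOS or Linux terminal looks like the following (the hostname here is a placeholder; use the address given in your group's connection instructions):

```shell
# Log in to the access server over SSH (hostname and username are illustrative)
ssh your_username@calcula.example.org
```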

1. The Architecture: Why You Can't "Just Run It"

Calcula is split into two distinct parts to ensure the system remains stable for everyone.

  • The Access Server (The Entry Point): This is where you land when you log in via SSH. It is a shared "front door" for editing code and managing files. Because many users are logged in simultaneously and the server has modest RAM and CPU resources, heavy processes must never be run here. If you try to run a training script or a heavy simulation on the access server, it will likely fail due to a lack of resources or, worse, saturate the server and prevent your colleagues from working.
  • The Computing Nodes (The Muscles): These are separate, powerful servers equipped with massive amounts of RAM and multiple GPUs. This is where the real work happens.
  • SLURM (The Traffic Cop): Since you cannot run code directly on the computing nodes, you submit a "job" to SLURM, describing the resources you need (e.g., "1 GPU and 32 GB of RAM"). SLURM places your job in a queue and runs it on a computing node as soon as the requested resources become available.

A Seamless Experience: Although your code runs on a different physical server, the process is largely transparent. All servers share the same networked storage. This means your files, code, and datasets are available on every node automatically—you do not have to manually transfer files between servers.
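To make this concrete, a minimal SLURM job script might look like the following sketch (the script name, environment name, module versions, and resource values are assumptions; adapt them to your group's instructions):

```shell
#!/bin/bash
#SBATCH --job-name=train-model     # name shown in the queue
#SBATCH --gres=gpu:1               # request 1 GPU
#SBATCH --mem=32G                  # request 32 GB of RAM
#SBATCH --cpus-per-task=4          # CPU cores, e.g. for data loading
#SBATCH --time=24:00:00            # wall-clock time limit
#SBATCH --output=train_%j.log      # %j expands to the job ID

module load cuda/13.0 cudnn/8.7    # system libraries via Modules
conda activate my_env              # your Python packages
python train.py
```

You would submit it with `sbatch train_job.sh` and check its position in the queue with `squeue -u $USER`.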

2. Virtual Environments: Your Personal Sandbox

While the Modules system (invoked with the module load command) handles system-level libraries like CUDA, Virtual Environments (Conda or Virtualenv) handle your Python packages, such as PyTorch, OpenCV, or NumPy.

Because you only have write access to your home folder and must stay within your 50 GB quota, you cannot use pip install globally. Virtual environments solve this:

  • Isolation: A virtual environment is a private folder containing its own Python executable and its own set of installed libraries.
  • Safety: You can install, delete, or break things inside your environment without affecting the rest of the system or other researchers.
  • Reproducibility: You can "freeze" your environment so that when you publish your paper, others can recreate the exact same setup you used.
  • Multiple Contexts: You can create as many environments as you like. You might have one for "Object Detection" and another for "Generative Models." You can switch between them easily, ensuring that updating a package for one project doesn't accidentally break another.

You only need to create an environment once. However, just like modules, you must activate it every time you log in to tell Python to use your local folder instead of the system default.
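With Conda, for instance, this workflow might look like the following sketch (the environment name and packages are illustrative):

```shell
# Create the environment once
conda create --name my_env python=3.11
# Activate it at the start of every session
conda activate my_env
# Packages now install into the environment, not the system
conda install numpy
# "Freeze" the setup so others can reproduce it
conda env export > environment.yml
# Recreate it elsewhere from the exported file
conda env create --file environment.yml
```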

Choosing Your Environment Manager: Conda vs. Virtualenv

When deciding between Conda and Virtualenv, it is important to consider how each interacts with the Calcula infrastructure.

  • Conda: Can install non-Python dependencies (like specific C++ libraries) directly into your environment, making it robust for complex stacks. However, it is heavy and consumes significant disk space. With your 50 GB quota, monitor your usage closely.
  • Virtualenv: Much more disk-efficient and lightweight. However, it relies more on the Modules system; if your code needs a specific CUDA version, you must load the corresponding Module yourself, as Virtualenv won't install it for you.
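With the Virtualenv approach (shown here via Python's built-in venv module), a minimal sketch looks like this, after loading any Modules your code needs; the environment path is illustrative:

```shell
# Create the environment once, inside your home folder
python3 -m venv "$HOME/envs/detection"
# Activate it in every new shell session
source "$HOME/envs/detection/bin/activate"
# Confirm the interpreter now comes from the environment
python -c 'import sys; print(sys.prefix)'
```

Here, `pip install` and `pip freeze > requirements.txt` play the roles that `conda install` and `conda env export` play in a Conda workflow.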

3. Persistent Sessions with tmux

In research, it is common to run experiments that take several hours or even days to complete. If you are connected to the server via a standard SSH session and your Wi-Fi drops or you close your laptop, your connection is severed and any active processes tied to that session will be killed.

To prevent this, we use a terminal multiplexer called tmux. It keeps your terminal "alive" on the server even when you are disconnected. You can start a tmux session, start your work, and then "detach." The server keeps that session running in the background, allowing you to "re-attach" later from any location to check on your progress.
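In practice the cycle looks like this (the session name is arbitrary; Ctrl-b is tmux's default prefix key):

```shell
# Start a new named session on the access server
tmux new -s experiment
# ... launch your long-running work inside it ...
# Detach: press Ctrl-b, then d  (the session keeps running)

# Later, from any SSH connection:
tmux ls                    # list your live sessions
tmux attach -t experiment  # re-attach and check progress
```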

Summary of the Workflow

To use Calcula, your routine will always follow these steps:

  1. Connect to the VPN using your .ovpn file and credentials.
  2. Log in to the Access Server via SSH (using MobaXterm or your terminal).
  3. Start/Attach a tmux session to ensure your work persists if you lose connection.
  4. Load Modules (e.g., module load cuda/13.0 cudnn/8.7) to access necessary system libraries.
  5. Activate Environment (e.g., conda activate my_env) to access your Python packages.
  6. Submit to SLURM to move your execution from the modest access server to a powerful computing node. Because storage is shared, your results will be waiting for you in your home folder once the job finishes.

By following this workflow, you keep the system stable for your peers while ensuring your research is reproducible and isolated from conflicting software.