Welcome to RQL's Blog!

Do Better Every Day!

linux任务投递系统

2024-07-10

renql

linux

shell

sge任务调度系统
sbatch任务投递系统

sge任务调度系统

任务状态：
qw 表示等待状态
Eqw 投递任务出错
r 表示任务正在运行
dr 节点挂了之后，删除任务就会出现这个状态，只有节点重启之后，任务才会消失

qdel -u username ：删除某用户名下的所有任务
qstat -u \* ：查看所有用户的任务
for ((i=316893; i<=316907; i++)); do rm LW_OC.chr*.o${i}; done 删除文件

#!/bin/bash
#$ -N meth_matrix  #任务名称
#$ -pe smp  4  # 作业请求使用一个具有共享内存并行（smp）环境的计算节点，并指定作业可以使用该节点上的 4 个处理器（或核心）进行并行计算。
#$ -l mem_free=20G # 作业请求了至少 20GB 的可用内存
#$ -l vf=20G # 作业请求了 10GB 的虚拟内存
#$ -cwd # 作业将在提交时的当前工作目录中执行
#$ -j y
#$ -o ~/stdout/

虚拟内存是指系统为了扩展物理内存而使用的一种技术，通常是物理内存加上硬盘空间。这意味着即使系统中没有足够的物理内存，作业也可以通过虚拟内存来执行，但性能可能会受到影响。

sbatch任务投递系统

sbatch job.sh

sinfo # 查看资源状态
# idel为空闲，mix为节点部分核心可以使用，alloc为已被占用; 
# down表示该节点当前不可用，通常是由于硬件故障、网络问题或其他维护需要。节点处于这种状态时，无法分配给新的作业。

squeue # 查看作业队列

scancel 123 # 取消作业ID为123的作业
scancel -t PENDING -u `whoami` # 取消自己账号下所有状态为pending的作用

job.sh的脚本可以这样编写

#!/bin/bash
#SBATCH -J job.$sample # 作业名
#SBATCH -o job.$sample.%j.out 
#SBATCH --exclude=node8
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --mem=6G

mpirun -np 100 python test.py

Read All

根据行政区划mask数据

2022-08-29

renql

python ncl

mask

ncl
python
参考

shapefile，是美国环境系统研究所公司（ESRI）开发的空间数据开放格式，用于描述几何体对象：点、折线与多边形。可以存储井、河流、湖泊等空间对象的几何位置。除了几何位置，shp文件也可以存储这些空间对象的属性，例如河流的名字、城市的温度等等。

Shapefile文件指的是一种文件存储的方法，实际上该种文件格式是由多个文件组成的。其中三个文件必不可少（”.shp”, “.shx”与 “.dbf”）。表示同一数据的一组文件其文件名前缀应该相同。其中“真正”的Shapefile的后缀为shp，然而仅有这个文件数据是不完整的，必须要把其他两个附带上才能构成一组完整的地理数据。所有文件必须位于同一目录之中。

.shp — 图形格式，用于保存元素的几何实体。
.shx — 图形索引格式。几何体位置索引，记录每一个几何体在shp文件之中的位置，能够加快向前或向后搜索一个几何体的效率。
.dbf — 属性数据格式，以dBase III+ 的数据表格式存储每个几何形状的属性数据。

ncl

ncl可以用addfile读取shp文件，示例如下

	river                          = True
    river@gsLineThicknessF         = mp_thick+1.0
    river@gsLineColor              = "black"
	plotrv = gsn_add_shapefile_polylines(wks,plot,"/home/river.nc",river)

load "./shapefile_utils.ncl"
begin
  print_shapefile_info("gadm36_CHN_2.shp") #
  plot_shapefile("gadm36_CHN_2.shp") #

  f = addfile("gadm36_CHN_2.shp", "r")   ; Open shapefile
  segments = f->segments
  geometry = f->geometry
  segsDims = dimsizes(segments) ;2D
  geomDims = dimsizes(geometry) ;2D

  geom_segIndex = f@geom_segIndex ;=0
  geom_numSegs  = f@geom_numSegs  ;=1
  segs_xyzIndex = f@segs_xyzIndex ;=0
  segs_numPnts  = f@segs_numPnts  ;=1

  lines       = new(segsDims(0),graphic)   ; Array to hold polygons
  numFeatures = geomDims(0) ;=dimsizes(NAME_0)

  lon   = f->x
  lat   = f->y
  name  = f->NAME_2

  segNum = 0
  do i=0, numFeatures-1  
     startSegment = geometry(i, geom_segIndex)
     numSegments  = geometry(i, geom_numSegs)
     print(name(i)+" startSegment:"+startSegment+" numSegments:"+numSegments)
     do seg=startSegment, startSegment+numSegments-1
        startPT = segments(seg, segs_xyzIndex)
        endPT = startPT + segments(seg, segs_numPnts) - 1
        print("startPT:"+startPT+" endPT:"+endPT) 
        ;+" lon(startPT:endPT):"+lon(startPT:endPT)+"lat(startPT:endPT):"+lat(startPT:endPT))
        segNum = segNum + 1
     end do
  end do
end

python

有一个叫regionmask的程序包，可以实现该功能。其函数介绍页面 https://regionmask.readthedocs.io/en/stable/api.html

import cmaps, regionmask
import geopandas as gpd

ds = xr.open_dataset('~/data/t2m.nc')
var = ds['t2m'] # 三维数组

shp_file = '~/data/gadm36_CHN_0.shp'
shpf = gpd.read_file(shp_file) 
mask = regionmask.mask_geopandas(shpf, ds)
var = var.where(mask==0.0).mean(('lat','lon')) # 虽然var和mask的维数不同，但不影响
# ds也可以换成var，只要是包含经纬度信息的xarray数据就可以
# 且默认ds中经纬度名字为lat，lon，若不是，需要指定，此时语法是
mask = regionmask.mask_geopandas(shpf, ds, 'longitude', 'latitude')
# 也可以在shpf后直接跟lon和lat数组,且该数组可以是一维或二维，可以是xarray或array
mask = regionmask.mask_geopandas(shpf, ds['longitude'], ds['latitude'])
# 无论哪种方法，都只返回一个二维数组xarray.DataArray mask，其中不位于shp_file定义的区域内的格点为nan
# 由于此处shp_file只定义了一个区域，所以在区域中的格点为0
# 若shp_file有多个区域，则mask中的值从0-len(区域)

# regionmask中已包含了许多常见的区域划分，可直接调用，不需要使用shapefile
# natural_earth_v5_0_0.countries_110中包含177个区域，其中China的编号是139
region = regionmask.defined_regions.natural_earth_v5_0_0.countries_110
mask = region.mask(var) #输入经纬度信息的方法和返回数组信息同regionmask.mask_geopandas
var = var.where(mask==139)

参考

https://zh.m.wikipedia.org/zh-hans/Shapefile
https://www.ncl.ucar.edu/Applications/shapefiles.shtml
https://regionmask.readthedocs.io/en/stable/notebooks/geopandas.html

Read All

python中的nan和inf

2022-07-26

renql

python

缺测 numpy xarray

运算函数
设置缺测 masking

nan: not a number
inf: 无穷大

xarray的DataArray的求和求平均函数会自动忽略 nan。但若有 inf，则求和与求平均的结果都是 inf。
numpy中大多数函数都不会忽略 nan 和 inf。若数组中有 nan 和 inf, 求和求平均的结果都是 nan or inf。举例如下

运算函数

a = np.array([1,2,3,np.nan,4,5])
print('%f, %f'%(a.min(),a.max())) # 结果为 nan，nan
print('%f, %f'%(np.nanmin(a),np.nanmax(a))) # 结果为1，5
print('%f, %f'%(a.sum(),a.mean())) # 结果也是 nan，nan
print('%f, %f'%(np.nansum(a),np.nanmean(a))) # 结果为 15， 3

a = np.array([1,2,np.nan,np.inf,4,5])
print('%f, %f'%(a.min(),a.max())) # 结果为 nan，nan
print('%f, %f'%(np.nanmin(a),np.nanmax(a))) # 结果 1, inf
print('%f, %f'%(a.sum(),a.mean())) # 结果 nan，nan
print('%f, %f'%(np.nansum(a),np.nanmean(a))) # 结果 inf， inf

a = np.array([1,2,3,6,4,5])
a = np.ma.array(a,mask=(a==5)) # 结果 [1.0 2.0 3.0 6.0 4.0 --]
print('%f, %f'%(a.sum(),a.mean())) # 结果 16.000000, 3.200000
print('%f, %f'%(np.nansum(a),np.nanmean(a))) # 结果同上

# 使用scipy.stats.计算相关系数时，不能有缺测存在，因此需要替换缺测  
# 可以自己指定将nan，np.inf替换为某个具体数值，或者使用默认的数值  
new = np.nan_to_num(a, nan=0, posinf=999, neginf=-999)

np.sqrt(a), 若a是实数且a中有负数，则负数的计算值为nan，若a是复数且有负数，则负数的计算结果为复数

设置缺测 masking

a = np.array([1,2,np.nan,np.inf,4,5])
b = np.ma.array(a,mask=(a==4))
# 得到的b是
# masked_array(data=[1.0, 2.0, nan, inf, --, 5.0],
#             mask=[False, False, False, False,  True, False],
#       fill_value=1e+20)


da = xr.DataArray(np.arange(16).reshape(4, 4), dims=["x", "y"])
da.where(da.x + da.y < 4) # 保留满足条件的变量，未满足条件的变量设为nan
# da.where(da.x + da.y < 4，0)，会将未满足条件的变量设为0
# output is 
# <xarray.DataArray (x: 4, y: 4)>
# array([[ 0.,  1.,  2.,  3.],
#        [ 4.,  5.,  6., nan],
#        [ 8.,  9., nan, nan],
#        [12., nan, nan, nan]])
# Dimensions without coordinates: x, y

Read All

reference count问题

2022-07-25

renql

python

易犯错误
最近发现 python 一个非常操蛋的内存存储问题，好像叫 reference count，一不留心就容易计算失误。
```
import numpy as np
def diff(a):
    b=a 
    b[1:4]=a[1:4]-a[2:]
    return b
a=np.array([0,2,1,3,4])
c=diff(a)
print(a) # array([ 0,  1, -2, -1,  4])
print(c) # array([ 0,  1, -2, -1,  4])
```
可以看到变量a也发生了变化。问题在于b=a这个命令在python中不属于赋值，只是为a多增加了一个索引，a和b都指向同一个内存地址。因此改变b的同时，a也会发生变化。 b=a[0:2]numpy数组的切片赋值也存在类似的问题。但列表的切片赋值不存在这个问题。

避免这个问题的解决方法是b=a.copy()。或者a进行简单运算后再赋值给b，此时a与b指向不同的内存地址，如下例所示。
```
b=a/1.0 # 
b[0]=2
print(b) # array([ 2.,  1., -2., -1.,  4.])
print(a) # array([ 0,  1, -2, -1,  4])
```
xarray中也存在类似的问题，解决方法b=foo[1:4].data.copy()。
```
import xarray as xr
a=np.array([0,2,1,3,4])
foo = xr.DataArray(a, dims=["time"])

b=foo[1:4].data
b[0]=10

print(foo) # foo也发生了变化
# <xarray.DataArray (time: 5)>
# array([ 0, 10,  1,  3,  4])
# Dimensions without coordinates: time

print(b)
# [10  1  3]
```
Read All

数组矩阵

2022-06-25

renql

python

numpy, pandas

创建数组
数据类型转换
数学运算
逻辑运算
文本文件操作

numpy主要用于数组与矩阵运算，主要数据结构就是N维同类型数组array，可以用array.shape, array.dtype查看数组结构和类型

创建数组

a = np.array([1,2,3],dtype=float)   
b = np.zeros(a.shape)  
c = np.zeros( [len(lev),len(behv),len(months),len(year)],dtype=float )
x = np.arange(6).reshape(2,3) + 10
# 得到数组 array([[10, 11, 12],[13, 14, 15]])


ind = np.argwhere(x>11)
# 得到一个二维数组，表示满足条件的位置坐标
# array([[0, 2],[1, 0],[1, 1],[1, 2]])
# 此时 x[ind[:,0],ind[:,1]] 为 array([12, 13, 14, 15])


ind = np.unravel_index(np.argmax(x, axis=None), x.shape)
# 得到的ind 为（1，2), x[ind] = 15
# 若直接 np.argmax(x, axis=None)，得到的结果是5
# 若 np.argmax(x, axis=0),得到的结果 array([1, 1, 1])

数据类型转换

numpy数组中的所有元素都必须是同一个类型，无法像pandas.dataframe那样可以一列是字符串，一列是bool值，一列是浮点数。若提取有多种数据类型的pandas.dataframe.values为numpy.array时，数据类型会变为 Object。

dat[:,0].tolist() # 转换为列表
data[:,1:].astype(np.float16) # 将Object类型转换为浮点数类型

数学运算

两个不同维度的数组（a/b）相除，只有当b的维数与a的最后几个维数相等时才可以相除

a=np.arange(24).reshape((4, 3, 2))
b=np.arange(6).reshape((3, 2))
print(a/b) 
#array([[[ nan,  1. ],[ 1. ,  1. ],[ 1. ,  1. ]],
#       [[ inf,  7. ],[ 4. ,  3. ],[ 2.5,  2.2]],
#       [[ inf, 13. ],[ 7. ,  5. ],[ 4. ,  3.4]],
#       [[ inf, 19. ],[10. ,  7. ],[ 5.5,  4.6]]])
#

b=np.array([1,2,3,4])
c=a/b # 报错
# ValueError: operands could not be broadcast together with shapes (4,3,2) (4,)

# 此时可以将a中的第一维度移动到最后，这样就可以和b相除
c = np.moveaxis(a,0,-1)/b # c.shape 等于 (3, 2, 4)
d = np.moveaxis(c,-1,0)   # d.shape 等于 (4, 3, 2)

# 也可以将b变成和a一样维数的数组，然后再相除，得到的结果一样
d = a/np.broadcast_to(b.reshape(4, 1, 1), a.shape)

逻辑运算

a=np.array([tim3.time.dt.year.isin(1990),tim3.time.dt.month.isin(2)])
print(a)
print(a.all(axis=0))
# output:
#[[ True  True False]
# [False  True False]]
#
#[False  True False]


new = np.where(a<5, a, 10*a)
# 原数组为 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# 新数组为 array([0, 1, 2, 3, 4, 50, 60, 70, 80, 90])])

a = xr.DataArray(np.arange(25).reshape(5, 5), dims=("x", "y"))
new = xr.where(a<5, a, 10*a) # 用法通np.where

# 但 xarray.DataArray.where 的用法和 xr.where 不同
a.where(a.x + a.y < 5, -1) # 将不满足条件的值设为-1，若不给定，则会设为缺测
a.where(a.x + a.y < 5, drop=True) # 将不满足条件的值设为缺测，并删去都是缺测的列或行
# a.x + a.y < 5返回逻辑数组，因此也可以用 a.x.isin([2,3]) 代替

# numpy 也有 isin的用法，但格式和 xarray 不一样
b = np.array([range(5,15,1),range(0,10,1)])
mask = np.isin(b, range(5,10))
# mask是一个shape同b的逻辑数组，
# b[mask]可以得到 array([5, 6, 7, 8, 9, 5, 6, 7, 8, 9])
mask_invert = np.isin(b, range(5,10), invert=True)
# mask_invert是mask的逻辑反面，即mask为True的地方，mask_invert为False

np.isnan(matrix).any() # 判断数组中是否存在缺测

文本文件操作

numpy.savetxt(fname,X,fmt='%.18e')
# 第一个参数为文件名，第二个参数为需要存的数组（一维或者二维）
# 数组的第一个维度变为行，第二个维度变为列
# 这个写入功能用得比较少，感觉没有python自带的写入功能好用
# 详情可参考 https://renqlsysu.github.io/2021/10/04/python_file/

data = numpy.loadtxt(fname,dtype=<class 'float'>, 
	comments='#', delimiter=None, usecols = (0,1))
# 将数据读出为array类型，若文件有ny行，nx列，得到的数组维度是(ny,nx)
# 可以用skiprows跳过前几行，用usecols指定读取的列（从0列开始）

Read All

Matplotlib colormap

2022-06-23

renql

python

plt

已定义色标
自定义色标
Colormap Normalization

已定义色标

Matplotlib有很多自带的色标，但感觉都比较丑，作为用惯ncl的人，还是觉得ncl的色标比较舒适，因此网上有大神写了一个ncl色标的python库cmaps。

from matplotlib import cm
import cmaps

ncmap  = cm.viridis # <matplotlib.colors.ListedColormap object at 0x2b6b1578b550>
ncmap  = cm.get_cmap('viridis',17) # 从viridis中取出17个颜色，此时ncmap.N是17
ncolor = cm.viridis.N #会现实该色标的颜色数量，256

ncmap = cmaps.precip3_16lev # <cmaps.colormap.Colormap object at 0x2b6b0abdceb0>，
# 也可以用ncmap.N统计其颜色数量，该色标有17个颜色
# cmaps.BlueDarkRed18.N的颜色有18个颜色

#可以通过以下方法获得colormap的RGBA值，一个Nx4的array数组，N是该colormap的颜色数量，4代表各个颜色的RGBA值
data = ncmap.colors 
data = ncmap(range(8)) #取前8个颜色
data = ncmap(np.linspace(0, 1, 8)) #显示线性插值的8个颜色

自定义色标

from matplotlib.colors import ListedColormap
ncmap = ListedColormap(["blue","green","red"])


# 合并两个色标
newcolors = np.vstack((cm.Oranges_r(np.linspace(0, 1, 128)),
                       cm.Blues(np.linspace(0, 1, 128))))
newcmp = ListedColormap(newcolors, name='OrangeBlue')


# 将已定义的色标转置
ncmap = ListedColormap(cmaps.precip3_16lev(range(0,17,1))[::-1])


# 获得较为稀疏的已定义坐标
ncmap  = cm.get_cmap('viridis',17) # matplotlib自带的colormap可以这样操作
ncmap = ListedColormap(cmaps.GMT_globe(range(0,209,3)))


# 替换已定义色标中的某一颜色
newcolors = cm.viridis(np.linspace(0, 1, 256))
pink = np.array([248/256, 24/256, 148/256, 1])
newcolors[:25, :] = pink
newcmp = ListedColormap(newcolors)

Colormap Normalization

# 将值线性分布到colormap上
pcm = ax.pcolormesh(x, y, Z, vmin=-1., vmax=1., cmap='RdBu_r')


# 将值对数分布到colormap上
pcm = ax[0].pcolor(X, Y, Z,
                   norm=colors.LogNorm(vmin=Z.min(), vmax=Z.max()),
                   cmap=cmaps.BlueDarkRed18, shading='auto')


# 将值线性分布到colormap上，同时vcenter的值将位于colormap中间
pc = ax2.pcolormesh(Z, norm=colors.CenteredNorm(vcenter=0), cmap=cm.coolwarm)


# 定义离散的colormap值
# 一般我会将ncolors设为len(bounds)+1,但看这个例子好像ncolors只要大于len(bounds)就可以
import matplotlib.colors as colors
bounds = np.array([-0.2, -0.1, 0, 0.5, 1])
norm = colors.BoundaryNorm(boundaries=bounds, ncolors=cm.viridis.N, extend='both')
pcm = ax[2].pcolormesh(X, Y, Z, norm=norm, cmap=cm.viridis)


# 使值在colormap中间两侧的值的变化率不同
# make a colormap that has land and ocean clearly delineated and of the
# same length (256 + 256)
colors_undersea = plt.cm.terrain(np.linspace(0, 0.17, 256))
colors_land = plt.cm.terrain(np.linspace(0.25, 1, 256))
all_colors = np.vstack((colors_undersea, colors_land))
terrain_map = colors.LinearSegmentedColormap.from_list(
    'terrain_map', all_colors)

# make the norm:  Note the center is offset so that the land has more
# dynamic range:
divnorm = colors.TwoSlopeNorm(vmin=-500., vcenter=0, vmax=4000)

pcm = ax.pcolormesh(longitude, latitude, topo, rasterized=True, norm=divnorm,
                    cmap=terrain_map, shading='auto')

cb = fig.colorbar(pcm, shrink=0.6)
cb.set_ticks([-500, 0, 1000, 2000, 3000, 4000])

参考：

https://matplotlib.org/stable/tutorials/colors/colormaps.html#overview
https://matplotlib.org/stable/tutorials/colors/colormap-manipulation.html#sphx-glr-tutorials-colors-colormap-manipulation-py
https://matplotlib.org/stable/tutorials/colors/colormapnorms.html
https://matplotlib.org/stable/tutorials/colors/colorbar_only.html

Read All

1/23

Principle

Welcome to RQL's Blog!

linux任务投递系统

sge任务调度系统

sbatch任务投递系统

根据行政区划mask数据

ncl

python

参考

python中的nan和inf

运算函数

设置缺测 masking

reference count问题

数组矩阵

创建数组

数据类型转换

数学运算

逻辑运算

文本文件操作

Matplotlib colormap

已定义色标

自定义色标

Colormap Normalization