UNSW-NB15 Dataset Analysis

UNSW-NB15 was created by the Australian Centre for Cyber Security (ACCS).


I. Feature Description

The features are described in the UNSW-NB15_features.csv file.

| No. | Name | Type | Description |
| --- | --- | --- | --- |
| 1 | srcip | nominal | Source IP address |
| 2 | sport | integer | Source port number |
| 3 | dstip | nominal | Destination IP address |
| 4 | dsport | integer | Destination port number |
| 5 | proto | nominal | Transaction protocol |
| 6 | state | nominal | Indicates the state and its dependent protocol, e.g. ACC, CLO, CON, ECO, ECR, FIN, INT, MAS, PAR, REQ, RST, TST, TXD, URH, URN, and (-) if the state is not used |
| 7 | dur | float | Record total duration |
| 8 | sbytes | integer | Source to destination transaction bytes |
| 9 | dbytes | integer | Destination to source transaction bytes |
| 10 | sttl | integer | Source to destination time to live value |
| 11 | dttl | integer | Destination to source time to live value |
| 12 | sloss | integer | Source packets retransmitted or dropped |
| 13 | dloss | integer | Destination packets retransmitted or dropped |
| 14 | service | nominal | http, ftp, smtp, ssh, dns, ftp-data, irc, and (-) if not a commonly used service |
| 15 | Sload | float | Source bits per second |
| 16 | Dload | float | Destination bits per second |
| 17 | Spkts | integer | Source to destination packet count |
| 18 | Dpkts | integer | Destination to source packet count |
| 19 | swin | integer | Source TCP window advertisement value |
| 20 | dwin | integer | Destination TCP window advertisement value |
| 21 | stcpb | integer | Source TCP base sequence number |
| 22 | dtcpb | integer | Destination TCP base sequence number |
| 23 | smeansz | integer | Mean of the flow packet size transmitted by the src |
| 24 | dmeansz | integer | Mean of the flow packet size transmitted by the dst |
| 25 | trans_depth | integer | Pipelined depth into the connection of the HTTP request/response transaction |
| 26 | res_bdy_len | integer | Actual uncompressed content size of the data transferred from the server's HTTP service |
| 27 | Sjit | float | Source jitter (ms) |
| 28 | Djit | float | Destination jitter (ms) |
| 29 | Stime | timestamp | Record start time |
| 30 | Ltime | timestamp | Record last time |
| 31 | Sintpkt | float | Source interpacket arrival time (ms) |
| 32 | Dintpkt | float | Destination interpacket arrival time (ms) |
| 33 | tcprtt | float | TCP connection setup round-trip time, the sum of synack and ackdat |
| 34 | synack | float | TCP connection setup time, the time between the SYN and the SYN_ACK packets |
| 35 | ackdat | float | TCP connection setup time, the time between the SYN_ACK and the ACK packets |
| 36 | is_sm_ips_ports | binary | If the source (1) and destination (3) IP addresses are equal and the port numbers (2) (4) are equal, this variable takes the value 1, else 0 |
| 37 | ct_state_ttl | integer | No. for each state (6) according to a specific range of values for source/destination time to live (10) (11) |
| 38 | ct_flw_http_mthd | integer | No. of flows that have methods such as GET and POST in the HTTP service |
| 39 | is_ftp_login | binary | If the FTP session is accessed by user and password then 1, else 0 |
| 40 | ct_ftp_cmd | integer | No. of flows that have a command in an FTP session |
| 41 | ct_srv_src | integer | No. of connections that contain the same service (14) and source address (1) in 100 connections according to the last time (30) |
| 42 | ct_srv_dst | integer | No. of connections that contain the same service (14) and destination address (3) in 100 connections according to the last time (30) |
| 43 | ct_dst_ltm | integer | No. of connections of the same destination address (3) in 100 connections according to the last time (30) |
| 44 | ct_src_ltm | integer | No. of connections of the same source address (1) in 100 connections according to the last time (30) |
| 45 | ct_src_dport_ltm | integer | No. of connections of the same source address (1) and destination port (4) in 100 connections according to the last time (30) |
| 46 | ct_dst_sport_ltm | integer | No. of connections of the same destination address (3) and source port (2) in 100 connections according to the last time (30) |
| 47 | ct_dst_src_ltm | integer | No. of connections of the same source (1) and destination (3) address in 100 connections according to the last time (30) |
| 48 | attack_cat | nominal | The name of each attack category. This data set has nine categories: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms |
| 49 | Label | binary | 0 for normal and 1 for attack records |
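
Two of the features above are defined purely in terms of other columns: tcprtt (33) is the sum of synack (34) and ackdat (35), and is_sm_ips_ports (36) compares the address and port pairs. The following is a minimal sketch of those two definitions using pandas. The `flows` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy flow records; every value here is invented for illustration only.
flows = pd.DataFrame({
    "srcip": ["10.0.0.1", "10.0.0.2"],
    "sport": [1234, 80],
    "dstip": ["10.0.0.9", "10.0.0.2"],
    "dsport": [80, 80],
    "synack": [0.25, 0.0],
    "ackdat": [0.5, 0.0],
})

# Feature 33: tcprtt is the sum of synack (34) and ackdat (35).
flows["tcprtt"] = flows["synack"] + flows["ackdat"]

# Feature 36: 1 when the IP addresses are equal and the ports are equal, else 0.
same_ip = flows["srcip"] == flows["dstip"]
same_port = flows["sport"] == flows["dsport"]
flows["is_sm_ips_ports"] = (same_ip & same_port).astype(int)

print(flows[["tcprtt", "is_sm_ips_ports"]])
```

On real records, the same expressions could serve as a consistency check against the stored tcprtt and is_sm_ips_ports columns, ideally with a small tolerance for floating-point rounding.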

II. Train & Test

A partition from this dataset is configured as a training set and testing set, namely, [UNSW_NB15_training-set.csv](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/a part of training and testing set/UNSW_NB15_training-set.csv) and [UNSW_NB15_testing-set.csv](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/a part of training and testing set/UNSW_NB15_testing-set.csv) respectively.

The training set contains 175,341 records and the testing set contains 82,332 records, covering both attack and normal traffic. Figures 1 and 2 show the testbed configuration and the feature creation method of UNSW-NB15, respectively.

1 Feature Types

# df_train.columns
Index(['id', 'dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes',
'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss',
'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin',
'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth',
'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm',
'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm',
'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm',
'ct_srv_dst', 'is_sm_ips_ports', 'attack_cat', 'label'],
dtype='object')
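
Before modelling, it is common to separate the nominal columns (proto, service, state) from the numeric ones and to set aside the targets (attack_cat, label). Below is a minimal sketch using `select_dtypes`. The `toy` frame is a made-up stand-in for `df_train` with only a few of its columns.

```python
import pandas as pd

# Made-up stand-in for df_train with a handful of its columns.
toy = pd.DataFrame({
    "dur": [0.12, 0.65],
    "proto": ["tcp", "udp"],
    "service": ["http", "dns"],
    "state": ["FIN", "INT"],
    "sbytes": [258, 734],
    "attack_cat": ["Normal", "Generic"],
    "label": [0, 1],
})

# Set the two target columns aside before splitting by dtype.
features = toy.drop(columns=["attack_cat", "label"])

# Numeric columns (int/float) versus everything else (nominal strings).
numeric_cols = features.select_dtypes(include="number").columns.tolist()
nominal_cols = features.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("nominal:", nominal_cols)
```

Splitting by dtype rather than by a hard-coded list keeps the sketch short, but it relies on the nominal columns having been read as strings.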

2 Class Statistics

| ID | Type | Count | Train (after drop_duplicates) | Test (after drop_duplicates) |
| --- | --- | --- | --- | --- |
| 0 | Normal | 93000 | 56000 (51890) | 37000 (34206) |
| 1 | Generic | 58871 | 40000 (4181) | 18871 (3657) |
| 2 | Exploits | 44525 | 33393 (19844) | 11132 (7609) |
| 3 | Fuzzers | 24246 | 18184 (16150) | 6062 (4838) |
| 4 | DoS | 16353 | 12264 (3806) | 4089 (1718) |
| 5 | Reconnaissance | 13987 | 10491 (7522) | 3496 (2703) |
| 6 | Analysis | 2677 | 2000 (1594) | 677 (446) |
| 7 | Backdoor | 2329 | 1746 (1535) | 583 (346) |
| 8 | Shellcode | 1511 | 1133 (1091) | 378 (378) |
| 9 | Worms | 174 | 130 (127) | 44 (44) |
|  | Total | 257673 | 175341 | 82332 |
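
Per-class counts before and after removing duplicate rows can be computed with `value_counts` and `drop_duplicates`. The sketch below uses a hypothetical helper, `class_counts`, and a made-up `toy` frame. With the real data, the id column would presumably have to be dropped first, since unique ids would otherwise prevent any row from counting as a duplicate.

```python
import pandas as pd

def class_counts(df, target="attack_cat"):
    """Per-class row counts before and after dropping exact duplicate rows."""
    raw = df[target].value_counts()
    dedup = df.drop_duplicates()[target].value_counts()
    # The two Series are aligned on the class name when combined.
    out = pd.DataFrame({"count": raw, "after_drop_duplicates": dedup})
    return out.sort_values("count", ascending=False)

# Made-up records: two Normal rows and three DoS rows are exact repeats.
toy = pd.DataFrame({
    "dur": [0.1, 0.1, 0.2, 0.3, 0.3, 0.3],
    "attack_cat": ["Normal", "Normal", "Normal", "DoS", "DoS", "DoS"],
})

counts = class_counts(toy)
print(counts)
```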

3 Abnormal Values

# NaN, duplicated, and inf checks
# Assumes df_train and df_test are pandas DataFrames loaded earlier.
import numpy as np

print('#train')
print('----------Missing values-----------')
# Treat +/-inf as missing so that isnull() catches them too
df_train = df_train.replace([np.inf, -np.inf], np.nan)
# Number of columns with (True) and without (False) any missing value
print(df_train.isnull().any().value_counts())
print('----------Duplicates-----------')
# Number of rows that repeat (True) or do not repeat (False) an earlier row
print(df_train.duplicated().value_counts())


print('\n')

print('#test')
print('----------Missing values-----------')
df_test = df_test.replace([np.inf, -np.inf], np.nan)
print(df_test.isnull().any().value_counts())
print('----------Duplicates-----------')
print(df_test.duplicated().value_counts())

Output:

#train
----------Missing values-----------
False 43
dtype: int64
----------Duplicates-----------
False 107740
True 67601
dtype: int64


#test
----------Missing values-----------
False 43
dtype: int64
----------Duplicates-----------
False 55945
True 26387
dtype: int64