Tool for fingerprinting HTTP requests of malware. Based on Tshark and written in Python 3. Working prototype stage 🙂
Its main objective is to provide unique representations (fingerprints) of malware requests, which help in their identification. Unique means here that each fingerprint should be seen in only one particular malware family, yet one family can have multiple fingerprints. Hfinger represents the request in a shorter form than printing the whole request, but one that is still human-interpretable.
Hfinger can be used in manual malware analysis but also in sandbox systems or SIEMs. The generated fingerprints are useful for grouping requests, pinpointing requests to particular malware families, identifying different operations of one family, or discovering unknown malicious requests that were omitted by other security systems but share a fingerprint.
An academic paper accompanies work on this tool, describing, for example, the motivation of design choices and the evaluation of the tool compared to p0f, FATT, and Mercury.
The idea
The basic assumption of this project is that HTTP requests of different malware families are more or less unique, so they can be fingerprinted to provide some sort of identification. Hfinger retains information about the structure and values of some headers to provide means for further analysis, for example, grouping of similar requests. At the moment, this is still a work in progress.
After analysis of malware's HTTP requests and headers, we have identified some parts of requests as being most distinctive. These include:
* Request method
* Protocol version
* Header order
* Popular headers' values
* Payload length, entropy, and presence of non-ASCII characters
Additionally, some standard features of the request URL were also considered. All these parts were translated into a set of features, described in detail here.
The above features are translated into a variable-length representation, which is the actual fingerprint. Depending on the report mode, different features are used to fingerprint requests. More information on these modes is presented below. The feature selection process will be described in the forthcoming academic paper.
Installation
Minimum requirements needed before installation:
* Python >= 3.3,
* Tshark >= 2.2.0.
Installation available from PyPI:
pip install hfinger
Hfinger has been tested on Xubuntu 22.04 LTS with the tshark package in version 3.6.2, but should work with older versions like 2.6.10 on Xubuntu 18.04 or 3.2.3 on Xubuntu 20.04.
Please note that, as with any PoC, you should run Hfinger in a separated environment, at least in a Python virtual environment. Its setup is not covered here, but you can try this tutorial.
Usage
After installation, you can call the tool directly from a command line with hfinger or as a Python module with python -m hfinger.
For example:
foo@bar:~$ hfinger -f /tmp/test.pcap
[1]
Help can be displayed with the short -h or long --help switches:
usage: hfinger [-h] (-f FILE | -d DIR) [-o output_path] [-m {0,1,2,3,4}] [-v]
               [-l LOGFILE]

Hfinger - fingerprinting malware HTTP requests stored in pcap files

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Read a single pcap file
  -d DIR, --directory DIR
                        Read pcap files from the directory DIR
  -o output_path, --output-path output_path
                        Path to the output directory
  -m {0,1,2,3,4}, --mode {0,1,2,3,4}
                        Fingerprint report mode.
                        0 - similar number of collisions and fingerprints as mode 2, but using fewer features,
                        1 - representation of all designed features, but a little more collisions than modes 0, 2, and 4,
                        2 - optimal (the default mode),
                        3 - the lowest number of generated fingerprints, but the highest number of collisions,
                        4 - the highest fingerprint entropy, but slightly more fingerprints than modes 0-2
  -v, --verbose         Report information about non-standard values in the request
                        (e.g., non-ASCII characters, no CRLF tags, values not present in the configuration list).
                        Without --logfile (-l) will print to the standard error.
  -l LOGFILE, --logfile LOGFILE
                        Output logfile in the verbose mode. Implies -v or --verbose switch.
You must provide a path to a pcap file (-f), or a directory (-d) with pcap files. The output is in JSON format. It will be printed to standard output or saved to the provided directory (-o) using the name of the source file. For example, output of the command:
hfinger -f example.pcap -o /tmp/pcap
will be saved to:
/tmp/pcap/example.pcap.json
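The naming rule in the example can be written down as a small helper. This is only a sketch of the rule as described here, not code taken from Hfinger: the function name `expected_output_path` is hypothetical, and it assumes that only the file name of the source pcap is kept and that `.json` is appended to it.

```python
from pathlib import PurePosixPath


def expected_output_path(pcap_path, output_dir):
    """Return the report path implied by the naming rule described above."""
    # Assumption: only the pcap file name is reused, with ".json" appended.
    pcap_name = PurePosixPath(pcap_path).name
    return PurePosixPath(output_dir) / (pcap_name + ".json")


print(expected_output_path("example.pcap", "/tmp/pcap"))
```

Such a helper may be handy when a script needs to locate the report after calling the command-line tool with -o.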
Report mode -m/--mode can be used to change the default report mode by providing an integer in the range 0-4. The modes differ in the represented request features or in rounding modes. The default mode (2) was chosen by us to represent all features that are usually used during requests' analysis, while also offering a low number of collisions and generated fingerprints. With other modes, you can achieve different goals. For example, in mode 3 you get a lower number of generated fingerprints but a higher chance of a collision between malware families. If you are unsure, you don't have to change anything. More information on report modes is given in the "Report modes" section.
Beginning with version 0.2.1 Hfinger is less verbose. You should use -v/--verbose if you want to receive information about encountered non-standard values of headers, non-ASCII characters in the non-payload part of the request, lack of CRLF tags (\r\n\r\n), and other problems with analyzed requests that are not application errors. When any such issues are encountered in the verbose mode, they will be printed to the standard error output. You can also save the log to a defined location using the -l/--logfile switch (it implies -v/--verbose). The log data will be appended to the log file.
Using hfinger in a Python application
Beginning with version 0.2.0, Hfinger supports importing to other Python applications. To use it in your app, simply import the hfinger_analyze function from hfinger.analysis and call it with a path to the pcap file and the reporting mode. The returned result is a list of dicts with fingerprinting results.
For example:
from hfinger.analysis import hfinger_analyze

pcap_path = "SPECIFY_PCAP_PATH_HERE"
reporting_mode = 4
print(hfinger_analyze(pcap_path, reporting_mode))
Beginning with version 0.2.1 Hfinger uses the logging module for logging information about encountered non-standard values of headers, non-ASCII characters in the non-payload part of the request, lack of CRLF tags (\r\n\r\n), and other problems with analyzed requests that are not application errors. Hfinger creates its own logger using the name hfinger, but without prior configuration the log information is in practice discarded. If you want to receive this log information, before calling hfinger_analyze you should configure the hfinger logger, set the log level to logging.INFO, configure a log handler according to your needs, and add it to the logger. More information is available in the hfinger_analyze function docstring.
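The logger setup described above could look roughly like the sketch below. It uses only the standard logging module and the logger name hfinger mentioned in the text. The helper name `configure_hfinger_logging`, the message format, and the in-memory stream used for the demonstration are illustrative assumptions.

```python
import io
import logging
import sys


def configure_hfinger_logging(stream=sys.stderr, level=logging.INFO):
    """Attach a stream handler to the "hfinger" logger (illustrative helper)."""
    logger = logging.getLogger("hfinger")
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    # The format string is an arbitrary choice for this sketch.
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
    logger.addHandler(handler)
    return logger


# Demonstration with an in-memory stream; a real application would more
# likely log to sys.stderr or a file before calling hfinger_analyze.
buffer = io.StringIO()
logger = configure_hfinger_logging(buffer)
logger.info("handler attached")
print(buffer.getvalue(), end="")
```

The configuration has to happen before hfinger_analyze is called, so that messages emitted during analysis reach the handler.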
Fingerprint creation
A fingerprint is based on features extracted from a request. Usage of particular features from the full list depends on the chosen report mode from a predefined list (more information is in the "Report modes" section). The figure below represents the creation of an exemplary fingerprint in the default report mode.
Three parts of the request are analyzed to extract information: URI, headers' structure (including method and protocol version), and payload. Particular features of the fingerprint are separated using | (pipe). The final fingerprint generated for the POST request from the example is:
2|3|1|php|0.6|PO|1|us-ag,ac,ac-en,ho,co,co-ty,co-le|us-ag:f452d7a9/ac:as-as/ac-en:id/co:Ke-Al/co-ty:te-pl|A|4|1.4
The creation of features is described below in the order of appearance in the fingerprint.
Firstly, URI features are extracted:
* URI length, represented as a base-10 logarithm of the length, rounded to an integer (in the example the URI is 43 characters long, so log10(43)≈2),
* number of directories (in the example there are 3 directories),
* average directory length, represented as a base-10 logarithm of the actual average length of the directory, rounded to an integer (in the example there are three directories with a total length of 20 characters (6+6+8), so log10(20/3)≈1),
* extension of the requested file, but only if it is on the list of known extensions in hfinger/configs/extensions.txt,
* average value length, represented as a base-10 logarithm of the actual average value length, rounded to one decimal point (in the example two values have the same length of 4 characters, so the average is 4 characters, and log10(4)≈0.6).
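The arithmetic behind these URI features can be reproduced with a few lines of Python. This is a minimal sketch, not Hfinger's implementation. The sample URI is made up to match the numbers in the example (43 characters, directories of 6, 6, and 8 characters, two 4-character values). The extension set stands in for hfinger/configs/extensions.txt. Counting directory lengths without slashes and leaving a field blank when a feature is absent are assumptions.

```python
import math
from urllib.parse import parse_qsl, urlsplit

# Hypothetical stand-in for the list in hfinger/configs/extensions.txt.
KNOWN_EXTENSIONS = {"php", "html", "js"}


def _log10(value, ndigits=None):
    """Rounded base-10 logarithm; blank for non-positive input (assumption)."""
    return round(math.log10(value), ndigits) if value > 0 else ""


def _mean(lengths):
    return sum(lengths) / len(lengths) if lengths else 0


def uri_features(uri):
    """Build the URI part of a fingerprint in the default-mode layout."""
    parts = urlsplit(uri)
    directory_part, _, filename = parts.path.rpartition("/")
    directories = [d for d in directory_part.split("/") if d]
    values = [v for _, v in parse_qsl(parts.query, keep_blank_values=True)]

    extension = filename.rpartition(".")[2] if "." in filename else ""
    if extension not in KNOWN_EXTENSIONS:
        extension = ""

    fields = [
        _log10(len(uri)),                        # URI length, integer
        len(directories),                        # number of directories
        _log10(_mean([len(d) for d in directories])),  # avg directory length
        extension,                               # known extension or blank
        _log10(_mean([len(v) for v in values]), 1),    # avg value length
    ]
    return "|".join(str(f) for f in fields)


sample_uri = "/abcdef/ghijkl/mnopqrst/t.php?x=1234&y=5678"
print(uri_features(sample_uri))
```

For the sample URI this yields the same first five fields as the example fingerprint.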
Secondly, header structure features are analyzed:
* request method, encoded as the first two letters of the method (PO),
* protocol version, encoded as an integer (1 for version 1.1, 0 for version 1.0, and 9 for version 0.9),
* order of the headers,
* and popular headers and their values.
To represent the order of the headers in the request, each header's name is encoded according to the schema in hfinger/configs/headerslow.json; for example, the User-Agent header is encoded as us-ag. Encoded names are separated by a comma (,). If the header name does not start with an upper case letter (or any of its parts does not, when analyzing compound headers such as Accept-Encoding), then the encoded representation is prefixed with !. If the header name is not on the list of the known headers, it is hashed using the FNV1a hash, and the hash is used as the encoding.
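The header-order encoding can be sketched as follows. The mapping is a small hypothetical subset inferred from the example fingerprint, not the contents of hfinger/configs/headerslow.json. Three details are assumptions: the 32-bit FNV-1a variant (suggested by the 8-hex-digit hash in the example), hashing the header name exactly as sent, and applying the ! prefix only to known headers.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> str:
    """32-bit FNV-1a hash as 8 lowercase hex digits."""
    digest = FNV32_OFFSET_BASIS
    for byte in data:
        digest ^= byte
        digest = (digest * FNV32_PRIME) & 0xFFFFFFFF
    return format(digest, "08x")


# Hypothetical subset of the name schema, inferred from the example above.
HEADER_CODES = {
    "user-agent": "us-ag",
    "accept": "ac",
    "accept-encoding": "ac-en",
    "host": "ho",
    "connection": "co",
    "content-type": "co-ty",
    "content-length": "co-le",
}


def encode_header_name(name: str) -> str:
    code = HEADER_CODES.get(name.lower())
    if code is None:
        # Unknown header: the hash replaces the code (input form is assumed).
        return fnv1a_32(name.encode("utf-8"))
    # "!" marks names where any dash-separated part lacks a leading capital.
    if not all(part[:1].isupper() for part in name.split("-")):
        code = "!" + code
    return code


def encode_header_order(names) -> str:
    return ",".join(encode_header_name(name) for name in names)


print(encode_header_order(["User-Agent", "Accept", "accept-encoding", "Host"]))
```

Because the ! prefix preserves capitalization differences, two requests with the same headers in the same order can still produce different order features.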
When analyzing popular headers, the request is checked for their presence. These headers are:
* Connection
* Accept-Encoding
* Content-Encoding
* Cache-Control
* TE
* Accept-Charset
* Content-Type
* Accept
* Accept-Language
* User-Agent
When the header is found in the request, its value is checked against a table of typical values to create pairs of header_name_representation:value_representation. The name of the header is encoded according to the schema in hfinger/configs/headerslow.json (as presented before), and the value is encoded according to the schema stored in the hfinger/configs directory or in the configs.py file, depending on the header. In the above example Accept is encoded as ac and its value */* as as-as (asterisk-asterisk), giving ac:as-as. The pairs are inserted into the fingerprint in order of appearance in the request and are delimited using /. If the header value cannot be found in the encoding table, it is hashed using the FNV1a hash.
If the header value is composed of multiple values, they are tokenized to provide a list of values delimited with a comma (,); for example, Accept: */*, text/* would give ac:as-as,te-as. However, at this point of development, if the header value contains a "quality value" tag (q=), then the whole value is encoded with its FNV1a hash. Finally, values of the User-Agent and Accept-Language headers are directly encoded using their FNV1a hashes.
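The value-encoding rules can be sketched in the same way. Everything table-like here is hypothetical: the name and value codes are taken from the examples in the text (ac-la is only a guess based on the naming pattern), and a single shared value table replaces the per-header tables Hfinger keeps in its configuration. The 32-bit FNV-1a variant and hashing of individual unknown tokens are assumptions, and the ! prefix for header names is left out for brevity.

```python
def fnv1a_32(data: bytes) -> str:
    """32-bit FNV-1a hash as 8 lowercase hex digits (variant is assumed)."""
    digest = 0x811C9DC5
    for byte in data:
        digest ^= byte
        digest = (digest * 0x01000193) & 0xFFFFFFFF
    return format(digest, "08x")


# Hypothetical tables built from the examples in the text.
NAME_CODES = {
    "accept": "ac",
    "accept-encoding": "ac-en",
    "accept-language": "ac-la",  # guessed from the naming pattern
    "connection": "co",
    "content-type": "co-ty",
    "user-agent": "us-ag",
}
VALUE_CODES = {
    "*/*": "as-as",
    "text/*": "te-as",
    "identity": "id",
    "Keep-Alive": "Ke-Al",
    "text/plain": "te-pl",
}
ALWAYS_HASHED = {"user-agent", "accept-language"}


def encode_popular_headers(headers) -> str:
    """Encode (name, value) pairs in request order, delimited with "/"."""
    pairs = []
    for name, value in headers:
        key = name.lower()
        if key not in NAME_CODES:
            continue  # not a popular header in this sketch
        if key in ALWAYS_HASHED or "q=" in value:
            # The whole value is hashed, as described in the text.
            encoded = fnv1a_32(value.encode("utf-8"))
        else:
            tokens = [token.strip() for token in value.split(",")]
            encoded = ",".join(
                VALUE_CODES.get(token, fnv1a_32(token.encode("utf-8")))
                for token in tokens
            )
        pairs.append(f"{NAME_CODES[key]}:{encoded}")
    return "/".join(pairs)


print(encode_popular_headers([("Accept", "*/*, text/*"), ("Connection", "Keep-Alive")]))
```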
Finally, the payload features are:
* presence of non-ASCII characters, represented with the letter N, and with A otherwise,
* payload's Shannon entropy, rounded to an integer,
* and payload length, represented as a base-10 logarithm of the actual payload length, rounded to one decimal point.
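The payload features can be illustrated with a short sketch. It assumes the entropy is measured in bits per byte and that the rounding follows the default-mode description above. How Hfinger represents a request without a payload is not covered here, so the hypothetical `payload_features` helper simply rejects empty input.

```python
import math
from collections import Counter


def payload_features(payload: bytes) -> str:
    """Return "<A|N>|<entropy>|<log10 length>" for a non-empty payload."""
    if not payload:
        raise ValueError("this sketch only handles non-empty payloads")

    # "N" if any byte falls outside the 7-bit ASCII range, "A" otherwise.
    ascii_flag = "N" if any(byte > 127 for byte in payload) else "A"

    # Shannon entropy over byte frequencies, in bits (assumed unit).
    total = len(payload)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(payload).values()
    )

    return f"{ascii_flag}|{round(entropy)}|{round(math.log10(total), 1)}"


# 16 distinct bytes, each used twice: entropy is exactly 4 bits, length 32.
sample_payload = b"abcdefghijklmnop" * 2
print(payload_features(sample_payload))
```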
Report modes
Hfinger operates in five report modes, which differ in the features represented in the fingerprint, and thus in the information extracted from requests. These are (with the number used in the tool configuration):
* mode 0 – producing a similar number of collisions and fingerprints as mode 2, but using fewer features,
* mode 1 – representing all designed features, but producing a little more collisions than modes 0, 2, and 4,
* mode 2 – optimal (the default mode), representing all features that are usually used during requests' analysis, but also offering a low number of collisions and generated fingerprints,
* mode 3 – producing the lowest number of generated fingerprints of all modes, but with the highest number of collisions,
* mode 4 – offering the highest fingerprint entropy, but also producing slightly more fingerprints than modes 0–2.
The modes were chosen in order to optimize Hfinger's capabilities to uniquely identify malware families versus the number of generated fingerprints. Modes 0, 2, and 4 offer a similar number of collisions between malware families; however, mode 4 generates a little more fingerprints than the other two. Mode 2 represents more request features than mode 0 with a comparable number of generated fingerprints and collisions. Mode 1 is the only one representing all designed features, but it almost doubles the number of collisions compared to modes 0, 2, and 4. Mode 3 produces at least two times fewer fingerprints than the other modes, but it introduces about nine times more collisions. A description of all designed features is here.
The modes consist of the following features (in the order of appearance in the fingerprint):
* mode 0:
  * number of directories,
  * average directory length represented as an integer,
  * extension of the requested file,
  * average value length represented as a float,
  * order of headers,
  * popular headers and their values,
  * payload length represented as a float.
* mode 1:
  * URI length represented as an integer,
  * number of directories,
  * average directory length represented as an integer,
  * extension of the requested file,
  * variable length represented as an integer,
  * number of variables,
  * average value length represented as an integer,
  * request method,
  * version of protocol,
  * order of headers,
  * popular headers and their values,
  * presence of non-ASCII characters,
  * payload entropy represented as an integer,
  * payload length represented as an integer.
* mode 2:
  * URI length represented as an integer,
  * number of directories,
  * average directory length represented as an integer,
  * extension of the requested file,
  * average value length represented as a float,
  * request method,
  * version of protocol,
  * order of headers,
  * popular headers and their values,
  * presence of non-ASCII characters,
  * payload entropy represented as an integer,
  * payload length represented as a float.
* mode 3:
  * URI length represented as an integer,
  * average directory length represented as an integer,
  * extension of the requested file,
  * average value length represented as an integer,
  * order of headers.
* mode 4:
  * URI length represented as a float,
  * number of directories,
  * average directory length represented as a float,
  * extension of the requested file,
  * variable length represented as a float,
  * average value length represented as a float,
  * request method,
  * version of protocol,
  * order of headers,
  * popular headers and their values,
  * presence of non-ASCII characters,
  * payload entropy represented as a float,
  * payload length represented as a float.
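Since the default mode always lists its features in a fixed order, a fingerprint can be split back into named fields for inspection. The sketch below does this for mode 2 only. The field names and the `split_mode2_fingerprint` helper are hypothetical labels chosen for readability, not identifiers used by Hfinger, and the sketch assumes that no field value contains a pipe character.

```python
# Hypothetical field labels following the mode 2 feature order listed above.
MODE_2_FIELDS = (
    "uri_length",
    "directory_count",
    "avg_directory_length",
    "extension",
    "avg_value_length",
    "method",
    "protocol_version",
    "header_order",
    "popular_headers",
    "non_ascii",
    "payload_entropy",
    "payload_length",
)


def split_mode2_fingerprint(fingerprint: str) -> dict:
    """Map the pipe-separated parts of a mode 2 fingerprint to field labels."""
    parts = fingerprint.split("|")
    if len(parts) != len(MODE_2_FIELDS):
        raise ValueError(
            f"expected {len(MODE_2_FIELDS)} fields, got {len(parts)}"
        )
    return dict(zip(MODE_2_FIELDS, parts))


example = (
    "2|3|1|php|0.6|PO|1|us-ag,ac,ac-en,ho,co,co-ty,co-le|"
    "us-ag:f452d7a9/ac:as-as/ac-en:id/co:Ke-Al/co-ty:te-pl|A|4|1.4"
)
for label, value in split_mode2_fingerprint(example).items():
    print(f"{label}: {value}")
```

Other modes would need their own field tuples, because they omit or add features.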