Signs of Triviality

Opinions, mostly my own, on the importance of being and other things.
[homepage]  [blog]  [jschauma@netmeister.org]  [@jschauma]  [RSS]

WHOIS: Fragile, unparseable, obsolete... and universally relied upon

January 9th, 2022

The WHOIS protocol is one of the older internet protocols around. It's infuriatingly simple and by and large considered obsolete; the data it provides is unpredictable, unreliable, and incomplete; and, of course, it's still one of the cornerstones of internet operations. In other words, it's the kind of thing I like to waste my time trying to understand.

Originally set up in the 1970s at the Stanford Research Institute Network Information Center (aka SRI-NIC) by the mother of the DNS and overall ARPANET boss Elizabeth J. Feinler, WHOIS was first described in RFC812 (1982). Based on the FINGER protocol, it was as dead simple as you could imagine:

Connect to the service host (SRI-NIC)
   TCP: service port 43 decimal
   NCP: ICP to socket 43 decimal, establishing two 8-bit connections

Send a single "command line", ending with <CRLF>.

Receive information in response to the command line. 

Yep, that was it. And that's still the full protocol specification (now RFC3912 (2004)). Here, give it a try:

$ telnet whois.iana.org 43
Trying 2620:0:2d0:200::59...
Connected to ianawhois.vip.icann.org.
Escape character is '^]'.
org
% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object

domain:       ORG
[...]

Congratulations - you just spoke WHOIS!
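The same exchange can be sketched in a few lines of Python. This is a minimal sketch rather than a full client: the function names are made up for illustration, and since the protocol does not define a character set, decoding as UTF-8 with replacement characters is an assumption.

```python
import socket


def build_request(query: str) -> bytes:
    """Encode a single WHOIS command line, terminated by CRLF."""
    return query.encode("utf-8") + b"\r\n"


def whois_query(server: str, query: str, port: int = 43, timeout: float = 10.0) -> str:
    """Send one query to a WHOIS server and return everything it sends back."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_request(query))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    # The protocol does not define a character set, so decode leniently.
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling whois_query("whois.iana.org", "org") should return roughly the same text as the telnet session above.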

The data you get back is intentionally not structured and is designed to be human-, not machine-readable (more on that a bit below). It was originally intended to provide contact information including "mailing address, telephone number, and network mailbox" "for ARPANET users" like so:

Command line: dyer
Response:
   Dyer, David A. (DAD2)   DDYER@USC-ISIB  (213) 822-1511
   Dyer, Fred S. (FSD)  Dyer@RADC-MULTICS  (315) 330-7275
   Dyer, Mary K. (MARY)   DYER@SRI-NIC     (415) 859-4775
   Dyer, William R. (WRD)   WRDyer@RADC-MULTICS  (315) 330-7791

Command line: mary
Response:
   Dyer, Mary K. (MARY)          DYER@SRI-NIC
   SRI International
   Network Information Center
   Telecommunications Sciences Center
   333 Ravenswood Avenue
   Menlo Park, California 94025
   Phone: (415) 859-4775

And you thought the DNS was the phonebook of the internet...

How to find the responsible WHOIS server

When the internet grew too large for SRI-NIC to continue functioning as the global phonebook, and eventually with the transfer of the operation of the DNS root to ICANN, WHOIS also became decentralized. Information about the various (and increasing number of) TLDs was provided logically by the Regional Internet Registries (RIRs), registries, and registrars. Some of them run a so-called "thick" server, which provides all the information; others are "thin" servers, only providing the information of the WHOIS server that does have the full information. Different TLDs, for example, may operate in either mode, but the protocol does not provide any means to differentiate the two. In other words: if you wanted to find out information about a domain, you'd have to know who the responsible registry is to ask them.

How do you know what WHOIS server to query for a given domain? Well, you just gotta know. There's no standardized way. Some domains use SRV DNS records as suggested in this internet draft:

$ host -t srv _nicname._tcp.co.uk
_nicname._tcp.co.uk has SRV record 0 0 43 whois.nic.uk.
$ host -t srv _nicname._tcp.arab
_nicname._tcp.arab has SRV record 10 10 0 your-dns-needs-immediate-attention.arab.
$ host -t srv _nicname._tcp.cpa
_nicname._tcp.cpa has SRV record 10 10 0 your-dns-needs-immediate-attention.cpa.
$ host -t srv _nicname._tcp.music
_nicname._tcp.music has SRV record 10 10 0 your-dns-needs-immediate-attention.music.
$ host -t srv _nicname._tcp.xn--fiqs8s
_nicname._tcp.中国 is an alias for wildcard.cnnic.cn.
$ host -t srv _nicname._tcp.xn--fiqz9s 
_nicname._tcp.中國 is an alias for wildcard.cnnic.cn.
$ host -t srv _nicname._tcp.xn--mxtq1m
_nicname._tcp.政府 has SRV record 10 10 0 your-dns-needs-immediate-attention.政府
$ host -t srv _nicname._tcp.xn--ngbrx
_nicname._tcp.عرب has SRV record 10 10 0 your-dns-needs-immediate-attention.عرب
$ 

...but that seems to function primarily as an indicator of a TLD compromise: out of 1489 TLDs, only nic.uk has a valid entry. Instead, some TLDs use <tld>.whois-servers.net, and the "new" TLDs after 2003 are supposed to have whois.nic.<tld>; ccTLDs pretty much all do their own thing, why not. Hence, your whois(1) client likely contains some optimistic logic and a number of hardcoded RIR WHOIS servers like this:

#define	ANICHOST	"whois.arin.net"
#define	BNICHOST	"whois.registro.br"
#define	CNICHOST	"whois.corenic.net"
#define	DNICHOST	"whois.nic.mil"
#define	FNICHOST	"whois.afrinic.net"
#define	GNICHOST	"whois.nic.gov"
#define	IANAHOST	"whois.iana.org"
#define	INICHOST	"whois.networksolutions.com"
#define	LNICHOST	"whois.lacnic.net"
#define	MNICHOST	"whois.ra.net"
#define	NICHOST		"whois.crsnic.net"
#define	PDBHOST		"whois.peeringdb.com"
#define	PNICHOST	"whois.apnic.net"
#define	QNICHOST_TAIL	".whois-servers.net"
#define	RNICHOST	"whois.ripe.net"
#define	RUNICHOST	"whois.ripn.net"
[...]
/*
 * If no country is specified determine the top level domain from the query
 * If the TLD is a number, query ARIN, otherwise, use  TLD.whois-server.net.
 * If the domain does not contain '.', check to see if it is an NSI handle
 * (starts with '!') or a CORE handle (COCO-[0-9]+ or COHO-[0-9]+) or an
 * ASN (starts with AS) or IPv6 address (contains ':'). Fall back to
 * NICHOST for the non-handle and non-IPv6 case.
 */
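
Read literally, that comment translates into something like the following Python sketch. The host names are copied from the #define list above. Which host each handle type, AS number, or IPv6 address is sent to is an assumption based on the macro names, not something the comment states, and choose_server is a made-up name, not the client's actual function.

```python
import re

# Host names copied from the #define list above.
ANICHOST = "whois.arin.net"
CNICHOST = "whois.corenic.net"
INICHOST = "whois.networksolutions.com"
NICHOST = "whois.crsnic.net"
QNICHOST_TAIL = ".whois-servers.net"

CORE_HANDLE = re.compile(r"^CO[CH]O-[0-9]+$", re.IGNORECASE)
AS_NUMBER = re.compile(r"^AS[0-9]+$", re.IGNORECASE)


def choose_server(query: str) -> str:
    """Guess a WHOIS server for a query, loosely following the comment above."""
    query = query.rstrip(".")
    if "." in query and ":" not in query:
        tld = query.rsplit(".", 1)[1]
        if tld.isdigit():  # last label is a number, i.e. an IPv4 address
            return ANICHOST
        return tld.lower() + QNICHOST_TAIL
    # No usable TLD: handles, AS numbers, IPv6 addresses, or a bare word.
    if query.startswith("!"):  # NSI handle (assumed to go to INICHOST)
        return INICHOST
    if CORE_HANDLE.match(query):  # CORE handle (assumed to go to CNICHOST)
        return CNICHOST
    if ":" in query or AS_NUMBER.match(query):  # IPv6 or ASN (assumed ARIN)
        return ANICHOST
    return NICHOST
```

With this reading, choose_server("netmeister.org") yields "org.whois-servers.net", while a bare word without a dot falls back to NICHOST.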

Otherwise, if you don't know the WHOIS server to query, you can try your luck asking IANA, which runs a "thick" server for all TLDs. It should return to you the referral to the responsible WHOIS server, which you can then ask for who might be responsible for the final domain you care about:

$ echo netmeister.org | nc whois.iana.org 43 | grep refer
refer:        whois.pir.org
$ echo netmeister.org | nc whois.pir.org 43 | grep refer
$ echo netmeister.org | nc whois.pir.org 43 | grep "Registrar WHOIS Server" 
Registrar WHOIS Server: whois.gandi.net
$ echo netmeister.org | nc whois.gandi.net 43  | grep Creation
Creation Date: 2000-04-24T02:15:22Z
$ 

Notice something? In IANA's response, the referral is labeled "refer", but in PIR's response, it is labeled "Registrar WHOIS Server". This is because the WHOIS protocol does not specify the output format of the data, nor what data should be provided. At all. It's all free form, unstructured ASCII text -- if you're lucky, that is. (More on that (again) a bit below.)
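
Following such referrals by hand gets old quickly, so here is a rough sketch of the loop in Python. It only knows the two labels seen in the example above; other servers may well use different ones. The function names are invented for illustration, and lookup stands for any function that takes a server and a query and returns the response text.

```python
# Labels seen in the examples above; other servers may use different ones.
REFERRAL_KEYS = ("refer", "Registrar WHOIS Server")


def next_server(response: str):
    """Return the first referral host found in a WHOIS response, or None."""
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in REFERRAL_KEYS and value.strip():
            return value.strip()
    return None


def follow_chain(query, lookup, start="whois.iana.org", max_hops=5):
    """Follow referrals from `start`; `lookup(server, query)` returns response text."""
    chain, server = [], start
    # Stop on a missing referral, a loop, or too many hops.
    while server and server not in chain and len(chain) < max_hops:
        chain.append(server)
        server = next_server(lookup(server, query))
    return chain
```

The loop check matters because a registrar's server may list itself as the "Registrar WHOIS Server", which would otherwise send the client in circles.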

Data Privacy

But what data would you expect to be found in WHOIS? Since the early days, ICANN has had a requirement for registries and registrars to provide

unrestricted and public access to accurate and
complete WHOIS information, including registrant,
technical, billing, and administrative contact
information.
ICANN Policies

This includes the actual postal address, phone numbers, and email addresses of the various contact persons or departments (see above re "phonebook"). This is, of course, routinely abused by all sorts of people, including scammers and phishers, and mined for general OSINT. On the other hand, Law Enforcement really wants this information to be readily available, and as a geek with at least half a dozen random domains registered, you are likely familiar with the legal requirement to keep this information up to date.

Quite obviously this poses a dilemma: the information is required by ICANN to be openly provided, but for a variety of reasons and privacy concerns, you don't want your phone number and address out there on the internet. But more than just a cosmetic concern, the ICANN requirement now does indeed conflict with modern privacy laws, such as the EU's GDPR, meaning all domains registered by European registries are in violation of either GDPR or ICANN's requirement. Fun!

(ICANN promised not to take action against violators, and registries/registrars nowadays provide redacted information to the public but promise to provide detailed information upon "legitimate requests".)

Data Format

As I noted above, the data provided via WHOIS is completely unstructured and undefined. It is intended for human consumption, and the service operator is free to decide how to display the information. Most WHOIS servers use a simple "key: value" format, but that's far from universal. Similarly, different servers use different methods to e.g., show that certain pieces of information logically belong together.
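
For the common "key: value" case, a naive parser might look like the Python sketch below. It assumes that lines starting with "%" or "#" are comments and that a repeated key (such as several "address" lines) means a list. Neither is guaranteed by anything, and the function name is made up.

```python
def parse_key_value(text: str) -> dict:
    """Parse "key: value" lines; repeated keys are collected into lists."""
    record = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith(("%", "#")):
            continue  # skip blank lines and comment lines
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key in record:
            if not isinstance(record[key], list):
                record[key] = [record[key]]
            record[key].append(value)
        else:
            record[key] = value
    return record
```

Note that this flattens everything into one dictionary, so it loses exactly the kind of grouping (which address belongs to which contact) that different servers express in different ways.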

For example, consider the information returned by the different WHOIS servers involved in a simple lookup of this website:

$ whois netmeister.org
% IANA WHOIS server
% for more information on IANA, visit
% http://www.iana.org
% This query returned 1 object

refer:        whois.pir.org

domain:       ORG

organisation: Public Interest Registry (PIR)
address:      11911 Freedom Drive 10th Floor,
address:      Suite 1000
address:      Reston, VA 20190
address:      United States

contact:      administrative
name:         Director of Operations, Compliance and Customer Support
organisation: Public Interest Registry (PIR)
address:      11911 Freedom Drive 10th Floor,
address:      Suite 1000
address:      Reston, VA 20190
address:      United States
phone:        +1 703 889 5778
fax-no:       +1 703 889 5779
e-mail:       ops@pir.org

[...]

# whois.pir.org

Domain Name: NETMEISTER.ORG
Registry Domain ID: D25516943-LROR
Registrar WHOIS Server: whois.gandi.net
Registrar URL: http://www.gandi.net
Updated Date: 2021-02-20T17:59:09Z
Creation Date: 2000-04-24T02:15:22Z

[...]

# whois.gandi.net

Domain Name: netmeister.org
[...]
Registry Registrant ID: REDACTED FOR PRIVACY
Registrant Name: REDACTED FOR PRIVACY
[...]
Registry Admin ID: REDACTED FOR PRIVACY
Admin Name: REDACTED FOR PRIVACY
[...]
Registry Tech ID: REDACTED FOR PRIVACY
Tech Name: REDACTED FOR PRIVACY
[...]
>>>Last update of WHOIS database: 2022-01-09T00:16:58Z <<<

Ok, so far, so good. Different grouping, but still, reasonably easy to parse. Now compare this to the following other queries returning results from various WHOIS servers:

$ whois stevens.edu
# whois.educause.edu
Domain Name: STEVENS.EDU

Registrant:
        Stevens Institute of Technology
        Castle Point on Hudson
        Information Technology
        Hoboken, NJ 07030
        USA

Administrative Contact:
        Domain Name Administration
        Stevens Institute of Technology
        Information Technology
        Castle Point on the Hudson
        Hoboken, NJ 07030
        USA
        +1.2012165457
        webmaster@stevens.edu
[...]
$ whois nic.tg
This is JWhoisServer serving ccTLD tg
Java Whois Server 0.4.1.3    (c) 2006 - 2015 Klaus
Zerwes zero-sys.net
All rights reserved.
Copyright "NICTogo2 - http://www.nic.tg"

Domain:.............nic.tg
Registrar:..........NETMASTER SARL
Activation:.........2021-11-11
Expiration:.........2030-06-26
Status:.............Activ&eacute;
Contact Type:.......[PRIVEE]
Last Name:..........[PRIVEE]
First Name:.........[PRIVEE]
Address:............[PRIVEE]
Tel:................[PRIVEE]
Fax:................[PRIVEE]
e-mail:.............[PRIVEE]
Name Server (DB):...ns1.nic.tg
Name Server (DB):...ns2.nic.tg
$ whois norid.no
[...]
Domain Information

NORID Handle...............: NIC311D-NORID
Domain Name................: nic.no
Registrar Handle...........: REG1-NORID
Tech-c Handle..............: NH55R-NORID
Tech-c Handle..............: NS7R-NORID
DNSSEC.....................: Signed

Additional information:
Created:         2004-02-25
Last updated:    2021-02-25
$ whois jprs.jp
Domain Information: [ドメイン情報]
[Domain Name]                   JPRS.JP

[登録者名]                       株式会社日本レジストリサービス
[Registrant]                    Japan Registry Services Co.,Ltd.

[Name Server]                   ns1.jprs.jp
[Name Server]                   ns2.jprs.jp
[Name Server]                   ns3.jprs.jp
[Name Server]                   ns4.jprs.jp
[Signing Key]                   59551 8 2 (
                                F7700A9A545DD57075E545AFE2D823CB
                                90A2C9A1305E1696C61F91BEA26FA137 )
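
To get a feel for what a parser is up against, here is a Python sketch that splits a single line into key and value for just the layouts shown above. The pattern list is an illustration derived from these few examples only. It says nothing about the many other formats out there, and it ignores multi-line blocks such as the "Registrant:" section entirely.

```python
import re

# One pattern per layout seen in the examples above, tried in order.
LINE_PATTERNS = [
    re.compile(r"^\[(?P<key>[^\]]+)\]\s+(?P<value>\S.*)$"),    # [Key]    value
    re.compile(r"^(?P<key>[^.:]+?)\.{2,}:\s*(?P<value>.*)$"),  # Key.....: value
    re.compile(r"^(?P<key>[^:]+):\.{2,}(?P<value>.*)$"),       # Key:.....value
    re.compile(r"^(?P<key>[^:]+):\s+(?P<value>\S.*)$"),        # Key:     value
]


def split_line(line: str):
    """Return (key, value) if the line matches a known layout, else None."""
    stripped = line.strip()
    for pattern in LINE_PATTERNS:
        match = pattern.match(stripped)
        if match:
            return match.group("key").strip(), match.group("value").strip()
    return None
```

Four patterns for four servers does not bode well for the remaining ones.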

Given how useful the information in WHOIS can be, it's no surprise that many businesses offer proprietary services that monetize the munging of this public information into a data format that's easy to process in an automated fashion, such as XML or JSON. As you can tell from the above examples, it's fairly obvious to a human how the information belongs together: humans are really, really good at identifying patterns visually, and you can look at the output and immediately see what data represents what information. Trying to convince a computer to understand all these different formats, however, is a major PITA, and exactly what these services build their profit model on.

Paying for an online service to access public data is a bit annoying, so I wrote a tool to JSONify WHOIS data: jswhois(1). This tool will attempt to turn the unstructured, human-readable output above into structured JSON as shown below:

$ jswhois stevens.edu | jq
{
  "chain": [
    "whois.iana.org",
    "whois.educause.edu"
  ],
  "query": "stevens.edu",
  "whois.educause.edu": {
    "Administrative Contact": [
      "Domain Name Administration",
      "Stevens Institute of Technology",
      "Information Technology",
      "Castle Point on the Hudson",
      "Hoboken, NJ 07030",
      "USA",
      "+1.2012165457",
      "webmaster@stevens.edu"
    ],
[...]
$ jswhois nic.tg | jq
{
  "chain": [
    "whois.iana.org",
    "whois.nic.tg"
  ],
  "query": "nic.tg",
  "whois.nic.tg": {
    "Activation": "2021-11-11",
    "Address": "[PRIVEE]",
    "Domain": "nic.tg",
    "Expiration": "2030-06-26",
    "First Name": "[PRIVEE]",
    "Last Name": "[PRIVEE]",
    "Name Server (DB)": [
      "ns1.nic.tg",
      "ns2.nic.tg"
    ],
[...]
$ jswhois norid.no | jq
{
  "chain": [
    "whois.iana.org",
    "whois.norid.no"
  ],
  "query": "norid.no",
  "whois.norid.no": {
    "Algorithm      1": "8",
    "Created": "1999-11-15",
    "DNSSEC": "Signed",
    "DS Key Tag     1": "44384",
    "Digest         1": "ac8f61c8a538d1e6dbfd98fd86d788b0222994a8842ebabc0df159b354a09f8d",
    "Digest Type    1": "2",
    "Domain Name": "norid.no",
    "Last updated": "2021-12-14",
    "NORID Handle": "NOR18456D-NORID",
    "Name Server Handle": [
      "AUTH681H-NORID",
      "AUTH682H-NORID",
      "Y4H-NORID",
      "Z11H-NORID"
    ],
  }
[...]
$ jswhois jprs.jp | jq
{
  "chain": [
    "whois.iana.org",
    "whois.jprs.jp"
  ],
  "query": "jprs.jp",
  "whois.jprs.jp": {
    "Domain Information": {
      "Domain Information": "[ドメイン情報]",
      "[Domain Name]": "JPRS.JP",
      "[Name Server]": [
        "ns1.jprs.jp",
        "ns2.jprs.jp",
        "ns3.jprs.jp",
        "ns4.jprs.jp"
      ],
      "[Registrant]": "Japan Registry Services Co.,Ltd.",
      "[Signing Key]": [
        "59551 8 2 (",
        "F7700A9A545DD57075E545AFE2D823CB",
        "90A2C9A1305E1696C61F91BEA26FA137 )"
      ],

[...]

This is tedious, sure, but what's even more annoying is that it still is only of limited usefulness: aside from the lack of a data format, there is also no standard specification of what data is to be provided, and for the data that is required at least by ICANN, there is no requirement or specification of how that data is to be named.

That is, if you want to use jswhois(1) to return to you the email address of the administrative contact of the domain in question, then you still have to know what the fields returned by the registrar's WHOIS server are named. Commercial services may attempt to reformat or rename fields so that you have consistent keys to extract, but will that work for all domains? How many different WHOIS formats are there?
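
One blunt workaround for the field-naming problem is to not rely on the names at all and instead walk the whole JSON structure looking for values that resemble what you're after. The Python sketch below does this for email addresses. The regular expression is a deliberately loose approximation, not a proper address validator, and find_emails is a hypothetical helper, not part of jswhois(1).

```python
import re

# Loose approximation of an email address; good enough for spotting candidates.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(node, path=()):
    """Yield (path, address) for every email-looking string in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_emails(value, path + (key,))
    elif isinstance(node, list):
        for item in node:
            yield from find_emails(item, path)
    elif isinstance(node, str):
        for address in EMAIL.findall(node):
            yield path, address
```

The returned path tells you under which keys an address was found, but deciding whether that is the administrative contact is still left to you.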

Registrars and Registries

Looking at a subset of TLDs from my previous adventure, I found a total of 1021 distinct WHOIS servers for 1489 TLDs. Here's the top ten breakdown of which WHOIS servers are responsible for the largest number of TLDs:

 244 whois.iana.org
  67 whois.afilias-srs.net
  46 whois.nic.google
  24 whois.uniregistry.net
  16 whois.registry.in
  14 whois.nic.gmo
   8 whois.gtld.knet.cn
   7 whois.teleinfo.cn
   6 whois.gtlds.nic.br
   5 whois.publicinterestregistry.net

IANA, Afilias, and Uniregistry not surprisingly manage the largest number of TLDs, and as you may remember from the new-TLD-landrush, Google had applied for over 100 TLDs and today runs 46 TLDs. (The largest number of TLDs registered by a single company goes to Donuts Inc. with 248, but they run a separate WHOIS server for each of those TLDs at whois.nic.<tld>.)

But that's only TLDs. There are over 2500 registrars accredited by ICANN, of which e.g., GoDaddy, currently the largest with over 72 million (!) domains, is just one. In theory, for each of the millions of second-level domains, there might be a different WHOIS server responsible, each with its own human-readable output format.

Data in WHOIS

The data found in WHOIS varies from registry to registry, not only in structure (as shown above), but of course also in content. Some include nameserver IP addresses, some don't. Some include DNSSEC information, others don't. I even found an (expired) x509 cert in the WHOIS data for 2001:dcd::/32.

If you search for IP addresses or CIDRs, you get back rather different data than if you search for domain names. APNIC, RIPE, and AFRINIC, for example, even give you some routing and geolocation information:

$ jswhois 2001:dd8:9:2::101:61 | jq
{
  "query": "2001:dd8:9:2::101:61",
  "whois.apnic.net": {
    "inet6num": {
      "geoloc": "-27.473058 153.014208",
      "inet6num": "2001:dd8:8::/45",
[...]
      }
    "route6": {
      "country": "AU",
      "descr": "APNIC Network",
      "last-modified": "2018-11-20T03:36:54Z",
      "mnt-by": "MAINT-APNIC-IS-AP",
      "origin": "AS4608",
      "route6": "2001:dd8:9::/48",
      "source": "APNIC"
    }
[...]

Given the loose specification, you can use the WHOIS protocol and server for just about any data. Team Cymru, for example, lets you look up AS numbers for the given IP addresses using WHOIS:

$ whois -h whois.cymru.com 2001:470:30:84:e276:63ff:fe72:3900
AS      | IP                                       | AS Name
2033    | 2001:470:30:84:e276:63ff:fe72:3900       | PANIX, US
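
Output like this is at least easy to pick apart. A minimal Python sketch, assuming the first line containing a "|" is the header row and every later such line is a data row:

```python
def parse_pipe_table(text: str) -> list:
    """Parse pipe-separated output with a header row into a list of dicts."""
    rows = [line for line in text.splitlines() if "|" in line]
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0].split("|")]
    records = []
    for row in rows[1:]:
        # Split at most len(header) - 1 times so a stray "|" stays in the last field.
        cells = [cell.strip() for cell in row.split("|", len(header) - 1)]
        records.append(dict(zip(header, cells)))
    return records
```

For the response above, this gives a single record with the keys "AS", "IP", and "AS Name".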

And as you've no doubt noticed, some international WHOIS servers may return data to you in non-ASCII charsets, such as whois.kr or whois.jprs.jp. How well do the various WHOIS API services handle what effectively amounts to random data that may be returned? I wonder...

$ whois -h whois.netmeister.org log4j

 ___________________________________________________ 
< ${jndi:ldap://www.netmeister.org/blog/whois.html} >
 --------------------------------------------------- 
        \  ^___^
         \ (ooo)\_______
           (___)\       )\/\
                ||----w |
                ||     ||

Old and busted...

Since the data in WHOIS is unpredictable (who knows what data is returned to you and what the format might be), unreliable (who knows if the data you're looking for, if it is present at all, is up to date), difficult to discover (bouncing from IANA along unpredictable, unreliable referral entries or betting on a few hard-coded servers), possibly available via different mechanisms (besides the standard TCP port 43, several WHOIS servers provide an HTTP API endpoint), and often obscured or redacted (e.g., due to GDPR, but several WHOIS servers also require registration before either TCP port 43 or API access is granted)... why haven't we replaced it with Something Better(tm)?

There were some attempts to overhaul WHOIS, like the "Referral Whois" protocol (RWhois, RFC2167) or the now obsolete "WHOIS++", but it seems like one of those things everybody depends on, so changing it isn't going to be easy.

ICANN decided years ago to replace WHOIS, with work on that dating back to 2012, and the "Registration Data Access Protocol" (RDAP, RFC9082) certainly seems like a much better alternative. RDAP is RESTful and standardized based on an analysis by the IETF of the TLD WHOIS server responses; since 2019, ICANN requires registrars and registries to implement an RDAP service.
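
For comparison, an RDAP domain lookup is a plain HTTP GET that returns JSON. The Python sketch below builds the "domain/<name>" lookup path defined in RFC9082. The base URL is a placeholder you would have to fill in with the RDAP service of the registry or registrar in question, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def rdap_domain_url(base_url: str, domain: str) -> str:
    """Build an RDAP domain lookup URL of the form <base>/domain/<name>."""
    return base_url.rstrip("/") + "/domain/" + urllib.parse.quote(domain)


def rdap_domain(base_url: str, domain: str, timeout: float = 10.0) -> dict:
    """Fetch the RDAP record for a domain and return it as a dictionary."""
    request = urllib.request.Request(
        rdap_domain_url(base_url, domain),
        headers={"Accept": "application/rdap+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Unlike with WHOIS, the result is structured data with standardized member names, so there's no guessing at field labels or layouts.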

Fully replacing WHOIS does, however, not yet seem to be on the horizon, and we're still relying on what started out as perhaps the simplest possible protocol intended for human consumption. Sometimes the internet moves really slowly, and all I can hope is that nobody comes along and tries to put it on the blockchain...
