OSINT Reconnaissance: Information Gathering Techniques
Open Source Intelligence (OSINT) refers to the collection and analysis of information from publicly available sources. In security assessments and penetration testing, OSINT is typically the first phase of reconnaissance, providing valuable insights about a target without direct interaction.
This guide covers essential OSINT techniques and tools for gathering information ethically and legally.
Legal and Ethical Considerations
Before conducting any OSINT research:
- Ensure you have proper authorization if targeting an organization
- Respect privacy laws and terms of service
- Document your findings and methodology
- Never access systems or data without permission
- Be aware of jurisdiction-specific regulations
OSINT should only gather publicly available information. Crossing into unauthorized access is illegal.
Domain and DNS Intelligence
WHOIS Lookups
WHOIS provides registration information for domains:
whois example.com
Information typically includes:
- Registrar details
- Registration and expiration dates
- Name servers
- Contact information (if not privacy-protected)
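Once captured, WHOIS output can be filtered for the fields above. This is a minimal sketch using a made-up sample of output (a live run would pipe `whois example.com` instead of the inline text):

```shell
# Hypothetical WHOIS output for illustration; a real run would be:
#   whois example.com
whois_output='Registrar: Example Registrar, Inc.
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2026-08-13T04:00:00Z
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET'

# Pull out only the fields of interest
echo "$whois_output" | grep -E '^(Registrar|Creation Date|Registry Expiry Date|Name Server):'
```

The same grep works on saved output files, which helps when documenting many domains.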
DNS Enumeration
Gather DNS records to understand the target’s infrastructure:
# All DNS records (many servers now refuse ANY queries per RFC 8482)
dig example.com ANY
# Specific record types
dig example.com MX
dig example.com TXT
dig example.com NS
# Zone transfer attempt (rarely works)
dig axfr @ns1.example.com example.com
Subdomain Discovery
Finding subdomains reveals additional attack surface:
# Using amass
amass enum -d example.com
# Using subfinder
subfinder -d example.com
# Using dnsenum
dnsenum example.com
Manual methods:
- Certificate Transparency logs (crt.sh)
- Search engines with site:*.example.com
- Brute forcing with wordlists
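The wordlist approach can be split into two steps: generating candidate hostnames (shown below, offline) and resolving them (which needs dig and network access). The domain and wordlist here are placeholders:

```shell
# Sketch: build candidate subdomains from a small inline wordlist.
domain="example.com"   # placeholder target
wordlist="www
mail
vpn
dev
staging"

candidates=$(echo "$wordlist" | while read -r sub; do
  echo "${sub}.${domain}"
done)
echo "$candidates"

# A live check would then resolve each candidate, e.g.:
#   dig +short "$name" | grep -q . && echo "$name resolves"
```

In practice a wordlist of thousands of entries (e.g. from SecLists) replaces the inline list.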
Certificate Transparency
Search certificate logs for issued certificates:
curl -s "https://crt.sh/?q=%.example.com&output=json" | jq '.[] | .name_value' | sort -u
This often reveals subdomains, internal hostnames, and related domains.
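The crt.sh name_value field can pack several hostnames into one entry separated by literal \n, and wildcard entries add noise. A small cleanup pass helps; the sample below is made-up jq output standing in for a live query:

```shell
# Hypothetical jq output from the crt.sh query above
ct_names='"www.example.com"
"mail.example.com\nwww.example.com"
"*.example.com"
"dev.example.com"'

# Expand embedded \n, strip quotes, drop wildcards, dedupe
hosts=$(printf '%b\n' "$ct_names" | tr -d '"' | grep -v '^\*' | sort -u)
echo "$hosts"
```

The deduped list feeds directly into later DNS or web reconnaissance steps.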
Search Engine Intelligence
Google Dorking
Advanced search operators help find specific information:
| Operator | Purpose | Example |
|---|---|---|
| site: | Limit to domain | site:example.com |
| filetype: | Specific file types | filetype:pdf |
| intitle: | Text in page title | intitle:"index of" |
| inurl: | Text in URL | inurl:admin |
| intext: | Text in page body | intext:password |
| ext: | File extension | ext:sql |
Useful combinations:
site:example.com filetype:pdf
site:example.com inurl:admin
site:example.com (ext:sql OR ext:bak)
site:example.com intitle:"index of"
Finding exposed files:
site:example.com filetype:log
site:example.com filetype:conf
site:example.com filetype:env
Other Search Engines
Different search engines index different content:
- Bing: Sometimes indexes content Google misses
- DuckDuckGo: Uses Bing’s index with different ranking
- Yandex: Better coverage of Russian and Eastern European sites
- Baidu: Chinese content
Email Intelligence
Email Format Discovery
Common patterns:
- first.last@company.com
- flast@company.com
- firstl@company.com
- first@company.com
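Given a known employee name, the patterns above can be expanded mechanically. This sketch uses placeholder values (jane, doe, company.com) and bash substring expansion:

```shell
# Sketch: expand one name into the common address patterns listed above.
first="jane"; last="doe"; domain="company.com"   # placeholders

emails=$(printf '%s\n' \
  "${first}.${last}@${domain}" \
  "${first:0:1}${last}@${domain}" \
  "${first}${last:0:1}@${domain}" \
  "${first}@${domain}")
echo "$emails"
```

Run over a list of names harvested from social media, this produces candidates for verification.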
Tools for verification:
# Using theHarvester
theHarvester -d example.com -b all
Email Header Analysis
Email headers reveal:
- Originating IP addresses
- Mail servers used
- Authentication results (SPF, DKIM, DMARC)
- Routing path
Analyze headers at online tools or manually:
Received: from mail.sender.com (203.0.113.25)
X-Originating-IP: [198.51.100.7]
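Extracting candidate IPs from a saved header block can be done with a simple pattern match. The header text and addresses below are made-up examples (documentation-range IPs):

```shell
# Sketch: pull IPv4 addresses out of saved email headers.
headers='Received: from mail.sender.example (203.0.113.25)
	by mx.receiver.example; Mon, 1 Jan 2024 00:00:00 +0000
X-Originating-IP: [198.51.100.7]'

ips=$(echo "$headers" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}')
echo "$ips"
```

Each extracted address can then go through the IP intelligence steps later in this guide.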
Social Media Intelligence
Valuable for:
- Employee names and roles
- Technology stack (from job postings)
- Organizational structure
- Company size and growth
Search techniques:
- Current employees at company
- Past employees who might share information
- Job postings revealing technologies used
GitHub and Code Repositories
Developers often accidentally expose sensitive information:
# Search for company repositories
# Look for:
# - API keys and tokens
# - Configuration files
# - Internal documentation
# - Employee usernames
Things to look for in repositories:
- Commit history (deleted secrets may still exist)
- Issues and pull requests
- Contributor information
- README files with internal details
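A first pass over cloned repository content can flag common credential formats. The AKIA and ghp_ prefixes are the documented AWS access-key and GitHub token formats; the values in this sample are fake:

```shell
# Sketch: flag common secret patterns in repository text.
sample='db_password = "hunter2"
aws_key = "AKIAIOSFODNN7EXAMPLE"
token = "ghp_0123456789abcdef0123456789abcdef0123"'

hits=$(echo "$sample" | grep -nE 'AKIA[0-9A-Z]{16}|ghp_[0-9A-Za-z]{36}|password')
echo "$hits"
```

For commit history, the same patterns can be run over `git log -p` output, since deleted secrets often survive in earlier commits.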
Document Metadata
Documents often contain metadata with usernames, software versions, and internal paths:
# Extract metadata with exiftool
exiftool document.pdf
# Using metagoofil
metagoofil -d example.com -t pdf,doc,xls -o output/
Common metadata fields:
- Author
- Creation software
- Internal file paths
- Revision history
Infrastructure Mapping
IP Address Intelligence
Gather information about IP addresses:
# WHOIS for an IP (203.0.113.10 is a documentation-range placeholder)
whois 203.0.113.10
# Reverse DNS
dig -x 203.0.113.10
# ASN lookup via Team Cymru
whois -h whois.cymru.com " -v 203.0.113.10"
Shodan
Shodan indexes internet-connected devices:
# Basic search
shodan search hostname:example.com
# Search by organization
shodan search org:"Example Company"
# Filter by port
shodan search hostname:example.com port:22
Shodan reveals:
- Open ports and services
- Banner information
- SSL certificate details
- Potential vulnerabilities
Censys
Similar to Shodan with different data:
- Focuses on certificates and hosts
- Good for finding assets by certificate attributes
- Historical data available
Web Application Reconnaissance
Technology Fingerprinting
Identify technologies used:
# Using whatweb
whatweb example.com
# Using wappalyzer (browser extension)
# Identifies CMS, frameworks, libraries
Web Archives
The Wayback Machine preserves historical snapshots:
- View old versions of websites
- Find removed content
- Track changes over time
- Discover old endpoints
# Get archived URLs
curl "http://web.archive.org/cdx/search/cdx?url=example.com/*&output=text&fl=original&collapse=urlkey"
robots.txt and sitemap.xml
These files often reveal hidden paths:
curl https://example.com/robots.txt
curl https://example.com/sitemap.xml
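Disallow entries are the paths an administrator wanted crawlers to skip, which makes them worth listing explicitly. The robots.txt body below is a made-up example standing in for the curl output:

```shell
# Hypothetical robots.txt content for illustration
robots='User-agent: *
Disallow: /admin/
Disallow: /backup/
Allow: /public/'

paths=$(echo "$robots" | awk -F': ' '/^Disallow:/ {print $2}')
echo "$paths"
```

Each extracted path becomes a candidate for later (authorized) web application testing.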
Data Aggregation Tools
theHarvester
Comprehensive email and subdomain gathering:
theHarvester -d example.com -b all -l 500
Sources include:
- Search engines
- PGP key servers
- DNS brute force
Maltego
Visual link analysis tool for:
- Mapping relationships
- Discovering connections
- Visualizing infrastructure
SpiderFoot
Automated OSINT collection:
# Run a scan (flags vary by version; check spiderfoot --help)
spiderfoot -s example.com -t all
Collects data from over 100 sources automatically.
Organizing Findings
Documentation
Maintain detailed records:
- Source of each piece of information
- Date collected
- Relevance to assessment
- Confidence level
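A plain CSV is often enough to track these fields. This is a minimal sketch; the column names and helper function are illustrative, not a prescribed format:

```shell
# Sketch: append findings to a CSV log with source, date, and confidence.
log="$(mktemp)"
echo "finding,source,date,confidence" > "$log"

# record <finding> <source> <confidence>
record() { printf '%s,%s,%s,%s\n' "$1" "$2" "$(date +%F)" "$3" >> "$log"; }

record "mail.example.com" "crt.sh" "high"
record "jdoe@example.com" "theHarvester" "medium"
cat "$log"
```

Keeping the source alongside each finding makes cross-referencing (next section) much easier.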
Data Validation
Cross-reference findings:
- Verify from multiple sources
- Check for outdated information
- Distinguish confirmed from speculative data
Reporting Structure
Organize findings by category:
- Domain and DNS information
- Network infrastructure
- Personnel and organizational data
- Technology stack
- Potential vulnerabilities
- Social media presence
Operational Security
When conducting OSINT:
- Use VPN or Tor for anonymity when appropriate
- Create separate research accounts
- Clear browser data and cookies
- Be aware of logging and tracking
- Don’t interact directly with target systems
Practical Workflow
A typical OSINT workflow:
1. Define scope and objectives
2. Passive reconnaissance (no target interaction)
3. Domain and DNS enumeration
4. Search engine intelligence
5. Social media research
6. Document and metadata collection
7. Technology identification
8. Aggregate and analyze findings
9. Document and report
Limitations
OSINT has limitations:
- Information may be outdated
- Not all data is accurate
- Privacy protections hide information
- Requires interpretation and analysis
- Cannot replace active reconnaissance
Conclusion
OSINT is a critical first step in security assessments. By gathering publicly available information, you can map much of a target's attack surface without directly interacting with its systems.
Key principles:
- Stay within legal boundaries
- Document everything
- Verify findings from multiple sources
- Organize data systematically
- Protect your own operational security
The more thorough your OSINT phase, the more effective your subsequent testing will be. Practice these techniques regularly to develop intuition for where valuable information hides.