bulk_extractor is one of those tools on Backtrack that a single article cannot do justice to, but hopefully after reading the below you will see its benefits and understand its basic usage. It reminds me of tools such as PowerGREP for Windows, which can be used during penetration tests to locate private data worth calling out in a deliverable. This is by no means a complete howto for bulk_extractor, but it should shed some light on its purpose and show some easy ways it can be used.
Bulk Extractor Disk Image Scanner Help:
bulk_extractor -h output
Usage: bulk_extractor [options] imagefile
runs bulk extractor and outputs to stdout a summary of what was found where
imagefile – the file to extract
or -R filedir – recurse through a directory of files
SUPPORT FOR E01 FILES COMPILED IN
SUPPORT FOR AFF FILES COMPILED IN
EXIV2 COMPILED IN
-o outdir – specifies output directory. Must not exist.
bulk_extractor creates this directory.
-b banner.txt- Add banner.txt contents to the top and bottom of every output file.
-r alert_list.txt – a file containing the alert list of features to alert
(can be a feature file or a list of globs)
(can be repeated.)
-w stop_list.txt – a file containing the stop list of features (white list)
(can be a feature file or a list of globs)
(can be repeated.)
-F <rfile> – Read a list of regular expressions from <rfile> to find
-f <regex> – find occurrences of <regex>; may be repeated.
results go into find.txt
-q nn – Quiet; only print every nn status reports
-C NN – specifies the size of the context window (default 16)
-G NN – specify the page size (default 16777216)
-g NN – specify margin (default 1048576)
-W n1:n2 – Specifies minimum and maximum word size
(default is -w6:14)
-j NN – Number of threads to run (default 1)
Path Processing Mode:
-p <path>/f – print the value of <path> with a given format.
formats: r = raw; h = hex.
Specify -p - for interactive mode.
Specify -p -http for HTTP mode.
-Y <o1> – Start processing at o1 (o1 may be 1, 1K, 1M or 1G)
-Y <o1>-<o2> – Process o1-o2
-A <off> – Add <off> to all reported feature offsets
-V print version number
-c – Enable Crash Protection
-M nn – sets max recursion depth (default 5)
-z nn – start on page nn
-dN – debug mode (see source code)
-Z – zap (erase) output directory
Control of Scanners:
-e wordlist – enable scanner wordlist
-x accts – disable scanner accts
-x base64 – disable scanner base64
-x kml – disable scanner kml
-x email – disable scanner email
-x gps – disable scanner gps
-x aes – disable scanner aes
-x json – disable scanner json
-x exif – disable scanner exif
-x zip – disable scanner zip
-x gzip – disable scanner gzip
-x pdf – disable scanner pdf
-x hiber – disable scanner hiber
-x winprefetch – disable scanner winprefetch
Bulk Extractor Details, Path, Etc.:
bulk_extractor scans a disk image for interesting data such as credit card numbers, email addresses, domain names, URLs, telephone numbers, text messages, etc. You can also specify which scanners to enable or disable, which makes the output highly customizable. One of the really cool features is how simple it is to add regular expression searches, either one at a time with a command line switch or as a list of regular expressions read from a file. While bulk_extractor has many more uses, we are going to concentrate on a few simple switches and show how it can be used to search a hard drive partition. I believe some of the confusion around bulk_extractor comes in when people don’t have RAW images to search and instead try to use it to search specific paths on a server, which is not the intended usage as far as I understand it. bulk_extractor is located at /usr/local/bin/bulk_extractor, which is in the path, so you can launch it from any location within Backtrack. The first example below shows a fairly simple scan of the partition that holds the Backtrack root filesystem, which is typically /dev/sda1.
Bulk Extractor Example Scanning A Mounted Partition:
root@bt:~# bulk_extractor -o betest1 -R /dev/sda1 -M 1 -j 20 -q 60
bulk_extractor version:1.2.0
Hostname: bt
Input file: /dev/sda1
Output directory: betest1
Disk Size: 30836523008
Threads: 20
Phase 1.
22:37:58 Offset 0MB (0.00%) Done in n/a at 22:37:57
XMP Toolkit error 201: XML parsing failure
Warning: Failed to decode XMP metadata.
Warning: JPEG format error, rc = 4
Warning: JPEG format error, rc = 7
I used Control-C to end this command because it was going to take over an hour to complete and I simply wanted enough data to demonstrate basic bulk_extractor usage. Also notice the warnings above, which are common when scanning an entire hard drive. I assume they simply mean there were issues reading a specific file because of formatting or other reasons; bulk_extractor continued to work without issue and still wrote data to the different output files.
The -o switch specifies the output directory where all of the results are stored; per the help output, this directory must not already exist because bulk_extractor creates it. You can also use the -b switch to add a warning banner to the top of each output file. The -R switch is documented as recursing through a directory of files; here it is pointed at /dev/sda1, the partition that holds the root (/) filesystem in a default Backtrack Linux installation, and the "Input file: /dev/sda1" line in the output shows that bulk_extractor accepted it. The -M switch sets the max recursion depth, which in this case is 1. The -j switch sets the number of threads, which in this case is 20. More threads can speed up the scan, but tune this to the power of the computer running bulk_extractor, since going overboard can make the load shoot through the roof. Finally, the -q switch thins out the status messages printed to the terminal: with -q 60 only every 60th status report is displayed, so the value counts reports, not seconds. The status report messages look like the line directly underneath the Phase 1 line in the example above. Below is a file listing of the betest1 directory that was created during the scan; keep in mind that the scan had been running for only a few minutes.
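For anyone who wants to script the scan, here is a minimal Python sketch that assembles the same command line as a list of arguments. The function name build_bulk_extractor_cmd and its parameters are hypothetical names made up for this illustration; only the switches themselves (-o, -R, -M, -j, -q) come from the help output above. Defaulting the thread count to the number of CPU cores is my assumption, not something the tool requires.

```python
import os
import shlex


def build_bulk_extractor_cmd(image, outdir, max_depth=1, threads=None, quiet_every=60):
    """Return the argv list for a scan like the one shown in this article."""
    if threads is None:
        # Assumption: one thread per CPU core is a reasonable starting point.
        threads = os.cpu_count() or 1
    return [
        "bulk_extractor",
        "-o", outdir,            # output directory; must not already exist
        "-R", image,             # input, passed with -R exactly as in the article
        "-M", str(max_depth),    # maximum recursion depth
        "-j", str(threads),      # number of threads
        "-q", str(quiet_every),  # print only every nn-th status report
    ]


if __name__ == "__main__":
    cmd = build_bulk_extractor_cmd("/dev/sda1", "betest1", threads=20)
    # Print the command instead of running it; pass cmd to subprocess.run to execute.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if the image path or output directory contains spaces.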
Bulk Extractor Example List Of Output Files:
root@bt:~# ls -alh betest1/
total 3.0M
drwxr-xr-x 2 root root 4.0K 2012-04-30 22:37 .
drwx------ 39 root root 4.0K 2012-04-30 22:37 ..
-rw-r--r-- 1 root root 0 2012-04-30 22:37 aes_keys.txt
-rw-r--r-- 1 root root 8.9K 2012-04-30 22:40 alerts.txt
-rw-r--r-- 1 root root 0 2012-04-30 22:37 ccn_track2.txt
-rw-r--r-- 1 root root 434 2012-04-30 22:39 ccn.txt
-rw-r--r-- 1 root root 1.2M 2012-04-30 22:40 domain.txt
-rw-r--r-- 1 root root 165K 2012-04-30 22:40 email.txt
-rw-r--r-- 1 root root 28K 2012-04-30 22:40 ether.txt
-rw-r--r-- 1 root root 0 2012-04-30 22:37 exif.txt
-rw-r--r-- 1 root root 0 2012-04-30 22:37 find.txt
-rw-r--r-- 1 root root 0 2012-04-30 22:37 gps.txt
-rw-r--r-- 1 root root 94K 2012-04-30 22:40 json.txt
-rw-r--r-- 1 root root 0 2012-04-30 22:37 kml.txt
-rw-r--r-- 1 root root 7.9K 2012-04-30 22:39 report.xml
-rw-r--r-- 1 root root 17K 2012-04-30 22:40 rfc822.txt
-rw-r--r-- 1 root root 8.5K 2012-04-30 22:39 telephone.txt
-rw-r--r-- 1 root root 1.5M 2012-04-30 22:40 url.txt
-rw-r--r-- 1 root root 0 2012-04-30 22:37 winprefetch.txt
-rw-r--r-- 1 root root 1.9K 2012-04-30 22:39 zip.txt
root@bt:~#
As you can see, the results from just a few minutes of scanning are pretty amazing. We located what appear to be credit card numbers, domain names, email addresses, Ethernet (MAC) addresses, JSON data, rfc822 message headers, telephone numbers, URLs, and information about the contents of zip files. Obviously not every hit is the exact type of data its file name suggests, but using bulk_extractor could dramatically cut the time spent analyzing the data on a partition or image file. To give an idea of the contents of each text file, I have listed one line from each in the example output below. Keep in mind this is all from a default Backtrack Linux installation, so all of the data is publicly available.
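To get a quick overview of which output files actually contain hits, something like the following Python sketch can help. It is only an illustration: summarize_output_dir is a hypothetical helper name, and it assumes that lines starting with # are header comments rather than features (the find.txt example in this article begins with such a line).

```python
import os


def summarize_output_dir(outdir):
    """Map each non-empty .txt feature file in outdir to its number of feature lines."""
    counts = {}
    for name in sorted(os.listdir(outdir)):
        path = os.path.join(outdir, name)
        # Skip report.xml and any feature file that is still empty.
        if not name.endswith(".txt") or os.path.getsize(path) == 0:
            continue
        with open(path, "r", encoding="utf-8", errors="replace") as fh:
            # Assumption: lines starting with '#' are header comments, not features.
            counts[name] = sum(
                1 for line in fh if line.strip() and not line.startswith("#")
            )
    return counts
```

Calling summarize_output_dir("betest1") would return a dictionary keyed by file name, which is easier to skim than the raw directory listing when a scan produces many files.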
Bulk Extractor Output Files Example Content:
ccn.txt
453991358 5603777216154289 32c95\015\0125603316\015\0125603777216154289\015\0125603916\015\0125604\015
domain.txt
50385152 redhat.com lised: email@example.com\012Apr 13 12:46:10
email.txt
50385143 firstname.lastname@example.org 9) initialised: email@example.com\012Apr 13 12:46:10
ether.txt
50389204 00:0c:29:bb:ef:22 970A at 0x2000, 00:0c:29:bb:ef:22 assigned IRQ 19
json.txt
151126863 ["SQLI", "MULTI", "REDIRECT", "RCE", "RFI", "LFI", "UPLOAD", "UNKNOWN", "XSS"]
rfc822.txt
486916201 Host: appworld.blackberry.com cc=-1 HTTP/1.1\015\012Host: appworld.blackberry.com\015\012Connection: cl
telephone.txt
453001821 (858) 373-8773 1024)\015\012(8./CB6\015\012(858) 373-8773\015\012(9MeN0)\015\012(:pcp
url.txt
50768235 https://www.isc.org/software/dhcp/ o, please visit https://www.isc.org/software/dhcp/\012Apr 15 14:49:52
zip.txt
294614565 ReflectiveDllInjection/HS-P005_ReflectiveDllInjection.pdf <zipinfo><name>ReflectiveDllInjection/HS-P005_ReflectiveDllInjection.pdf</name><name_len>57</name_len><version>20</version><compression_method>2</compression_method><uncompr_size>165921</uncompr_size><compr_size>156732</compr_size><lastmodtime>8</lastmodtime><lastmoddate>22776</lastmoddate><crc32>1681558789</crc32><extra_field_len>0</extra_field_len><disposition bytes='163241'>decompressed</disposition></zipinfo>
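The man page further down describes each output line as a byte offset, a tab, and the feature; in the samples above the feature is followed by a snippet of surrounding context. Here is a small Python sketch that splits one line into those parts. The names Feature and parse_feature_line are made up for this example, and treating the context as a third tab-separated column is my assumption based on the samples. The offset is kept as a string because I would not assume it is always a plain integer.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    offset: str   # kept as text; not assumed to always be a plain integer
    feature: str  # the extracted item, e.g. an email address or URL
    context: str  # surrounding bytes as printed in the file; may be empty


def parse_feature_line(line: str) -> Optional[Feature]:
    """Split one feature-file line into offset, feature and context."""
    line = line.rstrip("\r\n")
    # Skip blank lines and '#' header comments.
    if not line or line.startswith("#"):
        return None
    parts = line.split("\t", 2)
    if len(parts) < 2:
        return None  # not a well-formed feature line
    context = parts[2] if len(parts) > 2 else ""
    return Feature(parts[0], parts[1], context)
```

For example, feeding it the ether.txt line above (with real tabs between the columns) would give a Feature whose feature field is 00:0c:29:bb:ef:22.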
So now you have an idea of what bulk_extractor will output, how to run a simple scan, and how to configure some of the more widely used switches. I recommend reading the output of “bulk_extractor -h”, which is included near the top of this article, and the output of “man bulk_extractor”, which is included near the bottom, so that you are aware of the different switches and how to tune both the scan itself and the contents of the output. One other quick example I wanted to show is the -f switch, which lets you search for a specific regular expression in addition to the built-in scanners. Also keep in mind that you could create a list of regular expressions and feed them into bulk_extractor using the -F command line switch.
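If you go the -F route, a short Python sketch like the one below could generate the list file. Two things here are assumptions on my part: that the file holds one regular expression per line, and that checking each pattern with Python's re module is a useful sanity check even though bulk_extractor's own regex engine may not accept exactly the same syntax. The function name write_regex_file and the second pattern are hypothetical.

```python
import re


def write_regex_file(path, patterns):
    """Sanity-check each pattern, then write them to path, one per line."""
    for pattern in patterns:
        # Raises re.error on a malformed pattern. This only checks Python's
        # regex syntax, which may differ from what bulk_extractor accepts.
        re.compile(pattern)
    with open(path, "w", encoding="utf-8") as fh:
        for pattern in patterns:
            fh.write(pattern + "\n")  # assumption: one expression per line
    return len(patterns)


if __name__ == "__main__":
    # "[Pp]assword" is a made-up second pattern for illustration.
    write_regex_file("regex_list.txt", ["[Bb]acktrack", "[Pp]assword"])
```

The resulting regex_list.txt would then be passed to bulk_extractor with -F regex_list.txt.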
Use bulk_extractor To Search Regular Expressions:
root@bt:~# bulk_extractor -E email -f '[Bb]acktrack' -o betest2 -R /dev/sda1 -M 1 -j 20 -q 30
bulk_extractor version:1.2.0
Hostname: bt
Input file: /dev/sda1
Output directory: betest2
Disk Size: 30836523008
Threads: 20
Phase 1.
23:15:40 Offset 0MB (0.00%) Done in n/a at 23:15:39
23:16:06 Offset 503MB (1.63%) Done in 0:26:07 at 23:42:13
^C
root@bt:~#
In the above example you will notice two more switches: the -E switch, which disables all scanners except the one specified, and the -f switch, which lets you specify a regular expression on the command line. Here we enable only the email scanner, which produces the feature files shown in the listing below, and search for the regular expression “[Bb]acktrack”, which matches Backtrack or backtrack anywhere on the partition being scanned. Below is the list of output files created by the above command; all of them are familiar, but this time find.txt is populated with the hits for the regular expression.
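To make it clearer what “[Bb]acktrack” does and does not match, here is a small Python sketch that runs the same character-class pattern over a chunk of raw bytes and reports the offset, the match, and some surrounding context. It uses Python's re module, not bulk_extractor's own regex engine, and find_hits is a hypothetical name. Taking 16 bytes of context on each side is my guess at how the default context window of 16 behaves.

```python
import re

# Same pattern as on the command line: capital or lowercase first letter only.
PATTERN = re.compile(rb"[Bb]acktrack")


def find_hits(data: bytes, context: int = 16):
    """Return (offset, match, surrounding bytes) for every hit in data."""
    hits = []
    for m in PATTERN.finditer(data):
        # Assumption: the context window extends this many bytes on each side.
        start = max(0, m.start() - context)
        end = min(len(data), m.end() + context)
        hits.append((m.start(), m.group().decode("ascii"), data[start:end]))
    return hits


if __name__ == "__main__":
    sample = b"xx Backtrack yy backtrack zz BACKTRACK"
    for offset, match, ctx in find_hits(sample):
        print(offset, match, ctx)
```

Note that the all-caps BACKTRACK in the sample is not reported, because the character class only covers the first letter.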
Bulk Extractor Regular Expression Search & Email Scanner Output Files:
root@bt:~# ls -alh betest2/
total 1.1M
drwxr-xr-x 2 root root 4.0K 2012-04-30 23:15 .
drwx------ 40 root root 4.0K 2012-04-30 23:15 ..
-rw-r--r-- 1 root root 0 2012-04-30 23:15 alerts.txt
-rw-r--r-- 1 root root 423K 2012-04-30 23:16 domain.txt
-rw-r--r-- 1 root root 78K 2012-04-30 23:16 email.txt
-rw-r--r-- 1 root root 20K 2012-04-30 23:16 ether.txt
-rw-r--r-- 1 root root 669 2012-04-30 23:16 find.txt
-rw-r--r-- 1 root root 5.6K 2012-04-30 23:16 report.xml
-rw-r--r-- 1 root root 1.5K 2012-04-30 23:16 rfc822.txt
-rw-r--r-- 1 root root 505K 2012-04-30 23:16 url.txt
root@bt:~#
Bulk Extractor Regular Expression Search Example Results: find.txt
root@bt:~# less betest2/find.txt
# Feature File Version: 1.1
323395885 backtrack /.svn/prop-base/backtrack5_modifier.rb.sv
323396324 backtrack fier/.svn/props/backtrack5_modifier.rb.sv
323396759 backtrack /.svn/text-base/backtrack5_modifier.rb.sv
323399071 backtrack er/.svn/wcprops/backtrack5_modifier.rb.sv
323399840 backtrack ib/lab/modifier/backtrack5_modifier.rbe<8D>A
324657160 backtrack /prop-base/dasm-backtrack.rb.svn-basePK\003\004
324662240 backtrack .svn/props/dasm-backtrack.rb.svn-workPK\003\004
324675336 backtrack /text-base/dasm-backtrack.rb.svn-baseMQMo
324768219 backtrack vn/wcprops/dasm-backtrack.rb.svn-work\025<C9><CB>\016
324786413 backtrack sm/samples/dasm-backtrack.rbMQMo<DB>0\014<BD><F3>W<B0><C8>!
root@bt:~#
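The context column above contains sequences such as \003\004 and \015\012, which look like three-digit octal escapes for non-printable bytes. The Python sketch below turns them back into raw bytes. The helper name unescape_octal is hypothetical, and the sketch assumes that backslash followed by three octal digits is the only escape form in the file. The <8D> style markers in the listing are left alone, since I assume they are added by less when it displays high bytes.

```python
import re

# A backslash followed by three octal digits in the range 000-377.
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def unescape_octal(text: str) -> bytes:
    """Convert \\ooo octal escapes in a context string back to raw bytes."""
    raw = text.encode("latin-1")  # assumption: the context is single-byte text
    return _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)


if __name__ == "__main__":
    decoded = unescape_octal(r"dasm-backtrack.rb.svn-basePK\003\004")
    print(decoded)
```

Decoded, the PK\003\004 at the end of that context becomes the four bytes that start a ZIP local file header, which fits the fact that these hits were found inside archive data.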
So there you have the basic usage of the bulk_extractor command. Keep in mind that bulk_extractor is primarily meant to search RAW image files; however, I find it extremely handy for searching partitions. Below is a list of the output file names and their descriptions that I pulled directly from the old bulk_extractor web site.
The bulk_extractor Output Files & Their Descriptions:
- alerts.txt: Processing errors.
- ccn.txt: Credit card numbers
- ccn_track2.txt: Credit card “track 2” information, which has previously been found in some bank card fraud cases.
- domain.txt: Internet domains found on the drive, including dotted-quad addresses found in text.
- email.txt: Email addresses.
- ether.txt: Ethernet MAC addresses found through IP packet carving of swap files and compressed system hibernation files and file fragments.
- exif.txt: EXIFs from JPEGs and video segments. This feature file contains all of the EXIF fields, expanded as XML records.
- find.txt: The results of specific regular expression search requests.
- ip.txt: IP addresses found through IP packet carving.
- rfc822.txt: Email message headers including Date:, Subject: and Message-ID: fields.
- tcp.txt: TCP flow information found through IP packet carving.
- telephone.txt: US and international telephone numbers.
- url.txt: URLs, typically found in browser caches, email messages, and pre-compiled into executables.
- url_searches.txt: A histogram of terms used in Internet searches from services such as Google, Bing, Yahoo, and others.
- url_services.txt: A histogram of the domain name portion of all the URLs found on the media.
- wordlist.txt: A list of all “words” extracted from the disk, useful for password cracking.
- wordlist_*.txt: The wordlist with duplicates removed, formatted in a form that can be easily imported into a popular password-cracking program.
- zip.txt: A file containing information regarding every ZIP file component found on the media. This is exceptionally useful as ZIP files contain internal structure and ZIP is increasingly the compound file format of choice for a variety of products such as Microsoft Office.
Last but not least, the entire bulk_extractor man page is included below, and it is definitely worth checking out if you plan on using the bulk_extractor tool within Backtrack Linux!
Bulk Extractor Scan Utility Man Page:
bulk_extractor man page
bulk_extractor – Scans a disk image for regular expressions and other content.
bulk_extractor -o output_dir [options] [ image | -R dir ]
bulk_extractor scans a disk image (or any other file) for a large number of pre-defined regular expressions and other kinds of content.
These items are called features. When it finds a feature, bulk_extractor writes the output to an output file. Each line of the output
file contains a byte offset at which the feature was found, a tab, and the actual feature. Features therefore cannot contain the end-of-line character.
bulk_extractor includes native support for EnCase (.E01) and AFFLIB (.aff) files, if it is compiled and linked on a system containing those
libraries. Alternatively, the -R option can be used to recursively scan and process a directory of individual files (disk images in such
a directory will be treated as files, not as disk images).
bulk_extractor is multi-threaded. By specifying the -j option, multiple copies of the program can be run. Each thread writes its results
into its own feature file. The files are then combined by the primary thread when all of the secondary threads complete.
bulk_extractor is a two-phase program. In phase 1 the features are extracted. In phase 2 a histogram is created of relevant features.
bulk_extractor will also create a wordlist of all the words that are found in the disk image. This can be used as a dictionary for
The options are as follows:
-o output_dir Specifies the output directory, which will be created by bulk_extractor.
-b bannerfile.txt Read the contents of bannerfile.txt and stamp it at the beginning of each output file. This might be useful if you have some kind
of privacy banner that needs to be stamped at the top of all of your files.
-r alert_list.txt Specifies an alert list, (or red list), which is a list of terms that, if found, will be specifically flagged in a special alert
file that begins with the letters ALERT. The alert list may contain individual terms, which must be found in their entirety and
are case-sensitive, or wildcards with standard Unix globbing (e.g. *@company.com). Globbed terms are case-insensitive.
-w stop_list.txt Specifies a stop list, (or white list), which is a list of terms that, if found, will be placed in a special stopped file (rather
than in the main file). The whitelist may also contain globbed terms.
Specifies a context-dependent stoplist, which is a list of tokens to be stopped (but only when they have a particular context).
-p path/format Open a disk image and print the information found at path. The format specification may be r for raw output and h for hex output.
-F Specifies a file of regular expressions to be used as search terms.
-f Specifies a regular expression to be used as a search term.
-q nn Quiet mode. Only prints every nn status reports.
These commands are useful for tuning operation:
-C n Specifies the size of the context window.
-g n Specifies the size of the margin in bytes.
-W n1:n2 The scan_wordlist scanner should only extract words that are between n1 and n2 characters in length.
-j n Use n threads for analysis. Normally you do not need to specify this, as the default is the number of processors on the current computer.
The following commands are useful for debugging:
-V Print the version number
Restarts the program from where it left off for a particular directory.
-B nn Set the dedup Bloom filter to nn bits. This is used by the scan_wordlist scanner.
-M nn Specifies a maximum recursion depth of nn.
-z pagenum Start on page number pagenum.
-Y Start at offset
-dN Enable debugging level N.
Finally, you can control scanners with these options:
-E scanner Turns off all scanners, then enables scanner scanner.
-e scanner Enables a scanner.
-x scanner Disables a scanner.
bulk_extractor is based on a feature extractor and named entity recognizer developed for SBook in 1991. The feature extractor was repurposed
for disk images in 2003. The stand-alone bulk_extractor program was rewritten in 2005 and publicly released in 2007. The multi-
threaded bulk_extractor was released in May 2010.
User Manuals MAY 2010 BULK_EXTRACTOR(1)