In my previous article "Elastic:导入 Word 及 PDF 文件到 Elasticsearch 中" (Elastic: importing Word and PDF files into Elasticsearch), I described in detail how to install FSCrawler and use it to ingest Word and PDF files. In that article we installed it from an unpacked release archive. In this article, we will look at how to install it with Docker and ingest the files we need. FSCrawler can be found at GitHub - dadoonet/fscrawler: Elasticsearch File System Crawler (FS Crawler).
We first create a few simple Word and PDF files, as described in that earlier article, and place them in our directory:
$ pwd
/Users/liuxg/tmp/docs
$ ls
doc1.docx doc2.pdf doc3.docx
You can also download some test documents from https://github.com/dadoonet/fscrawler/tree/master/test-documents/src/main/resources/documents. They cover essentially all of the document types needed for testing, and the file extensions in that listing show which formats FSCrawler currently supports.
If you don't have your own Elasticsearch and Kibana installed yet, please refer to my earlier articles to get them up and running.
Once they are installed, our Elasticsearch can be accessed at https://localhost:9200, or at your machine's private address, https://privateIP:9200. You can check your machine's IP address with the following command:
$ ifconfig | grep 192
inet 192.168.0.3 netmask 0xffffff00 broadcast 192.168.0.255
We pull the Docker image with the following command:
docker pull dadoonet/fscrawler
$ docker pull dadoonet/fscrawler
Using default tag: latest
latest: Pulling from dadoonet/fscrawler
1fe172e4850f: Pull complete
44d3aa8d0766: Pull complete
6ce99fdf16e8: Pull complete
9c8cd828df6c: Pull complete
4760215418fb: Pull complete
8ac9abc1945a: Pull complete
Digest: sha256:2950edb12619187de9823303ed0cf1a4dc2219f8faa80e3a7bdfbe46cb690a69
Status: Downloaded newer image for dadoonet/fscrawler:latest
docker.io/dadoonet/fscrawler:latest
Note: this image is quite big (1.2+ GB) as it contains Tesseract and all of its trained language data. If you don't want to use OCR at all, you can pull the much smaller image (about 530 MB) dadoonet/fscrawler:noocr instead:
docker pull dadoonet/fscrawler:noocr
Let's say your documents are in the ~/tmp directory and you want to store your FSCrawler jobs in ~/.fscrawler. You can run FSCrawler with the following command:
docker run -it --rm -v ~/.fscrawler:/root/.fscrawler -v ~/tmp:/tmp/es:ro dadoonet/fscrawler fscrawler job_name
In my case, the documents are in the ~/tmp/docs directory, so we use the following command:
docker run -it --rm -v ~/.fscrawler:/root/.fscrawler -v ~/tmp/docs:/tmp/es:ro dadoonet/fscrawler fscrawler job_name
$ docker run -it --rm -v ~/.fscrawler:/root/.fscrawler -v ~/tmp/docs:/tmp/es:ro dadoonet/fscrawler fscrawler job_name
07:24:32,475 INFO [f.console] ,----------------------------------------------------------------------------------------------------.
| ,---,. .--.--. ,----.. ,--, 2.10-SNAPSHOT |
| ,' .' | / / '. / / \ ,--.'| |
| ,---.' || : /`. / | : : __ ,-. .---.| | : __ ,-. |
| | | .'; | |--` . | ;. /,' ,'/ /| /. ./|: : ' ,' ,'/ /| |
| : : : | : ;_ . ; /--` ' | |' | ,--.--. .-'-. ' || ' | ,---. ' | |' | |
| : | |-, \ \ `. ; | ; | | ,'/ \ /___/ \: |' | | / \ | | ,' |
| | : ;/| `----. \| : | ' : / .--. .-. | .-'.. ' ' .| | : / / |' : / |
| | | .' __ \ \ |. | '___ | | ' \__\/: . ./___/ \: '' : |__ . ' / || | ' |
| ' : ' / /`--' /' ; : .'|; : | ," .--.; |. \ ' .\ | | '.'|' ; /|; : | |
| | | | '--'. / ' | '/ :| , ; / / ,. | \ \ ' \ |; : ;' | / || , ; |
| | : \ `--'---' | : / ---' ; : .' \ \ \ |--" | , / | : | ---' |
| | | ,' \ \ .' | , .-./ \ \ | ---`-' \ \ / |
| `----' `---` `--`---' '---" `----' |
+----------------------------------------------------------------------------------------------------+
| You know, for Files! |
| Made from France with Love |
| Source: https://github.com/dadoonet/fscrawler/ |
| Documentation: https://fscrawler.readthedocs.io/ |
`----------------------------------------------------------------------------------------------------'
07:24:32,487 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [160.6mb/2.8gb=5.48%], RAM [8.2gb/11.4gb=72.44%], Swap [1023.9mb/1023.9mb=100.0%].
07:24:32,528 INFO [f.console] job [job_name] does not exist
07:24:32,528 INFO [f.console] Do you want to create it (Y/N)?
y
07:24:35,483 INFO [f.console] Settings have been created in [/root/.fscrawler/job_name/_settings.yaml]. Please review and edit before relaunch
As the output above shows, on the first run, if the job does not yet exist in ~/.fscrawler, FSCrawler asks whether you want to create it.
Note: the configuration file is actually stored on your machine at ~/.fscrawler/job_name/_settings.yaml. Remember to change the URL of your Elasticsearch instance, because the container cannot see it running at the default 127.0.0.1; you will need to use the actual IP address of the host.
Next, we edit the _settings.yaml file:
$ pwd
/Users/liuxg/.fscrawler/job_name
$ vi _settings.yaml
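Here is a minimal sketch of what the edited file can look like, assuming Elasticsearch is reachable at https://192.168.0.3:9200 (the host IP we found earlier) and the elastic superuser password is changeme:

---
name: "job_name"
fs:
  # /tmp/es is the path inside the container that ~/tmp/docs is mounted to
  url: "/tmp/es"
elasticsearch:
  nodes:
    - url: "https://192.168.0.3:9200"
  username: "elastic"
  password: "changeme"
  # convenience only; don't disable certificate verification in production
  ssl_verification: false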

As shown above, this is the configuration we need. For convenience, we deliberately set ssl_verification to false. You should adjust the Elasticsearch endpoint and the user credentials to match your own deployment. Save the changes, then run Docker again:
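docker run -it --rm -v ~/.fscrawler:/root/.fscrawler -v ~/tmp/docs:/tmp/es:ro dadoonet/fscrawler fscrawler job_name

This is the same command as before. This time the job already exists, so FSCrawler crawls /tmp/es (the mounted ~/tmp/docs directory) and indexes the documents into Elasticsearch.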

We then go to the Kibana console and have a look:
GET _cat/indices

We can see that two new indices have been created (both in the open state): job_name_folder and job_name.
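To check that the document contents were actually indexed, we can also search the job_name index directly. FSCrawler stores the extracted text in the content field; the search term below is only an example and depends on what your test documents contain:

GET job_name/_search
{
  "query": {
    "match": {
      "content": "Elasticsearch"
    }
  }
}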


Next, let's deploy the whole stack, including FSCrawler, with docker-compose. In a working directory, we create the file structure shown at the end of this section; a sketch of the commands follows.
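On macOS or Linux, you can create the skeleton with commands along these lines (the names match what the docker-compose.yml below and the final tree expect):

mkdir -p config/job_name data logs
touch .env docker-compose.yml config/job_name/_settings.yaml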

For Elasticsearch, we create the config/job_name/_settings.yaml file with the following configuration:
---
name: "idx"
fs:
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "changeme"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"
Note: the configuration shown above also enables the REST interface. It likewise activates full indexing of the documents' content (indexed_chars: 100%), language detection, and OCR in English. The URL https://elasticsearch:9200 uses the compose service name, which is resolvable inside the compose network. You can adapt this example to your needs.
We create the docker-compose.yml file following https://github.com/dadoonet/fscrawler/blob/master/contrib/docker-compose-example-elasticsearch/docker-compose.yml:
docker-compose.yml
---
version: "2.2"

services:
  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        elif [ x${KIBANA_PASSWORD} == x ]; then
          echo "Set the KIBANA_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f certs/certs.zip ]; then
          echo "Creating certs";
          echo -ne \
          "instances:\n"\
          "  - name: elasticsearch\n"\
          "    dns:\n"\
          "      - elasticsearch\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://elasticsearch:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Setting kibana_system password";
        until curl -s -X POST --cacert config/certs/ca/ca.crt -u elastic:${ELASTIC_PASSWORD} -H "Content-Type: application/json" https://elasticsearch:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
        echo "All done!";
      '
    healthcheck:
      test: ["CMD-SHELL", "[ -f config/certs/elasticsearch/elasticsearch.crt ]"]
      interval: 1s
      timeout: 5s
      retries: 120

  elasticsearch:
    depends_on:
      setup:
        condition: service_healthy
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - esdata:/usr/share/elasticsearch/data
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=elasticsearch
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=elasticsearch
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.http.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.http.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.transport.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  kibana:
    depends_on:
      elasticsearch:
        condition: service_healthy
    image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
    volumes:
      - certs:/usr/share/kibana/config/certs
      - kibanadata:/usr/share/kibana/data
    ports:
      - ${KIBANA_PORT}:5601
    environment:
      - SERVERNAME=kibana
      - ELASTICSEARCH_HOSTS=https://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
      - ENTERPRISESEARCH_HOST=http://enterprisesearch:${ENTERPRISE_SEARCH_PORT}
    mem_limit: ${MEM_LIMIT}
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler:$FSCRAWLER_VERSION
    container_name: fscrawler
    restart: always
    volumes:
      - ~/tmp/docs/:/tmp/es:ro
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler job_name --restart --rest

volumes:
  certs:
    driver: local
  esdata:
    driver: local
  kibanadata:
    driver: local
Note: the configuration shown above also starts Kibana. If you don't need it, you can skip that part.
Above, I placed the test documents under the path ~/tmp/docs/. You can change this to match your own document path. We also defined job_name as the name of our job; if you change it to something else, you need to replace it in the files above and rename the corresponding config/job_name directory as well.
Careful readers may notice that this docker-compose.yml is very similar to the one in my earlier article "Elasticsearch:使用 Docker compose 来一键部署 Elastic Stack 8.x" (deploying Elastic Stack 8.x in one step with Docker Compose), so I won't go through it in detail here.
To supply the environment variables used in docker-compose.yml, we also need to create the following file in the directory where docker-compose.yml lives:
.env
# FSCrawler Settings
FSCRAWLER_VERSION=2.10-SNAPSHOT
FSCRAWLER_PORT=8080

# Password for the 'elastic' user (at least 6 characters)
ELASTIC_PASSWORD=changeme

# Password for the 'kibana_system' user (at least 6 characters)
KIBANA_PASSWORD=changeme

# Version of Elastic products
STACK_VERSION=8.3.3

# Set the cluster name
CLUSTER_NAME=docker-cluster

# Set to 'basic' or 'trial' to automatically start the 30-day trial
#LICENSE=basic
LICENSE=trial

# Port to expose Elasticsearch HTTP API to the host
ES_PORT=9200

# Port to expose Kibana to the host
KIBANA_PORT=5601

# Enterprise Search settings
ENTERPRISE_SEARCH_PORT=3002
ENCRYPTION_KEYS=q3t6w9z$C&F)J@McQfTjWnZr4u7x!A%D

# Increase or decrease based on the available host memory (in bytes)
MEM_LIMIT=1073741824

# Project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME=fscrawler
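With the .env file in place, you can optionally verify that the variables are interpolated as expected before starting anything; docker-compose prints the fully resolved configuration:

docker-compose config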
Note that we configured the password of the elastic superuser as changeme. After all the configuration above, our file structure looks like this:
$ pwd
/Users/liuxg/fscrawler
$ tree -aL 3
.
├── .env
├── config
│   └── job_name
│       └── _settings.yaml
├── data
├── docker-compose.yml
└── logs
You can then bring up the whole stack, including FSCrawler:
docker-compose up -d
If you don't want it to run in the background, you can start it like this instead:
docker-compose up
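Once the stack is up, you can check that the elasticsearch, kibana, and fscrawler containers are running, and follow the FSCrawler output (the container name fscrawler is fixed by container_name in the compose file):

docker ps
docker logs -f fscrawler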

Let's log in to Kibana at http://localhost:5601 with the elastic user:
Use the following command to check the newly created indices:
GET _cat/indices

From the output above, we can see the idx and idx_folder indices (named after the name we set in _settings.yaml), which means our documents have been uploaded successfully.
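Since FSCrawler was started with the --rest option, its REST interface is also exposed on port 8080 (FSCRAWLER_PORT). As a sketch, you can check the service status and, assuming a recent 2.10 snapshot (older releases used _upload instead of _document), push an additional document through the REST layer:

curl http://127.0.0.1:8080/fscrawler/
curl -F "file=@doc1.docx" "http://127.0.0.1:8080/fscrawler/_document"

You can also search the idx index directly from Kibana, for example with GET idx/_search.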