Commit Graph

32 Commits

Peter Hedenskog b9456eef6e
Replace intel with sitespeed.io/log (#4381)
2025-01-07 08:53:48 +01:00
Peter Hedenskog 8890a9b256
Update latest eslint and dependencies (#4345) 2024-12-22 15:20:16 +01:00
Peter Hedenskog 3741366d45
Upgrade to eslint/unicorn 54 (#4213) 2024-07-08 08:19:41 +02:00
Peter Hedenskog f85e54941b
Fix broken crawler (#3820) 2023-04-23 05:56:46 +02:00
Peter Hedenskog 631271126f
New plugins structure and esmodule (#3769)
2023-02-25 11:16:58 +01:00
Peter Hedenskog f46a366752
If you set a user agent for Browsertime, also use it for the crawler (#3652) 2022-05-17 05:12:54 +02:00
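Reusing the Browsertime user agent for the crawl is a one-liner in simplecrawler terms. A minimal sketch, with a hypothetical options shape standing in for the real configuration:

```javascript
import Crawler from 'simplecrawler';

// Hypothetical options shape; sitespeed.io's real flag names may differ.
const options = { browser: { userAgent: 'MyAgent/1.0' } };

const crawler = new Crawler('https://example.com/');
if (options.browser.userAgent) {
  // simplecrawler sends this User-Agent header on every request.
  crawler.userAgent = options.browser.userAgent;
}
crawler.start();
```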
Peter Hedenskog 426fb42bca
Tune the cookie handling to handle = in the cookie (#3473)
* fix path
2021-10-08 18:43:36 +02:00
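A minimal sketch of the idea behind the fix above (not the actual sitespeed.io code): split the name=value pair only on the first '=' so the rest of the value survives.

```javascript
// Illustrative sketch: parse a name=value cookie string, splitting only
// on the first '=' so values containing '=' (e.g. base64 tokens) stay intact.
function parseCookie(cookie) {
  const index = cookie.indexOf('=');
  return {
    name: cookie.slice(0, index),
    value: cookie.slice(index + 1) // may itself contain '='
  };
}

console.log(parseCookie('token=YWJjZGU=')); // { name: 'token', value: 'YWJjZGU=' }
```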
dammg ad44d6290d
Allow crawler to also send the configured cookies (#3472)
The crawler should open pages with the same setup in order to get full results. In my case an authentication cookie is needed to open the page properly and see its full content (including crawlable links).
2021-10-07 20:19:00 +02:00
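A rough sketch of forwarding configured cookies to the crawler, assuming simplecrawler's cookie-jar API; the cookie list and its shape are illustrative:

```javascript
import Crawler from 'simplecrawler';

// Hypothetical cookie list, e.g. parsed from a --cookie option.
const cookies = [{ name: 'session', value: 'abc123' }];

const crawler = new Crawler('https://example.com/');
for (const cookie of cookies) {
  // simplecrawler keeps a cookie jar on the instance; name/value is
  // enough for a session-style cookie.
  crawler.cookies.add(cookie.name, cookie.value);
}
crawler.start();
```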
dammg 094f9fda56
Add option for crawler to ignore robots.txt (#3454)
For example, we have an internal test site (a sort of showcase of all our modules) that has a noFollow rule on all its pages. Because of that, the crawler refuses to discover any pages. However, there is an option in the crawler to ignore robots.txt. This is basically my attempt at passing that option through; I currently have it running as a patched version on our site.
2021-09-03 21:16:30 +02:00
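The pass-through described above maps naturally onto simplecrawler's respectRobotsTxt flag. A minimal sketch, with an illustrative option name:

```javascript
import Crawler from 'simplecrawler';

const crawler = new Crawler('https://example.com/');
// simplecrawler respects robots.txt by default; the new option just
// flips this flag (the option name below is illustrative).
const ignoreRobotsTxt = true;
crawler.respectRobotsTxt = !ignoreRobotsTxt;
crawler.start();
```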
Peter Hedenskog caddb34d65
Verify that depth is set when you crawl #2806 (#2807) 2019-11-29 10:10:03 +01:00
Samuli Reijonen b97dce509e Add --crawler.include (#2763) 2019-11-09 21:55:03 +01:00
Ferdinand Holzer 3c5ccc338c Add support for crawler exclude patterns (#2319)
* Add support for excluding patterns from crawling. Resolves #1929

* Make eslint happy, fix error handling issue
2019-02-17 17:38:33 +01:00
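A sketch of exclude patterns built on simplecrawler's addFetchCondition; the patterns themselves are just examples:

```javascript
import Crawler from 'simplecrawler';

// Example exclude patterns; any URL matching one of them is skipped.
const exclude = [/\/logout/i, /\.pdf$/i];

const crawler = new Crawler('https://example.com/');
// addFetchCondition: return false to keep a discovered URL out of the queue.
crawler.addFetchCondition(queueItem => !exclude.some(re => re.test(queueItem.url)));
crawler.start();
```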
Peter Hedenskog 7cc5562204
Remove Bluebird promises and use async/await where we can. (#2205) 2018-11-20 09:14:05 +01:00
Peter Hedenskog da98a06cb6
first go at basic auth for crawl (#1845) 2017-12-05 08:59:11 +01:00
Peter Hedenskog e81be5d689
Feed plugins with messageMaker (#1760) 2017-10-29 09:22:27 +01:00
Tobias Lidskog 3debfec0b4 Format code using the Prettier formatter. (#1677) 2017-07-20 21:24:12 +02:00
soulgalore e5db4be248 info log crawler setup and when we stop 2017-04-12 12:49:43 +02:00
Peter Hedenskog 1e528f65fd set sitespeedio as root name of all loggers (#1545) 2017-03-23 12:21:11 +01:00
Peter Hedenskog e46a7026eb Add log channel names per plugin, thank you @jpvincent (#1544) 2017-03-23 08:57:03 +01:00
Tobias Lidskog 720d3b93c2 Set plugin name by default when loading it 2017-03-13 17:40:29 +01:00
Tobias Lidskog 47dce74074 Upgrade simplecrawler to 1.0.1. 2016-08-27 17:09:23 +02:00
Tobias Lidskog fae6b8ba3d Tag messages with group, based on url or filename. (#1157)
Lay the foundation for grouping data from multiple URLs. Tag all messages originating from a single URL (browsertime.pageSummary, coach.pageSummary, etc.) with a group. Aggregations based on group will be a breaking change, so that will follow in a later changeset.

URLs passed directly on the command line will be tagged with a group based on the domain. When passing URLs via text files, the group will be generated from the file name.
2016-08-25 09:26:26 +02:00
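The grouping rule above can be sketched like this (function names are illustrative, not the actual sitespeed.io helpers):

```javascript
import path from 'node:path';

// URLs given directly on the command line: group by domain.
function groupForUrl(url) {
  return new URL(url).hostname; // e.g. 'www.sitespeed.io'
}

// URLs read from a text file: group by the file name.
function groupForFile(filename) {
  return path.basename(filename); // e.g. 'urls.txt'
}

console.log(groupForUrl('https://www.sitespeed.io/documentation/'));
console.log(groupForFile('/tmp/urls.txt'));
```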
Tobias Lidskog 1779d75693 Fix spinning crawl when using maxPages.
Turns out the 'complete' event wasn't being sent when the parser was explicitly stopped.
2016-05-13 21:55:51 +02:00
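A sketch of the failure mode and the fix (illustrative, not the actual patch): when maxPages is reached and the crawler is stopped explicitly, no 'complete' event arrives, so whoever awaits the crawl spins forever unless the event is raised manually.

```javascript
import Crawler from 'simplecrawler';

const maxPages = 10; // hypothetical value of crawler.maxPages
let fetched = 0;

const crawler = new Crawler('https://example.com/');
crawler.on('fetchcomplete', () => {
  if (++fetched >= maxPages) {
    crawler.stop();
    // An explicit stop() suppresses the normal 'complete' event, so
    // emit it ourselves to unblock listeners waiting on the crawl.
    crawler.emit('complete');
  }
});
crawler.on('complete', () => console.log('crawl finished'));
crawler.start();
```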
Tobias Lidskog 315ae102e1 Implement crawler.maxPages to limit pages in crawl 2016-05-13 18:16:35 +02:00
soulgalore 616dbab278 skip links in HTML comments #896 2016-05-13 08:09:12 +02:00
Tobias Lidskog 06e9933db4 Rename crawler.maxDepth to crawler.depth. 2016-05-10 22:06:29 +02:00
Tobias Lidskog dad7546e95 Filter out non-html pages from crawler. 2016-05-10 22:06:29 +02:00
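One way to sketch that filter, assuming simplecrawler's fetchcomplete event and its stateData.contentType field:

```javascript
import Crawler from 'simplecrawler';

const crawler = new Crawler('https://example.com/');
crawler.on('fetchcomplete', queueItem => {
  // Only pass HTML pages on for testing; images, scripts, PDFs etc. are dropped.
  const contentType = queueItem.stateData.contentType || '';
  if (contentType.includes('text/html')) {
    console.log('page to test:', queueItem.url);
  }
});
crawler.start();
```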
Tobias Lidskog e52b4a8503 Make url crawl much more functional. 2016-05-08 07:32:53 +02:00
Tobias Lidskog 551ef59297 Skip crawl if depth is 0 or 1. 2016-05-08 07:32:53 +02:00
soulgalore 9842b57831 debug log each URL the crawler finds 2016-04-25 14:51:45 +02:00
soulgalore 92503dc909 more logs 2016-04-14 11:47:58 +02:00
Tobias Lidskog c840d65c55 Initial draft of node based crawler. 2016-03-23 00:39:46 +01:00