Peter Hedenskog
b9456eef6e
Replace intel with sitespeed.io/log (#4381)
2025-01-07 08:53:48 +01:00
Peter Hedenskog
8890a9b256
Update latest eslint and dependencies (#4345)
2024-12-22 15:20:16 +01:00
Peter Hedenskog
3741366d45
Upgrade to eslint/unicorn 54 (#4213)
2024-07-08 08:19:41 +02:00
Peter Hedenskog
f85e54941b
Fix broken crawler (#3820)
2023-04-23 05:56:46 +02:00
Peter Hedenskog
631271126f
New plugins structure and esmodule (#3769)
2023-02-25 11:16:58 +01:00
Peter Hedenskog
f46a366752
If you set a user agent for Browsertime, also use it for the crawler (#3652)
2022-05-17 05:12:54 +02:00
Peter Hedenskog
426fb42bca
Tune the cookie handling to handle = in the cookie (#3473)
* fix path
2021-10-08 18:43:36 +02:00
dammg
ad44d6290d
Allow crawler to also send the configured cookies (#3472)
The crawler should open pages with the same setup in order to get full results. In my case, an authentication cookie is needed to properly open the page and see its full content (including crawlable links).
2021-10-07 20:19:00 +02:00
dammg
094f9fda56
Add option for crawler to ignore robots.txt (#3454)
For example, we have an internal test site (a sort of showcase of all our modules) that has a noFollow rule on all its pages, which makes the crawler refuse to discover any pages. However, the crawler itself has an option to ignore robots.txt; this is basically my attempt at passing that option through. I currently run this as a patched version on our site.
2021-09-03 21:16:30 +02:00
Peter Hedenskog
caddb34d65
Verify that depth is set when you crawl #2806 (#2807)
2019-11-29 10:10:03 +01:00
Samuli Reijonen
b97dce509e
Add --crawler.include (#2763)
2019-11-09 21:55:03 +01:00
Ferdinand Holzer
3c5ccc338c
Add support for crawler exclude patterns (#2319)
* Add support for excluding patterns from crawling. Resolves #1929
* Make eslint happy, fix error handling issue
2019-02-17 17:38:33 +01:00
Peter Hedenskog
7cc5562204
Remove Bluebird promises and use async/await where we can. (#2205)
2018-11-20 09:14:05 +01:00
Peter Hedenskog
da98a06cb6
first go at basic auth for crawl (#1845)
2017-12-05 08:59:11 +01:00
Peter Hedenskog
e81be5d689
Feed plugins with messageMaker (#1760)
2017-10-29 09:22:27 +01:00
Tobias Lidskog
3debfec0b4
Format code using the Prettier formatter. (#1677)
2017-07-20 21:24:12 +02:00
soulgalore
e5db4be248
info log crawler setup and when we stop
2017-04-12 12:49:43 +02:00
Peter Hedenskog
1e528f65fd
set sitespeedio as root name of all loggers (#1545)
2017-03-23 12:21:11 +01:00
Peter Hedenskog
e46a7026eb
Add log channel names per plugin, thank you @jpvincent (#1544)
2017-03-23 08:57:03 +01:00
Tobias Lidskog
720d3b93c2
Set plugin name by default when loading it
2017-03-13 17:40:29 +01:00
Tobias Lidskog
47dce74074
Upgrade simplecrawler to 1.0.1.
2016-08-27 17:09:23 +02:00
Tobias Lidskog
fae6b8ba3d
Tag messages with group, based on url or filename. (#1157)
Lay the foundation for grouping data from multiple URLs. Tag all messages originating from a single URL (browsertime.pageSummary, coach.pageSummary, etc.) with a group. Aggregations based on group will be a breaking change, so that will follow in a later changeset.
URLs passed directly on the command line will be tagged with a group based on the domain. When passing URLs via text files, the group will be generated from the file name.
2016-08-25 09:26:26 +02:00
Tobias Lidskog
1779d75693
Fix spinning crawl when using maxPages.
Turns out the 'complete' event wasn't being sent when the parser was explicitly stopped.
2016-05-13 21:55:51 +02:00
Tobias Lidskog
315ae102e1
Implement crawler.maxPages to limit pages in crawl
2016-05-13 18:16:35 +02:00
soulgalore
616dbab278
skip links in HTML comments #896
2016-05-13 08:09:12 +02:00
Tobias Lidskog
06e9933db4
Rename crawler.maxDepth to crawler.depth.
2016-05-10 22:06:29 +02:00
Tobias Lidskog
dad7546e95
Filter out non-html pages from crawler.
2016-05-10 22:06:29 +02:00
Tobias Lidskog
e52b4a8503
Make url crawl much more functional.
2016-05-08 07:32:53 +02:00
Tobias Lidskog
551ef59297
Skip crawl if depth is 0 or 1.
2016-05-08 07:32:53 +02:00
soulgalore
9842b57831
debug log each URL the crawler finds
2016-04-25 14:51:45 +02:00
soulgalore
92503dc909
more logs
2016-04-14 11:47:58 +02:00
Tobias Lidskog
c840d65c55
Initial draft of node based crawler.
2016-03-23 00:39:46 +01:00